Overview

Precision medicine is a rapidly growing field spanning statistics, machine learning, and clinical practice. It takes individual heterogeneity, environment, and lifestyle into account when making decisions to improve human health. Recent advances in large-scale biomedical databases and computing have dramatically improved the prospects for optimizing an individual’s treatment. The launch of the Precision Medicine Initiative has sparked much more research into individualized treatment.

Many statistical models have been developed to learn the optimal treatment decision from a set of observed data. In this tutorial, we give an example of the Virtual Twins model proposed by Foster et al. (2011). The data analysis demonstration is based mainly on the vignettes of the aVirtualTwins R package. In addition to running the Virtual Twins model, we also introduce two machine learning algorithms: random forests and CART (classification and regression trees).
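As a rough illustration of the two-step idea behind Virtual Twins, the sketch below first fits a random forest for the outcome, predicts each subject's outcome under both treatments, and then fits a CART tree to the predicted difference. It is a simplified sketch on simulated data, not the aVirtualTwins workflow used in the tutorial: the variable names (x1, x2, trt, y, z), the simulated subgroup effect, and the tuning values are all assumptions made for illustration, and it assumes the randomForest and rpart packages are installed.

```r
# Simplified sketch of the two-step Virtual Twins idea on simulated data.
# Assumes the randomForest and rpart packages are installed.
library(randomForest)
library(rpart)

set.seed(1)
n   <- 500
x1  <- rnorm(n)
x2  <- rnorm(n)
trt <- as.numeric(rbinom(n, 1, 0.5))   # randomized binary treatment (0/1)

# Hypothetical truth: treatment helps only when x1 > 0
p   <- plogis(-0.5 + 0.5 * x2 + 1.5 * trt * (x1 > 0))
dat <- data.frame(x1, x2, trt, y = factor(rbinom(n, 1, p)))

# Step 1: random forest for the outcome given covariates and treatment
rf <- randomForest(y ~ x1 + x2 + trt, data = dat, ntree = 500)

# Predict every subject's response probability under both treatments
# (the "virtual twins"), then take the difference
twin1 <- transform(dat, trt = 1)
twin0 <- transform(dat, trt = 0)
p1 <- predict(rf, newdata = twin1, type = "prob")[, "1"]
p0 <- predict(rf, newdata = twin0, type = "prob")[, "1"]
dat$z <- p1 - p0                       # estimated individual treatment effect

# Step 2: CART on the estimated effect to describe who benefits
tree <- rpart(z ~ x1 + x2, data = dat, method = "anova",
              control = rpart.control(maxdepth = 2))
print(tree)
```

For simplicity, this sketch predicts on the training data directly, so the predictions for each subject's observed treatment arm are optimistic; a more careful version would use out-of-bag predictions for that arm. The printed tree shows which covariate splits separate subjects with larger estimated benefit from the rest.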

Tutorial Highlights

Running the Tutorial

In order to run this tutorial, users will need to copy the Jupyter notebook to their university Google account. This tutorial uses R as the programming language. If you are not already familiar with R, please refer to the tutorial given in the Population Health course. Completing the tutorial will involve several steps that together take under an hour to run and finish. The Jupyter notebook may be used as a template for any project extensions. Please fill out the [cloud access request form] if you would like Google Cloud Platform access to modify the Jupyter notebooks for larger simulations beyond the capabilities of the free Google Colab accounts.

Alternatively, if you are familiar with RStudio and RMarkdown, you can use this .rmd file (link), which generates this report file. It will walk you through the same step-by-step instructions. The data is available here. Keep in mind that this version is more suitable for smaller-scale data analysis (likely within a few hundred MB). For large-scale problems, consider the Google Cloud option.

Project Extensions and Future Directions

The Virtual Twins model has some limitations. First, it requires accurately modeling the outcome variable using the available covariates, so that the treatment decision can be made by comparing the predicted outcomes. However, this may not always be possible: in many cancer studies, the number of subjects is much smaller than the number of variables (e.g. genes), and modeling the outcome may be very difficult. Instead, one may be interested only in modeling the treatment decision (which treatment is better), which could be a much simpler model. For example, to decide whether a patient can benefit from immunotherapy, the patient is usually tested for the PD-1/PD-L1 biomarker. Hence a single biomarker can already decide the treatment to a large extent, while predicting a patient’s cancer survival after taking the treatment is much more difficult. Some statistical models have already been developed to learn such treatment decisions directly.

Another limitation is that the Virtual Twins model can handle only binary or multi-category treatment labels; it cannot be used to choose a drug dosage. To this end, some models have been proposed such as

Some other possibilities

Potential Project Resources