Precision medicine is a rapidly growing field in statistics, machine learning, and clinical practice. It takes individual heterogeneity, environment, and lifestyle into consideration when making decisions to improve human health. Recent advances in large-scale biomedical databases and computational devices have dramatically improved the prospect of optimizing an individual’s treatment. The launch of the Precision Medicine Initiative has sparked much more research into discovering individualized treatments.
Many statistical models have been developed to learn the optimal
treatment decision based on a set of observed data. In this tutorial, we
give an example of the Virtual Twins model, proposed by Foster et al.
(2011). The data analysis demonstration is mainly based on the vignettes
of the aVirtualTwins R package. In addition to running the Virtual Twins
model, we also introduce two machine learning algorithms: random forests
and CART (classification and regression trees).
To run this tutorial, users will need to copy the Jupyter notebook to
their university Google account. This tutorial uses R as the programming
language. If you are not already familiar with R, please refer to the
tutorial given in the Population Health course. Completing the tutorial
involves several steps that together take under an hour. The Jupyter
notebook may be used as a template for any project extensions. Please
fill out the cloud access request form if you would like Google Cloud
Platform access to modify the Jupyter notebooks for larger simulations
beyond the capabilities of the free Google Colab accounts.
The Jupyter notebook (.ipynb file) for the tutorial can be found here.
Alternatively, if you are familiar with RStudio and RMarkdown, you can
use this .rmd file, which generates this report file. It will walk you
through the same step-by-step instructions. The data is available here.
Keep in mind that the RMarkdown version is more suitable for
smaller-scale data analysis (likely within a few hundred MB). For
large-scale problems, consider the Google Cloud option.
The Virtual Twins model has some limitations. First, it requires accurately modeling the outcome variable from the available covariates, so that the treatment decision can be made by comparing the predicted outcomes under each treatment. However, this may not always be possible: in many cancer studies, the number of subjects is much smaller than the number of variables (e.g., genes), and modeling the outcome may be very difficult. Instead, one may only be interested in modeling the treatment decision (which treatment is better), which could be a much simpler model. For example, to decide whether a patient can benefit from immunotherapy, the patient is usually tested for the PD-1/PD-L1 biomarker. Hence a single biomarker can already decide the treatment to a large extent, while predicting a patient’s cancer survival after treatment is much more difficult. Some statistical models have already been developed to directly learn such treatment decisions.
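To make the outcome-comparison step concrete, here is a minimal sketch in R on simulated data. It is not the aVirtualTwins API and not the tutorial's data: the variable names (`x1`, `x2`, `trt`, `y`, `z`) and the simulated treatment effect are assumptions made for illustration, and a linear model with treatment interactions stands in for the random forest so that the example only needs `rpart`, which ships with standard R installations. The idea is to predict each subject's outcome under both treatments, take the difference, and then summarize that difference with a regression tree.

```r
# Minimal Virtual Twins-style sketch on simulated data.
# All names and the data-generating model below are illustrative assumptions.
library(rpart)  # CART implementation

set.seed(1)
n   <- 500
x1  <- rnorm(n)
x2  <- rnorm(n)
trt <- rbinom(n, 1, 0.5)  # randomized binary treatment (1 = treated)

# Assumed truth: larger y is better, and treatment helps only when x1 > 0.
y   <- 1 + 0.5 * x2 + trt * 2 * x1 + rnorm(n, sd = 0.5)
dat <- data.frame(y, trt, x1, x2)

# Step 1: outcome model with treatment-by-covariate interactions.
# (A linear model is used here in place of a random forest.)
fit <- lm(y ~ trt * (x1 + x2), data = dat)

# Step 2: predict each subject's outcome under both treatments (the "twins")
# and take the difference as the estimated individual treatment effect.
twin1 <- dat; twin1$trt <- 1
twin0 <- dat; twin0$trt <- 0
z <- predict(fit, newdata = twin1) - predict(fit, newdata = twin0)

# Step 3: fit a regression tree (CART) to the estimated effect to obtain
# an interpretable rule based on the covariates only.
tree_dat <- data.frame(z = z, x1 = x1, x2 = x2)
tree     <- rpart(z ~ x1 + x2, data = tree_dat, method = "anova")

# Recommend treatment where the tree predicts a positive effect.
rec <- as.integer(predict(tree, newdata = tree_dat) > 0)
print(tree)
table(recommended = rec, truly_beneficial = as.integer(x1 > 0))
```

The sketch also shows why the limitation matters: the recommendation is only as good as the outcome model in step 1, so a poorly fitting outcome model propagates directly into the estimated effect `z` and the resulting tree.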
Another limitation is that the Virtual Twins model can only deal with binary or multi-category treatment labels; it cannot be used to decide a drug dosage. To this end, some models have been proposed.
Some other possibilities