Precision medicine is a rapidly growing field in statistics, machine learning, and clinical practice. It takes individual heterogeneity, environment, and lifestyle into consideration when making decisions to improve human health. Recent advances in large-scale biomedical databases and computational devices have dramatically improved the prospect of optimizing an individual's treatment, and the launch of the Precision Medicine Initiative has sparked much more research on discovering individualized treatment rules.
Many statistical models have been developed to learn the optimal treatment decision from a set of observed data. In this tutorial, we give an example of the Virtual Twins model, proposed by Foster et al. (2011). The data analysis demonstration is based mainly on the vignettes of the aVirtualTwins R package. In addition to running the Virtual Twins model, we also introduce two machine learning algorithms: random forests and CART (classification and regression trees). While reading the related papers and working through this analysis, you may want to consider the critical questions for data analysis related to the DSP Course Competencies.
This .html file is created using R Markdown, a feature implemented in RStudio, which you can download from here. You also need to install R and a few R packages (with code provided below). You should download the corresponding .rmd file provided on our course website and save it to a local folder. Opening that file should automatically launch an RStudio window. We went through these steps during the population health course in your first summer.
The cell below sets up the packages for running the analysis. If you do not already have these packages, remove the # in front of each install.packages() call to install them. If you already have them, only the library() calls are necessary.
#install.packages("randomForest")
#install.packages("rpart")
#install.packages("rpart.plot")
library(randomForest)
library(rpart)
library(rpart.plot)
Personalized medicine draws a lot of attention in medical research. The goal of personalized medicine is to make a tailored decision for each patient so that his or her clinical outcome is optimized.
Let's consider a simulated data set derived from the SIDES package. The data in .csv format can be downloaded from this Google link. The data set contains 470 patients and 14 variables, listed below.
- `Health`: Health outcome (the larger the better)
- `THERAPY`: 1 for the active treatment, 0 for the control treatment
- `TIMFIRST`: Time from first sepsis-organ failure to start of drug
- `AGE`: Patient age in years
- `BLLPLAT`: Baseline local platelets
- `blSOFA`: Sum of baseline SOFA scores (cardiovascular, hematology, hepatorenal, and respiration)
- `BLLCREAT`: Baseline creatinine
- `ORGANNUM`: Number of baseline organ failures
- `PRAPACHE`: Pre-infusion APACHE II score
- `BLGCS`: Baseline Glasgow Coma Scale score
- `BLIL6`: Baseline serum IL-6 concentration
- `BLADL`: Baseline activity of daily living score
- `BLLBILI`: Baseline local bilirubin
- `BEST`: The true best treatment suggested by doctors. You should not use this variable when fitting models!

For each patient, sepsis was observed during their hospital stay, so one of the two treatments (indicated by the variable `THERAPY`) must be chosen to prevent further adverse events. After the treatment, the patient's health outcome (`Health`) was measured, with a larger value being a better outcome. The `BEST` variable is the doctor-suggested best treatment, which would not be observed in practice; it can be regarded as the unknown truth.
Run the cell below to load the data set and display the first few rows. It will take only a few seconds to complete. Make sure your working directory is set up correctly: put the .rmd file and the .csv file in the same folder, then open the .rmd file to launch RStudio.
Sepsis <- read.csv("Sepsis.csv")
# remove the first column, which is observation ID
Sepsis <- Sepsis[, -1]
head(Sepsis)
## Health THERAPY PRAPACHE AGE BLGCS ORGANNUM BLIL6 BLLPLAT BLLBILI
## 1 -3.188260 0 19 42.921 15 1 301.80 191.0000 2.913416
## 2 -3.546312 0 48 68.818 11 2 118.90 264.1565 0.400000
## 3 1.188689 1 20 68.818 15 2 92.80 123.0000 5.116471
## 4 2.693554 1 19 33.174 14 2 1232.00 244.0000 3.142092
## 5 3.007590 0 48 46.532 3 4 2568.00 45.0000 4.052668
## 6 3.870876 1 21 56.098 14 1 162.65 137.0000 0.500000
## BLLCREAT TIMFIRST BLADL blSOFA BEST
## 1 1.000000 17.17 0 5.00 1
## 2 1.100000 17.17 5 10.00 1
## 3 1.000000 10.00 1 7.50 0
## 4 1.200000 17.17 0 6.25 1
## 5 3.000000 10.00 0 12.00 1
## 6 4.662556 10.00 0 8.75 0
We would like to use the data set to predict the best treatment for a new patient. A strategy called Virtual Twins was proposed by Foster et al. (2011) to tackle this problem. Here we consider a simpler version of the method.
We will fit two random forests to model the outcome `Health`: one based on all patients who received treatment 1, and the other based on all patients who received treatment 0. Denote these two models \(\widehat{f}_1(x)\) and \(\widehat{f}_0(x)\), respectively. When a new patient arrives, we use the two models to calculate the expected health status for the new patient and suggest the treatment associated with the model that gives the larger prediction. In other words, for a new \(x^{*}\), we compare \(\widehat{f}_1(x^{*})\) and \(\widehat{f}_0(x^{*})\) and suggest the treatment whose model produces the higher value. Our first step is to tune the `mtry` and `nodesize` parameters of the random forest for this data set. Read the code and comments in the cell below, then run the cell; it will take about seven minutes to complete.
# n_simulations is the number of random simulations we will run
n_simulations <- 100
# n_patients is the number of patients
n_patients <- nrow(Sepsis)
# all_tune is a dataframe with one row for each combination of nodesize and mtry
# that we want to evaluate
# you can find guidance on values for these and other random forest parameters
# in machine learning literature
all_tune <- expand.grid(
"nodesize" = c(1, 5, 10),
"mtry" = c(1, 4, 12))
# error_vals and error_vals_2 are matrices of the error values from each simulation
error_vals <- matrix(NA, nrow(all_tune), n_simulations)
error_vals_2 <- matrix(NA, nrow(all_tune), n_simulations)
# initialize the random seed before starting the simulations for reproducibility
set.seed(1)
# for each simulation...
for (k in 1:n_simulations) {
# create a training set consisting of 75% of the patients and a testing set
# consisting of the remaining 25% of the patients
intrain <- sample(1:n_patients, 0.75*n_patients)
train <- Sepsis[intrain, ]
test <- Sepsis[-intrain, ]
# for each combination of nodesize and mtry...
for (j in 1:nrow(all_tune)) {
# build the model for treatment 0 and the model for treatment 1
# note the following for model0 and model1:
# - BEST is not used in building the model
# - the models are built only from the patients in the training set (not the
# testing test) who received the therapy associated with the model
# - mtry and nodesize come from all_tune
model0 <- randomForest(Health ~ . - BEST, data = train[train$THERAPY == 0, ],
mtry = all_tune$mtry[j], nodesize = all_tune$nodesize[j])
model1 <- randomForest(Health ~ . - BEST, data = train[train$THERAPY == 1, ],
mtry = all_tune$mtry[j], nodesize = all_tune$nodesize[j])
# now use model0 and model1 to predict health outcomes on each of the
# patients in the testing set
test0 <- predict(model0, test)
test1 <- predict(model1, test)
# for each patient in the testing set, recommend whichever treatment has the
# greater score in the model predictions
suggest <- (test1 > test0)
# compare the suggestions to the BEST (i.e., doctor-recommended) treatment
# and record the mean error
# Please note that in a real practical situation, you do not have this label.
# Hence the only judgment available would be how well the model fits the health outcome.
error_vals[j, k] <- mean(suggest != test$BEST)
# When the best treatment is not available, things are a bit trickier.
# We could use the prediction error of health for those who took each treatment in the test data.
# However, this may produce bias if the data are not from a randomized trial.
error_vals_2[j, k] <- mean((ifelse(test$THERAPY, test1, test0) - test$Health)^2)
}
}
# summarize the error for each combination of nodesize and mtry evaluated
cbind(all_tune, "Mean Error" = rowMeans(error_vals))
## nodesize mtry Mean Error
## 1 1 1 0.2486441
## 2 5 1 0.2467797
## 3 10 1 0.2410169
## 4 1 4 0.2183898
## 5 5 4 0.2186441
## 6 10 4 0.2106780
## 7 1 12 0.2171186
## 8 5 12 0.2142373
## 9 10 12 0.2122034
cbind(all_tune, "Mean Error" = rowMeans(error_vals_2))
## nodesize mtry Mean Error
## 1 1 1 4.208186
## 2 5 1 4.206669
## 3 10 1 4.207977
## 4 1 4 4.313722
## 5 5 4 4.299687
## 6 10 4 4.276398
## 7 1 12 4.432126
## 8 5 12 4.422320
## 9 10 12 4.383382
Which values of `nodesize` and `mtry` seem best? What might that say about the underlying truth in the data set?
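Rather than eyeballing the two tables, you can read the winners off programmatically. The short sketch below assumes the `all_tune`, `error_vals`, and `error_vals_2` objects from the tuning cell above are still in memory:

```r
# combination with the lowest mean misclassification rate against BEST
all_tune[which.min(rowMeans(error_vals)), ]

# combination with the lowest mean squared prediction error of Health
all_tune[which.min(rowMeans(error_vals_2)), ]
```

Note that the two criteria need not agree, which is part of what the question above is asking you to think about.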
As a second step, we will construct a single-tree model (CART) to represent the choice of best treatment.
Read the code and comments in the cell below. Run the cell, which will take less than a second to complete.
# build a new model0 and model1 using the optimal mtry and nodesize we found above
# this time, there's no split into training and test sets; all patients for each
# therapy are included, but we still do not use the BEST column
model0 <- randomForest(Health ~ . - BEST, data = Sepsis[Sepsis$THERAPY == 0, ], mtry = 12, nodesize = 10)
model1 <- randomForest(Health ~ . - BEST, data = Sepsis[Sepsis$THERAPY == 1, ], mtry = 12, nodesize = 10)
# calculate the predicted health for all patients according to model0 and model1
pred0 <- predict(model0, Sepsis)
pred1 <- predict(model1, Sepsis)
# determine the suggested treatment according to the predicted health
Sepsis$FitBest <- (pred1 > pred0)
# fit a decision tree to predict FitBest, excluding BEST, Health, and THERAPY
rpart.fit <- rpart(as.factor(FitBest) ~ . - BEST - Health - THERAPY, data = Sepsis)
# in the coming cells, we will prune the tree
# start by plotting the cross-validation relative error at different tree sizes
plotcp(rpart.fit)
Run the cell below, which will take less than a second, to view the plot data in tabular form.
rpart.fit$cptable
## CP nsplit rel error xerror xstd
## 1 0.44247788 0 1.0000000 1.0840708 0.04792001
## 2 0.09292035 1 0.5575221 0.5575221 0.04249200
## 3 0.02654867 2 0.4646018 0.5044248 0.04111687
## 4 0.01991150 4 0.4115044 0.4690265 0.04009094
## 5 0.01769912 6 0.3716814 0.4557522 0.03968207
## 6 0.01000000 7 0.3539823 0.4424779 0.03925939
When pruning, we want to choose a value of `cp` that balances tree complexity against error performance. According to the plotcp documentation, "A good choice of cp for pruning is often the leftmost value for which the mean lies below the horizontal line." What is the intuition behind that guidance? What value of `cp` is suggested by the plot and table above?
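One way to apply that guidance programmatically, rather than reading it off the plot, is sketched below. It assumes the `rpart.fit` object from the cell above is still in memory; the horizontal line drawn by `plotcp` sits at the minimum cross-validation error plus one standard error:

```r
cp_tab <- rpart.fit$cptable
# the plotcp reference line: minimum xerror plus one standard error
cutoff <- min(cp_tab[, "xerror"]) + cp_tab[which.min(cp_tab[, "xerror"]), "xstd"]
# leftmost (smallest) tree whose cross-validation error falls below the line
leftmost <- min(which(cp_tab[, "xerror"] < cutoff))
cp_tab[leftmost, "CP"]
```

Compare the `CP` value this returns with the one you would pick by eye from the table.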
Run the cell below, which will take less than a second, to build and view a pruned tree.
rpart.fit <- rpart(as.factor(FitBest) ~ . - BEST - Health - THERAPY, data = Sepsis, cp = 0.023)
rpart.plot(rpart.fit)
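With the fitted models in hand, recommending a treatment for a new patient is straightforward. The sketch below reuses the first row of `Sepsis` as a stand-in for a newly arrived patient (in practice you would supply fresh covariates), and assumes `model0`, `model1`, and `rpart.fit` from the cells above are still in memory:

```r
# treat the first patient in the data as if newly arrived
new_patient <- Sepsis[1, ]

# the Virtual Twins recommendation: compare the two forests' predictions
f1 <- predict(model1, new_patient)
f0 <- predict(model0, new_patient)
ifelse(f1 > f0, "recommend treatment 1", "recommend treatment 0")

# the pruned tree gives the same style of recommendation,
# but as a single interpretable rule
predict(rpart.fit, new_patient, type = "class")
```

The forest comparison and the tree may occasionally disagree, since the tree is only an approximation of the forest-based rule.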
There are some limitations of the Virtual Twins model. First, it requires accurately modeling the outcome variable using the observed covariates so that the treatment decision can be made by comparing predicted outcomes. This may not always be possible: in many cancer studies, the number of subjects is much smaller than the number of variables (e.g., genes), and modeling the outcome may be very difficult. Instead, one may only be interested in modeling the treatment decision (which treatment is better), which could be a much simpler model. For example, to decide whether a patient can benefit from immunotherapy, the patient is usually tested for the PD-1/PD-L1 biomarker. Hence a single biomarker can already decide the treatment, to a large extent, while predicting the patient's cancer survival after taking the treatment is much more difficult. Some statistical models have already been developed to directly learn such treatment decisions.
Another limitation is that the Virtual Twins model can only handle binary or multiple treatment labels; it cannot be used to decide a drug dosage. To this end, some models have been proposed, such as
Some other possibilities