Overview

Precision medicine is a rapidly growing field in statistics, machine learning, and clinical practice. It takes individual heterogeneity, environment, and lifestyle into consideration when making decisions to improve human health. Recent advances in large-scale biomedical databases and computational devices have dramatically improved the prospect of optimizing an individual’s treatment, and the launch of the Precision Medicine Initiative has sparked further research in discovering individualized treatments.

Many statistical models have been developed to learn the optimal treatment decision from a set of observed data. In this tutorial, we give an example of the Virtual Twins model, proposed by Foster et al. (2011). The data analysis demonstration is based mainly on the vignettes of the aVirtualTwins R package. In addition to running the Virtual Twins model, we also introduce two machine learning algorithms: random forests and CART (classification and regression trees). While reading the related papers and working through this analysis, you may want to consider the critical questions for data analysis related to the DSP Course Competencies.

Using RMarkdown

This .html file is created using R Markdown, a feature implemented in RStudio, which you can download from here. You also need to install R and a few R packages (code provided below). Download the corresponding .rmd file from our course website and save it to a local folder; opening the file should automatically launch an RStudio window. We went through these steps during the population health course in your first summer.

Environment Setup

The cell below sets up the packages for running the analysis. If you do not already have these packages, remove the # in front of each install.packages() call to install them. If you already have them, only the library() calls are necessary.

  #install.packages("randomForest")
  #install.packages("rpart")
  #install.packages("rpart.plot")
  
  library(randomForest)
  library(rpart)
  library(rpart.plot)

Personalized Medicine Example: Sepsis Treatment

Personalized medicine has drawn a lot of attention in medical research. The goal of personalized medicine is to make a tailored decision for each patient so that his or her clinical outcome is optimized.

Let’s consider a simulated data set derived from the SIDES package. The data in .csv format can be downloaded from this google link. The data set contains 470 patients and 14 variables, which are listed below.

For each patient, sepsis was observed during the hospital stay, and one of two treatments (indicated by the variable THERAPY) must be chosen to prevent further adverse events. After the treatment, the patient’s health outcome (Health) was measured, with a larger value indicating a better outcome. The BEST variable is the doctor-suggested best treatment, which would not be observed in practice; it can be regarded as the unknown truth.

Run the cell below to load the data set and display the first few rows. It will take only a few seconds to complete. Make sure that your working directory is set up correctly. You can do this by putting your .rmd file and the .csv file in the same folder, then opening the .rmd file to launch RStudio.

  Sepsis <- read.csv("Sepsis.csv")
  # remove the first column, which is observation ID
  Sepsis = Sepsis[, -1]
  head(Sepsis)
##      Health THERAPY PRAPACHE    AGE BLGCS ORGANNUM   BLIL6  BLLPLAT  BLLBILI
## 1 -3.188260       0       19 42.921    15        1  301.80 191.0000 2.913416
## 2 -3.546312       0       48 68.818    11        2  118.90 264.1565 0.400000
## 3  1.188689       1       20 68.818    15        2   92.80 123.0000 5.116471
## 4  2.693554       1       19 33.174    14        2 1232.00 244.0000 3.142092
## 5  3.007590       0       48 46.532     3        4 2568.00  45.0000 4.052668
## 6  3.870876       1       21 56.098    14        1  162.65 137.0000 0.500000
##   BLLCREAT TIMFIRST BLADL blSOFA BEST
## 1 1.000000    17.17     0   5.00    1
## 2 1.100000    17.17     5  10.00    1
## 3 1.000000    10.00     1   7.50    0
## 4 1.200000    17.17     0   6.25    1
## 5 3.000000    10.00     0  12.00    1
## 6 4.662556    10.00     0   8.75    0

Virtual Twins: Introduction and Tuning

We would like to use the data set to predict the best treatment for a new patient. A strategy called Virtual Twins was proposed by Foster et al. (2011) to tackle this problem. Here we consider a simpler version of the method.

We will fit two random forests to model the outcome Health: one based on all patients who received treatment 1, and the other based on all patients who received treatment 0. Denote these two models \(\widehat{f}_1(x)\) and \(\widehat{f}_0(x)\), respectively. When a new patient arrives, we use both models to estimate the patient's expected health status and suggest the treatment associated with the larger prediction. In other words, for a new \(x^{*}\), we compare \(\widehat{f}_1(x^{*})\) and \(\widehat{f}_0(x^{*})\) and suggest the treatment whose model produces the higher value. Our first step is to tune the mtry and nodesize parameters of the random forests for this data set. Read the code and comments in the cell below. Run the cell, which will take about seven minutes to complete.

  # n_simulations is the number of random simulations we will run
  n_simulations <- 100
  
  # n_patients is the number of patients
  n_patients <- nrow(Sepsis)
  
  # all_tune is a dataframe with one row for each combination of nodesize and mtry
  # that we want to evaluate
  # you can find guidance on values for these and other random forest parameters
  # in machine learning literature
  all_tune <- expand.grid(
              "nodesize" = c(1, 5, 10),
              "mtry" = c(1, 4, 12))
  
  # error_vals and error_vals_2 are matrices of the error values from each simulation 
  error_vals <- matrix(NA, nrow(all_tune), n_simulations)
  error_vals_2 <- matrix(NA, nrow(all_tune), n_simulations)
  
  # initialize the random seed before starting the simulations for reproducibility
  set.seed(1)
  
  # for each simulation...
  for (k in 1:n_simulations) {
    # create a training set consisting of 75% of the patients and a testing set
    # consisting of the remaining 25% of the patients
    intrain <- sample(1:n_patients, 0.75*n_patients)
  
    train <- Sepsis[intrain, ]
    test <- Sepsis[-intrain, ]
  
    # for each combination of nodesize and mtry...
    for (j in 1:nrow(all_tune)) {
      # build the model for treatment 0 and the model for treatment 1
      # note the following for model0 and model1:
      # - BEST is not used in building the model
      # - the models are built only from the patients in the training set (not the
    #   testing set) who received the therapy associated with the model
    # - mtry and nodesize come from all_tune
      model0 <- randomForest(Health ~ . - BEST, data = train[train$THERAPY == 0, ], 
                             mtry = all_tune$mtry[j], nodesize = all_tune$nodesize[j])
      
      model1 <- randomForest(Health ~ . - BEST, data = train[train$THERAPY == 1, ], 
                             mtry = all_tune$mtry[j], nodesize = all_tune$nodesize[j])
  
      # now use model0 and model1 to predict health outcomes on each of the
      # patients in the testing set
      test0 <- predict(model0, test)
      test1 <- predict(model1, test)
  
      # for each patient in the testing set, recommend whichever treatment has the
      # greater score in the model predictions
      suggest <- (test1 > test0)
      
      # compare the suggestions to the BEST (i.e., doctor-recommended) treatment
      # and record the mean error
    # Note that in a real application you would not have this label;
    # the only judgment available would be how well the model fits the health outcome.
      error_vals[j, k] <- mean(suggest != test$BEST)
      
    # When the best treatment is not available, things are a bit trickier.
    # We could instead use the prediction error of Health for the treatment each
    # test patient actually received; however, this may be biased if the data do
    # not come from a randomized trial.
      
      error_vals_2[j, k] <- mean((ifelse(test$THERAPY, test1, test0) - test$Health)^2)
    }
  }
  
  # summarize the error for each combination of nodesize and mtry evaluated
  cbind(all_tune, "Mean Error" = rowMeans(error_vals))
##   nodesize mtry Mean Error
## 1        1    1  0.2486441
## 2        5    1  0.2467797
## 3       10    1  0.2410169
## 4        1    4  0.2183898
## 5        5    4  0.2186441
## 6       10    4  0.2106780
## 7        1   12  0.2171186
## 8        5   12  0.2142373
## 9       10   12  0.2122034
  cbind(all_tune, "Mean Error" = rowMeans(error_vals_2))
##   nodesize mtry Mean Error
## 1        1    1   4.208186
## 2        5    1   4.206669
## 3       10    1   4.207977
## 4        1    4   4.313722
## 5        5    4   4.299687
## 6       10    4   4.276398
## 7        1   12   4.432126
## 8        5   12   4.422320
## 9       10   12   4.383382
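The tuning loop above wraps the core Virtual Twins rule in repeated train/test splits. Stripped of that machinery, the rule itself is just two per-arm fits and a comparison. The sketch below illustrates it on simulated toy data, with linear models standing in for the random forests so that it runs instantly; all variable names here are made up for illustration.

```r
set.seed(1)

# toy data: the better arm depends on x, so the optimal treatment varies by patient
n <- 200
x <- runif(n)
trt <- rbinom(n, 1, 0.5)
health <- ifelse(trt == 1, x, 1 - x) + rnorm(n, sd = 0.1)
dat <- data.frame(health, trt, x)

# fit one outcome model per treatment arm
f1 <- lm(health ~ x, data = dat[dat$trt == 1, ])
f0 <- lm(health ~ x, data = dat[dat$trt == 0, ])

# for a new patient, recommend the arm whose model predicts the better outcome
new_patient <- data.frame(x = 0.9)
suggest <- as.integer(predict(f1, new_patient) > predict(f0, new_patient))
suggest
```

Here a patient with x = 0.9 is steered to treatment 1, because \(\widehat f_1\) predicts the higher outcome for that covariate profile; the random forest version above follows exactly the same comparison.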

Interpretation

Which values of nodesize and mtry seem best? What might that say about the underlying truth in the data set?

Single-Tree Model

As a second step, we will construct a single-tree model (CART) to represent the choice of best treatment.

Read the code and comments in the cell below. Run the cell, which will take less than a second to complete.

  # build a new model0 and model1 using the mtry and nodesize values selected from the tuning above
  # this time, there's no split into training and test sets; all patients for each
  # therapy are included, but we still do not use the BEST column
  model0 <- randomForest(Health ~ . - BEST, data =  Sepsis[Sepsis$THERAPY == 0, ], mtry = 12, nodesize = 10)
  model1 <- randomForest(Health ~ . - BEST, data =  Sepsis[Sepsis$THERAPY == 1, ], mtry = 12, nodesize = 10)
  
  # calculate the predicted health for all patients according to model0 and model1
  pred0 <- predict(model0, Sepsis)
  pred1 <- predict(model1, Sepsis)
  
  # determine the suggested treatment according to the predicted health
  Sepsis$FitBest <- (pred1 > pred0)
  
  # fit a decision tree to predict FitBest, excluding BEST, Health, and THERAPY
  rpart.fit <- rpart(as.factor(FitBest) ~ . - BEST - Health - THERAPY, data = Sepsis)
  
  # in the coming cells, we will prune the tree
  # start by plotting the cross-validation relative error at different tree sizes
  plotcp(rpart.fit)

Run the cell below, which will take less than a second, to view the plot data in tabular form.

  rpart.fit$cptable
##           CP nsplit rel error    xerror       xstd
## 1 0.44247788      0 1.0000000 1.0840708 0.04792001
## 2 0.09292035      1 0.5575221 0.5575221 0.04249200
## 3 0.02654867      2 0.4646018 0.5044248 0.04111687
## 4 0.01991150      4 0.4115044 0.4690265 0.04009094
## 5 0.01769912      6 0.3716814 0.4557522 0.03968207
## 6 0.01000000      7 0.3539823 0.4424779 0.03925939

When pruning, we want to choose a value of cp that balances the tree complexity against the error performance. According to the plotcp documentation, “A good choice of cp for pruning is often the leftmost value for which the mean lies below the horizontal line.” What is the intuition behind that guidance? What value of cp is suggested by the plot and table above?
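That guidance (often called the one-standard-error rule) can also be applied programmatically. The sketch below works from the cptable values printed above: the horizontal line in plotcp() sits one standard error above the minimum cross-validated error, and we take the simplest tree that falls below it.

```r
# cptable values copied from the output above
cptab <- data.frame(
  CP     = c(0.44247788, 0.09292035, 0.02654867, 0.01991150, 0.01769912, 0.01000000),
  nsplit = c(0, 1, 2, 4, 6, 7),
  xerror = c(1.0840708, 0.5575221, 0.5044248, 0.4690265, 0.4557522, 0.4424779),
  xstd   = c(0.04792001, 0.04249200, 0.04111687, 0.04009094, 0.03968207, 0.03925939)
)

# one standard error above the minimum cross-validated error
threshold <- min(cptab$xerror) + cptab$xstd[which.min(cptab$xerror)]

# leftmost (simplest) tree whose cross-validated error falls below the line
best_row <- min(which(cptab$xerror < threshold))
cptab[best_row, ]
```

Any cp between this row's CP and the next larger CP prunes to the same tree.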

Run the cell below, which will take less than a second, to build and view a pruned tree.

  rpart.fit <- rpart(as.factor(FitBest) ~ . - BEST - Health - THERAPY, data = Sepsis, cp = 0.023)

  rpart.plot(rpart.fit)
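Once pruned, the tree can be used directly to recommend a treatment for new patients via predict(..., type = "class"). Since the fitted tree above depends on your session, the sketch below builds a small self-contained example on simulated data; the variable names and splitting rule are made up for illustration.

```r
library(rpart)

set.seed(1)
# simulated stand-in for the fitted best-treatment labels
n <- 300
age <- runif(n, 20, 80)
sofa <- runif(n, 0, 15)
fit_best <- factor(as.integer(age < 50 & sofa > 5))
toy <- data.frame(fit_best, age, sofa)

tree <- rpart(fit_best ~ age + sofa, data = toy, cp = 0.023)

# recommended treatment class for two hypothetical new patients
new_patients <- data.frame(age = c(35, 70), sofa = c(9, 2))
predict(tree, new_patients, type = "class")
```

The same call on the Sepsis tree, e.g. predict(rpart.fit, newdata, type = "class"), returns the suggested treatment label for each row of newdata.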

Project Extensions and Future Directions

There are some limitations of the Virtual Twins model. First, it requires accurately modeling the outcome variable from the observed covariates so that the treatment decision can be made by comparing predicted outcomes. However, this may not always be possible: in many cancer studies, the number of subjects is much smaller than the number of variables (e.g., genes), and modeling the outcome may be very difficult. Instead, one may only be interested in modeling the treatment decision (which treatment is better), which could be a much simpler model. For example, to decide whether a patient can benefit from immunotherapy, the patient is usually tested for the PD-1/PD-L1 biomarker. Hence a single biomarker can largely decide the treatment, while predicting the patient's cancer survival after taking the treatment is much more difficult. Some statistical models have already been developed to directly learn such treatment decisions.

Another limitation is that the Virtual Twins model can only handle binary or multiple (categorical) treatment labels; it cannot be used to decide a drug dosage. To this end, other models have been proposed.

Some other possibilities include:

  • Subgroup analysis models provided on Biopharmnet.
  • Dynamic treatment regimes, which consist of a sequence of decision rules, one per stage of intervention, that dictate how to individualize treatment based on a patient's evolving treatment and covariate history.