Overview

High-throughput technologies have fueled the growth of genomic, transcriptomic, epigenomic, and proteomic datasets related to healthy, diseased, and perturbed states of cellular tissues. Many challenges and opportunities (2020 viewpoint) remain to be addressed as the insights from these data analyses continue to be integrated into clinical practice. For this project, we have selected a tutorial that mimics the recent KnowEnG publication, which applies various machine learning and graph mining techniques to the downstream analysis of data from The Cancer Genome Atlas (TCGA). The tutorial has two main purposes: to show an approach for clustering cancer patients using data from multiple experimental assays, and to show how prior knowledge of gene relationships can be used to identify the transcripts that relate to specific cancer subtypes. The TCGA data, the KnowEnG knowledge-guided analysis pipelines, and the ideas presented in the tutorial may be useful in guiding potential data science projects.

Tutorial Highlights

Running the Tutorial

To run this tutorial, students will need to copy the data and the Jupyter notebook to their university Google account. This will allow them to view the code and run the multi-step KnowEnG analysis pipeline, which clusters the data, finds differentially expressed transcripts per cancer subtype, and then identifies the related pathways. During the tutorial, students will create their own copy of pre-processed TCGA data with basic clinical annotations. Completing the tutorial involves several steps that together take over an hour to run. The Jupyter notebook and data may be used as a template for any project extensions. Please fill out the [cloud access request form] if you would like Google Cloud Platform access to modify the Jupyter notebooks for larger simulations beyond the capabilities of the free Google Colab accounts.
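The first two pipeline stages described above (clustering patients, then finding the transcripts that distinguish each cluster) can be illustrated with a minimal sketch. This is not the KnowEnG pipeline or the tutorial notebook: it swaps in plain k-means and a simple difference of means, uses only the Python standard library, and runs on a made-up expression matrix. The transcript names (`TX_A`, `TX_B`, `TX_C`), the values, and the function names are all hypothetical.

```python
from statistics import mean


def kmeans(samples, k, n_iter=20):
    """Plain Lloyd's k-means on a list of numeric vectors.

    Centroids start at the first k samples, a toy choice made only to
    keep this sketch deterministic.
    """
    centroids = [list(s) for s in samples[:k]]
    labels = [0] * len(samples)
    for _ in range(n_iter):
        # Assign every sample to its nearest centroid (squared Euclidean).
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(s, centroids[c])))
            for s in samples
        ]
        # Move each centroid to the mean of its members; keep it if empty.
        for c in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == c]
            if members:
                centroids[c] = [mean(col) for col in zip(*members)]
    return labels


def rank_transcripts(samples, labels, cluster, names):
    """Rank transcripts by |mean inside the cluster - mean outside it|."""
    inside = [s for s, lab in zip(samples, labels) if lab == cluster]
    outside = [s for s, lab in zip(samples, labels) if lab != cluster]
    diffs = [mean(a) - mean(b) for a, b in zip(zip(*inside), zip(*outside))]
    return sorted(zip(names, diffs), key=lambda t: abs(t[1]), reverse=True)


# Hypothetical data: rows are patients, columns are transcripts.
transcript_names = ["TX_A", "TX_B", "TX_C"]
expression = [
    [9.0, 1.0, 5.0],
    [1.0, 8.0, 5.2],
    [8.5, 1.5, 4.9],
    [1.2, 9.0, 5.1],
    [9.2, 0.8, 5.0],
    [0.9, 8.4, 4.8],
]

labels = kmeans(expression, k=2)
ranked = rank_transcripts(expression, labels, cluster=0,
                          transcript_names=None or transcript_names) \
    if False else rank_transcripts(expression, labels, 0, transcript_names)

print(labels)
for name, diff in ranked:
    print(f"{name}: {diff:+.2f}")
```

A real analysis would replace the difference of means with a proper statistical test and correct for multiple comparisons; the sketch only shows how the output of the clustering step feeds the per-subtype comparison.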

Project Extensions and Future Directions

There are many ways to extend the tutorial into a data science project. Projects may apply these analysis techniques to explore specific hypotheses about alternative disease or tissue types. A wealth of public -omics datasets is available at several project portals, such as GDC, cBioPortal, CCLE, GEO, dbGaP, and GTEx. Alternatively, projects may investigate the effect of knowledge-guided analysis on the results, or the use of alternative or custom knowledge networks. One could modify or replace the KnowEnG pipeline steps with alternative or extended statistical, machine learning, or network analysis approaches and reanalyze the same TCGA data. Projects may also be designed to build additional visualizations and views of the results that capture important features of the data and knowledge networks. Finally, teams could work to import or publish tools into genomics analysis cloud platforms (such as Galaxy, Seven Bridges Cancer Genomics Cloud (CGC), or Terra) and analyze the cloud-based -omics datasets available there.
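For projects that examine the effect of knowledge-guided analysis, it may help to see one way a gene network can reshape per-gene scores. The sketch below uses random walk with restart, a widely used graph-mining technique for spreading scores over a network. It is an illustrative assumption and not a description of how the KnowEnG pipelines are implemented. The network, the gene names (`GENE_A` to `GENE_D`), the restart probability, and the function name are hypothetical.

```python
def rwr_smooth(adjacency, scores, restart=0.5, n_iter=100):
    """Random walk with restart on an undirected gene network.

    adjacency: dict mapping each gene to the set of its neighbours
               (symmetric, and every gene has at least one neighbour).
    scores:    dict of initial per-gene scores; normalised to sum to 1
               and used as the restart distribution.
    """
    genes = sorted(adjacency)
    total = sum(scores.get(g, 0.0) for g in genes)
    p0 = {g: scores.get(g, 0.0) / total for g in genes}
    p = dict(p0)
    for _ in range(n_iter):
        nxt = {}
        for g in genes:
            # Each neighbour splits its current mass evenly among its edges.
            inflow = sum(p[n] / len(adjacency[n]) for n in adjacency[g])
            nxt[g] = (1 - restart) * inflow + restart * p0[g]
        p = nxt
    return p


# Hypothetical chain network: GENE_A - GENE_B - GENE_C - GENE_D.
network = {
    "GENE_A": {"GENE_B"},
    "GENE_B": {"GENE_A", "GENE_C"},
    "GENE_C": {"GENE_B", "GENE_D"},
    "GENE_D": {"GENE_C"},
}
# Only GENE_A carries signal in the raw (made-up) data.
raw_scores = {"GENE_A": 1.0}

smoothed = rwr_smooth(network, raw_scores)
for gene in sorted(smoothed):
    print(f"{gene}: {smoothed[gene]:.4f}")
```

Genes closer to `GENE_A` in the network receive more of its score, so a downstream ranking or clustering step sees network neighbours as more similar than the raw data alone would suggest. Comparing results with and without such a smoothing step, or with a different network, is one concrete way to frame the comparison suggested above.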

Potential Project Resources