The Data Science Project is intended to test and develop your
abilities in the following six areas. For each area, we have provided
some sample questions to ask and consider while completing a data
science project. These questions are intended as a guide for critical
thinking as you read through or conduct data analysis.
1. Know Your Problem
- Problem Definitions: What exactly do you want to
find out? What is the specific question? How, and for whom, does the answer
impact clinical/medical decision-making?
- Metric Selections: What is the best metric/measure
for answering the specific question? What are alternative metrics? What
are the advantages and shortcomings of the best metric?
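One way to make the metric question concrete is a toy sketch (all numbers below are invented for illustration): on an imbalanced screening task, accuracy and sensitivity can tell opposite stories about the same predictions.

```python
# Sketch: comparing candidate metrics on a toy, imbalanced screening task.
# The labels and predictions are made up for illustration only.

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def sensitivity(y_true, y_pred):
    # Recall on the positive (disease) class: TP / (TP + FN).
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn)

# 100 patients, 5 with disease; a degenerate model that calls everyone healthy.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100

print(accuracy(y_true, y_pred))     # 0.95 -- looks excellent
print(sensitivity(y_true, y_pred))  # 0.0  -- misses every case
```

The "best" metric here depends on the clinical cost of a missed case versus a false alarm, which is exactly the trade-off the questions above ask you to articulate.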
2. Know Your Data
- Data Provenance: Where does the data come from? How
and when was it obtained? (Velocity) What assumptions were made in its
acquisition? What aspects of reality were captured well/poorly? How was
the data transformed for sharing and preparation for data analysis? What
information was lost in that transformation?
- Data Characteristics: What scales apply to the
different data features in the dataset? Are they categorical? ordinal?
numerical? What are the underlying distributions of those features
(normal, skewed)? What is the variation in the data features and values?
(Variety) Are there extreme outliers? How do the different features
relate to each other?
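A minimal sketch of the distribution and outlier questions, using only the standard library (the lab values below are hypothetical): a large mean/median gap suggests skew or outliers, and a simple interquartile-range rule can flag extreme values.

```python
import statistics

# Hypothetical lab values; 250.0 is an extreme outlier.
values = [4.1, 4.5, 4.8, 5.0, 5.2, 5.5, 5.9, 6.2, 250.0]

mean = statistics.mean(values)
median = statistics.median(values)
# A large mean/median gap hints at skew or extreme values.
print(mean, median)

# Simple 1.5 * IQR rule for flagging extreme outliers.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
print(outliers)  # [250.0]
```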
- Data Quality: How can you be assured of the data
quality? (Validity) Are there missing features that can be created and
would be useful? Are there missing or noisy data that can be
imputed/corrected? Are there outside data sources that can validate or
expand the set of reliable features?
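As one illustration of the imputation question (a toy sketch; the readings and the choice of median imputation are assumptions, not a recommendation for any particular dataset):

```python
import statistics

# Hypothetical feature with missing readings encoded as None.
readings = [7.2, None, 6.8, 7.5, None, 7.0]

observed = [r for r in readings if r is not None]
fill = statistics.median(observed)  # median is robust to outliers

imputed = [fill if r is None else r for r in readings]
print(imputed)
```

Whether imputation is appropriate at all, and with what statistic, depends on why the data are missing, which is part of the data-provenance questions above.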
3. Know Your Analysis Methods
- Method Details: What statistical or machine
learning data analysis methods are applied to the data? What are some
advantages and shortcomings of those analysis types? What makes them
best suited for the data and the research questions? What assumptions do
these analysis models make, especially about the number of training
samples and features? (Volume) How many parameters do these analysis
models have? What are other alternative analysis techniques and the
rationale for not conducting them?
- Training Procedures: How are data samples stored,
extracted, transformed, and used in data analysis and model training?
What technologies and software are used? How are the training data
samples selected? Are there biases in the training data and can they be
accounted for?
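The sample-selection question can be made concrete with a stratified split (a toy sketch; the records, the 10% prevalence, and the 20% test fraction are all invented): stratifying by label keeps a rare class at the same proportion in train and test, one simple guard against sampling bias.

```python
import random

# Toy records: (features, label); class 1 is rare, as in many clinical sets.
data = [("x", 1)] * 10 + [("x", 0)] * 90

def stratified_split(rows, test_frac=0.2, seed=0):
    """Split so each class keeps the same proportion in train and test."""
    rng = random.Random(seed)
    train, test = [], []
    for label in {r[1] for r in rows}:
        group = [r for r in rows if r[1] == label]
        rng.shuffle(group)
        cut = int(len(group) * test_frac)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test

train, test = stratified_split(data)
# Both splits preserve the ~10% prevalence of the rare class.
print(sum(r[1] for r in test), len(test))  # 2 20
```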
- Evaluation Metrics: What metrics are being used to
evaluate model performance? What are alternative evaluation metrics and
the advantages and shortcomings of the selected ones? Are there biases
in the evaluation? How are hyperparameters to the model optimized? How
does performance change as these parameters vary? How reproducible is
the model performance on a different training set? How does model
performance change with more training data and time?
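One standard way to ask "how reproducible is performance on a different training set?" is k-fold cross-validation. A minimal sketch of the fold bookkeeping (fold count and data size here are arbitrary):

```python
# Sketch: k-fold indices so every sample is used once for evaluation,
# letting you see how performance varies across resamples.

def kfold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs covering every sample once as test."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        start = i * fold_size
        stop = start + fold_size if i < k - 1 else n_samples
        yield indices[:start] + indices[stop:], indices[start:stop]

folds = list(kfold_indices(10, 5))
print(len(folds))  # 5
```

The spread of scores across folds, not just their mean, speaks to the reproducibility questions above; repeating the procedure at several training-set sizes gives a learning curve for the data-and-time question.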
- Predictive Quality: Which attributes of which
data samples are well-predicted by a trained model? Are there common
patterns in the samples with incorrect predictions? Is there a
confidence measure associated with each prediction?
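Looking for common patterns among incorrect predictions can be as simple as counting errors per subgroup (a toy sketch; the ward names, labels, and predictions are all invented):

```python
from collections import Counter

# Records are (subgroup, true_label, predicted_label); all values invented.
results = [
    ("ward_A", 1, 1), ("ward_A", 0, 0), ("ward_A", 1, 1),
    ("ward_B", 1, 0), ("ward_B", 0, 1), ("ward_B", 1, 0),
]

errors = Counter(group for group, t, p in results if t != p)
print(errors)  # errors cluster in ward_B
```

A cluster like this prompts the follow-up questions above: is the model worse for that subgroup, or is the data from it noisier?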
- Significant Features: What features of the model
are most significant for correct predictions? By what metric/method are
these features ranked? What are the assumptions of the feature
prioritization method and considerations to keep in mind when
interpreting the results?
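One model-agnostic way to rank features is permutation importance: shuffle a feature's values and measure how much the score drops. A toy sketch (the "model", data, and threshold are all hypothetical):

```python
import random

def model(row):
    # Toy "model" (hypothetical): positive when feature 0 exceeds 0.5.
    return 1 if row[0] > 0.5 else 0

def score(rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)

rows = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.9], [0.1, 0.3]]
labels = [1, 1, 0, 0]

base = score(rows, labels)
rng = random.Random(0)
drops = []
for j in range(2):
    col = [r[j] for r in rows]
    rng.shuffle(col)
    shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, col)]
    drops.append(base - score(shuffled, labels))
print(drops)  # feature 1 is ignored by the model, so its drop is 0
```

The method's assumptions matter for interpretation, as the questions above note: it ignores feature interactions and can mislead when features are strongly correlated.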
4. Know Your Team
- Target Audiences: Who are the end-users that need
to update their behavior based on this data analysis? What are their
expectations, needs, and technical expertise in relation to this
analysis? What visualizations and reports would they be able to
understand independently that provide them with actionable insight?
5. Know the Significance of Your Results
- Visual Communication: Do the data analysis
visualizations convey that the analysis is correct, important, and urgent to
act on? Is the information displayed in a way that prioritizes greatest
value? What information could be better conveyed with alternative types
of visualizations? What assumptions do the visualization techniques use
to reduce the visual complexity? Do the data visualization tools allow
end-users to follow up with relevant questions on their own? Do the data
exploration tools address confidentiality/security concerns?
- Context and Scope: What are related questions that
are not being addressed in this project? Is this specific scope well
reasoned? What would be required in terms of data (Value) or methods to
better address the related questions and improve decision making?
6. Know Your Future Impact/Applications
- Sustainability of Analysis: What is the time
horizon for the applicability of the insights of the data analysis? Are
there measures in place to update and/or expand the analysis, and to
identify significant changes in the conclusions? Does the project move medical
practice further towards a data-driven culture?