The Data Science Project is intended to test and develop your
abilities in the following six areas. For each area, we have provided
some sample questions to ask and consider while completing a data
science project. These questions are intended as a guide for critical
thinking as you read through or conduct data analysis.
1. Know Your Problem
- Problem Definitions: What exactly do you want to
find out? What is the specific question? How, and for whom, does the answer
impact clinical/medical decision-making?
- Metric Selections: What is the best metric/measure
for answering the specific question? What are alternative metrics? What
are the advantages and shortcomings of the best metric?
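One way to make the metric question concrete is a toy sketch (all numbers below are invented for illustration): on an imbalanced screening task, accuracy and sensitivity can tell opposite stories about the same predictions.

```python
# Sketch: comparing candidate metrics on a toy, imbalanced screening task.
# The labels and predictions are made up for illustration only.

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def sensitivity(y_true, y_pred):
    # Recall on the positive (disease) class: TP / (TP + FN).
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn)

# 100 patients, 5 with disease; a degenerate model that calls everyone healthy.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100

print(accuracy(y_true, y_pred))     # 0.95 -- looks excellent
print(sensitivity(y_true, y_pred))  # 0.0  -- misses every case
```

The "best" metric here depends on the clinical cost of a missed case versus a false alarm, which is exactly the trade-off the questions above ask you to articulate.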
2. Know Your Data
- Data Provenance: Where does the data come from? How
and when was it obtained? (Velocity) What assumptions were made in its
acquisition? What aspects of reality were captured well/poorly? How was
the data transformed for sharing and preparation for data analysis? What
information was lost in that transformation?
- Data Characteristics: What scales apply to the
different data features in the dataset? Are they categorical? ordinal?
numerical? What are the underlying distributions of those features
(normal, skewed)? What is the variation in the data features and values?
(Variety) Are there extreme outliers? How do the different features
relate to each other?
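A minimal sketch of the distribution and outlier questions, using only the standard library (the lab values below are hypothetical): a large mean/median gap suggests skew or outliers, and a simple interquartile-range rule can flag extreme values.

```python
import statistics

# Hypothetical lab values; 250.0 is an extreme outlier.
values = [4.1, 4.5, 4.8, 5.0, 5.2, 5.5, 5.9, 6.2, 250.0]

mean = statistics.mean(values)
median = statistics.median(values)
# A large mean/median gap hints at skew or extreme values.
print(mean, median)

# Simple 1.5 * IQR rule for flagging extreme outliers.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
print(outliers)  # [250.0]
```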
- Data Quality: How can you be assured of the data
quality? (Validity) Are there missing features that can be created and
would be useful? Are there missing or noisy data that can be
imputed/corrected? Are there outside data sources that can validate or
expand the set of reliable features?
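As one illustration of the imputation question (a toy sketch; the readings and the choice of median imputation are assumptions, not a recommendation for any particular dataset):

```python
import statistics

# Hypothetical feature with missing readings encoded as None.
readings = [7.2, None, 6.8, 7.5, None, 7.0]

observed = [r for r in readings if r is not None]
fill = statistics.median(observed)  # median is robust to outliers

imputed = [fill if r is None else r for r in readings]
print(imputed)
```

Whether imputation is appropriate at all, and with what statistic, depends on why the data are missing, which is part of the data-provenance questions above.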
3. Know Your Analysis Methods
- Method Details: What statistical or machine
learning data analysis methods are applied to the data? What are some
advantages and shortcomings of those analysis types? What makes them
best suited for the data and the research questions? What assumptions do
these analysis models make, especially about the number of training
samples and features? (Volume) How many parameters do these analysis
models have? What are other alternative analysis techniques and the
rationale for not conducting them?
- Training Procedures: How are data samples stored,
extracted, transformed, and used in data analysis and model training?
What technologies and software are used? How are the training data
samples selected? Are there biases in the training data and can they be
accounted for?
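The sample-selection question can be made concrete with a stratified split (a toy sketch; the records, the 10% prevalence, and the 20% test fraction are all invented): stratifying by label keeps a rare class at the same proportion in train and test, one simple guard against sampling bias.

```python
import random

# Toy records: (features, label); class 1 is rare, as in many clinical sets.
data = [("x", 1)] * 10 + [("x", 0)] * 90

def stratified_split(rows, test_frac=0.2, seed=0):
    """Split so each class keeps the same proportion in train and test."""
    rng = random.Random(seed)
    train, test = [], []
    for label in {r[1] for r in rows}:
        group = [r for r in rows if r[1] == label]
        rng.shuffle(group)
        cut = int(len(group) * test_frac)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test

train, test = stratified_split(data)
# Both splits preserve the ~10% prevalence of the rare class.
print(sum(r[1] for r in test), len(test))  # 2 20
```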
- Evaluation Metrics: What metrics are being used to
evaluate model performance? What are alternative evaluation metrics and
the advantages and shortcomings of the selected ones? Are there biases
in the evaluation? How are hyperparameters to the model optimized? How
does performance change as these parameters vary? How reproducible is
the model performance on a different training set? How does model
performance change with more training data and time?
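One standard way to ask "how reproducible is performance on a different training set?" is k-fold cross-validation. A minimal sketch of the fold bookkeeping (fold count and data size here are arbitrary):

```python
# Sketch: k-fold indices so every sample is used once for evaluation,
# letting you see how performance varies across resamples.

def kfold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs covering every sample once as test."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        start = i * fold_size
        stop = start + fold_size if i < k - 1 else n_samples
        yield indices[:start] + indices[stop:], indices[start:stop]

folds = list(kfold_indices(10, 5))
print(len(folds))  # 5
```

The spread of scores across folds, not just their mean, speaks to the reproducibility questions above; repeating the procedure at several training-set sizes gives a learning curve for the data-and-time question.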
- Predictive Quality: Which attributes of which
data samples are well-predicted by a trained model? Are there common
patterns in the samples with incorrect predictions? Is there a
confidence measure associated with each prediction?
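Looking for common patterns among incorrect predictions can be as simple as counting errors per subgroup (a toy sketch; the ward names, labels, and predictions are all invented):

```python
from collections import Counter

# Records are (subgroup, true_label, predicted_label); all values invented.
results = [
    ("ward_A", 1, 1), ("ward_A", 0, 0), ("ward_A", 1, 1),
    ("ward_B", 1, 0), ("ward_B", 0, 1), ("ward_B", 1, 0),
]

errors = Counter(group for group, t, p in results if t != p)
print(errors)  # errors cluster in ward_B
```

A cluster like this prompts the follow-up questions above: is the model worse for that subgroup, or is the data from it noisier?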
- Significant Features: What features of the model
are most significant for correct predictions? By what metric/method are
these features ranked? What are the assumptions of the feature
prioritization method and considerations to keep in mind when
interpreting the results?
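One model-agnostic way to rank features is permutation importance: shuffle a feature's values and measure how much the score drops. A toy sketch (the "model", data, and threshold are all hypothetical):

```python
import random

def model(row):
    # Toy "model" (hypothetical): positive when feature 0 exceeds 0.5.
    return 1 if row[0] > 0.5 else 0

def score(rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)

rows = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.9], [0.1, 0.3]]
labels = [1, 1, 0, 0]

base = score(rows, labels)
rng = random.Random(0)
drops = []
for j in range(2):
    col = [r[j] for r in rows]
    rng.shuffle(col)
    shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, col)]
    drops.append(base - score(shuffled, labels))
print(drops)  # feature 1 is ignored by the model, so its drop is 0
```

The method's assumptions matter for interpretation, as the questions above note: it ignores feature interactions and can mislead when features are strongly correlated.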
4. Know Your Team
- Target Audiences: Who are the end-users that need
to update their behavior based on this data analysis? What are their
expectations, needs, and technical expertise in relation to this
analysis? What visualizations and reports would they be able to
understand independently that provide them with actionable insight?
5. Know the Significance of Your Results
- Visual Communication: Do the data analysis
visualizations convey that the analysis is correct, important, and urgent to
act on? Is the information displayed in a way that prioritizes greatest
value? What information could be better conveyed with alternative types
of visualizations? What assumptions do the visualization techniques use
to reduce the visual complexity? Do the data visualization tools allow
end-users to follow up with relevant questions on their own? Do the data
exploration tools address confidentiality/security concerns?
- Context and Scope: What are related questions that
are not being addressed in this project? Is this specific scope well
reasoned? What would be required in terms of data (Value) or methods to
better address the related questions and improve decision making?
6. Know Your Future Impact/Applications
- Sustainability of Analysis: What is the time
horizon for the applicability of the insights of the data analysis? Are
there measures in place to update and/or expand the analysis, and to
identify significant changes in the conclusions? Does the project move medical
practice further towards a data-driven culture?