Objectives
The main objective of this phase is to prepare the Phase I report,
which will serve as the dataset description and summary in your final
report.
- Explore the template projects
- Explore cloud computing resources
- Find your teammate(s)
- Find a publicly available dataset and perform data processing and
summarization
- Each student submits the [Student Team Agreement] form
- Contact two mentors and set up a meeting with them to discuss your
data and research goal
- Write and submit the Phase I report as a team through the [Report Submission
form]
- Follow up with mentors and have each submit the [Mentor Confirmation
form]
After submitting the report, you may also need to:
- Integrate feedback on your Phase I report
- Meet with course instructors if you need help finding mentors
Timeline and Key Dates
- Attend the Kickoff Meeting on Mar 14, 2022
- [5 Points] Each student submits the [Student Team Agreement]
form by Apr 8, 2022
- [20 Points] Each team submits the Phase I Report by
May 6, 2022
- [Required] Ask your mentors to submit the [Mentor Confirmation form]
by May 20, 2022
- You should have two mentors, and at least one must take on at least
a guidance role for your project
- This contact mentor’s evaluations will weigh more in your final
grade
Phase I Report Contents
The Phase I Report mainly consists of describing the provenance,
characteristics, and quality of dataset(s) that will be the likely focus
of your project. The report should be well organized and answer
questions about the suitability of the dataset for a clinically relevant
data analysis project. Besides describing any data use
agreements/requirements, data processing tools, and data summary
statistics, the report should attempt to briefly answer some of the
following questions:
- Data Provenance: Where does the data come from? How
and when was it obtained (Velocity)? What assumptions were made in its
acquisition? What aspects of reality were captured well/poorly? How was
the data transformed for sharing and preparation for data analysis? What
information was lost in that transformation?
- Data Characteristics: What scales apply to the
different data features in the dataset? Are they categorical, ordinal, or
numerical? What are the underlying distributions of those features
(normal, skewed)? What is the variation in the data features and values
(Variety)? Are there extreme outliers? How do the different features
relate to each other?
- Data Quality: How can you be assured of the data
quality (Validity)? Are there missing features that can be created and
would be useful? Are there missing or noisy data that can be
imputed/corrected? Are there outside data sources that can validate or
expand the set of reliable features?
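
As a starting point for the Data Characteristics and Data Quality questions above, the following is a minimal sketch of a per-feature summary using only the Python standard library. The function name `summarize_numeric`, the `length_of_stay` values, and the use of `None` to mark missing entries are all illustrative assumptions, not part of the course materials. The 1.5 x IQR outlier rule is one common convention among several, and the mean-minus-median difference is only a rough hint at skew.

```python
import statistics


def summarize_numeric(values):
    """Summarize one numeric feature; None marks a missing value (assumption)."""
    present = sorted(v for v in values if v is not None)
    missing = len(values) - len(present)

    # Quartiles from the standard library (default "exclusive" method).
    q1, median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1

    # Tukey's rule: flag values more than 1.5 * IQR beyond the quartiles.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in present if v < low or v > high]

    mean = statistics.mean(present)
    return {
        "n": len(values),
        "missing": missing,
        "mean": mean,
        "median": median,
        "stdev": statistics.stdev(present),
        # A mean well above the median hints at right skew, below at left skew.
        "mean_minus_median": mean - median,
        "outliers": outliers,
    }


# Hypothetical feature: length of stay in days, with two missing entries.
length_of_stay = [2, 3, 3, 4, None, 5, 4, 3, 41, None, 2, 6]
summary = summarize_numeric(length_of_stay)
for key, value in summary.items():
    print(f"{key}: {value}")
```

A table of such summaries, one row per feature, is one way to report missingness, spread, and outliers in the Phase I report; categorical and ordinal features would need frequency counts instead.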