Electronic Medical Records Data Science Project

Overview
Tutorial Highlights
Running the Tutorial
Project Extensions and Future Directions
Potential Project Resources
- Tutorial Related
- University Related

Overview

We have extended the DoctorAI codebase (see the references below) with scripts that help you unpack the data from the MIMIC download, map the raw ICD-9 codes to CCS code categories, train and test the prediction model, and translate the CCS codes from the numeric representations used for training and testing back into text.
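The code-handling steps described above (ICD-9 to CCS, CCS to integer indices for the model, and indices back to text) can be sketched as follows. This is a minimal illustration, not the tutorial's own scripts: the names `ICD9_TO_CCS`, `CCS_LABELS`, `icd9_visits_to_ccs`, and `build_vocab` are hypothetical, and the four mapping entries are a hand-picked sample that should be checked against the official CCS mapping file before any real use.

```python
# Illustrative sample only: a real run would load the full ICD-9 -> CCS
# mapping from the official CCS file rather than hard-code four entries.
ICD9_TO_CCS = {
    "4019": "98",    # hypertension
    "25000": "49",   # diabetes without complication
    "4280": "108",   # congestive heart failure
    "5849": "157",   # acute renal failure
}
CCS_LABELS = {
    "98": "Essential hypertension",
    "49": "Diabetes mellitus without complication",
    "108": "Congestive heart failure; nonhypertensive",
    "157": "Acute and unspecified renal failure",
}


def icd9_visits_to_ccs(visits):
    """Map each visit's ICD-9 codes to sorted, de-duplicated CCS categories."""
    return [
        sorted({ICD9_TO_CCS[code] for code in visit if code in ICD9_TO_CCS}, key=int)
        for visit in visits
    ]


def build_vocab(visits):
    """Give each CCS category a dense integer index in first-seen order."""
    vocab = {}
    for visit in visits:
        for ccs in visit:
            vocab.setdefault(ccs, len(vocab))
    return vocab


# One synthetic patient with two visits, written as raw ICD-9 codes.
patient = [["4019", "25000"], ["4280", "4019", "5849"]]

ccs_visits = icd9_visits_to_ccs(patient)
vocab = build_vocab(ccs_visits)
encoded = [[vocab[c] for c in visit] for visit in ccs_visits]

# Reverse lookup: integer index -> CCS category -> human-readable label.
index_to_ccs = {idx: ccs for ccs, idx in vocab.items()}
decoded = [[CCS_LABELS[index_to_ccs[i]] for i in visit] for visit in encoded]
```

Grouping by CCS category shrinks the vocabulary the model has to learn, which is why the integer indices are built after the mapping rather than directly from the raw ICD-9 codes.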

This project uses the MIMIC-III dataset, which is freely available but requires credentialing through CITI and MIT. If you have not already been granted access to the database, complete the process described on PhysioNet (https://physionet.org/content/mimiciii/1.4/) before you try to run this project. We highly recommend reading the overview of the dataset before beginning.

This project walks you through an implementation that uses ICD-9 codes from the EMR data in the MIMIC dataset to predict future hospital events. Specifically, it uses the ADMISSIONS.csv files, which contain diagnosis codes. The MIMIC dataset also includes procedure and drug codes. While reading the related papers and working through this analysis, you may want to consider the critical questions for data analysis related to the DSP Course Competencies.
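To make the data preparation concrete, here is a hedged sketch of turning admission records into one time-ordered sequence of visits per patient. It assumes the standard MIMIC-III layout, in which ADMISSIONS.csv carries SUBJECT_ID, HADM_ID, and ADMITTIME and the per-admission ICD-9 codes are read from DIAGNOSES_ICD.csv; the tutorial's own scripts may read the files differently. The rows below are synthetic, and `load_patient_sequences` and its `min_visits` parameter are hypothetical names.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Synthetic stand-ins for the real files (no actual patient data).
ADMISSIONS_CSV = """SUBJECT_ID,HADM_ID,ADMITTIME
1,101,2130-05-02 08:00:00
1,100,2130-01-10 12:30:00
2,200,2141-07-19 03:15:00
"""
DIAGNOSES_CSV = """SUBJECT_ID,HADM_ID,SEQ_NUM,ICD9_CODE
1,100,1,4019
1,100,2,25000
1,101,1,4280
2,200,1,5849
"""


def load_patient_sequences(admissions_file, diagnoses_file, min_visits=2):
    """Return {subject_id: [codes of visit 1, codes of visit 2, ...]}.

    Visits are ordered by admission time. Patients with fewer than
    `min_visits` admissions are dropped, since a single visit gives the
    model nothing to predict.
    """
    admissions = {}
    for row in csv.DictReader(admissions_file):
        when = datetime.strptime(row["ADMITTIME"], "%Y-%m-%d %H:%M:%S")
        admissions[row["HADM_ID"]] = (row["SUBJECT_ID"], when)

    codes_by_admission = defaultdict(list)
    for row in csv.DictReader(diagnoses_file):
        codes_by_admission[row["HADM_ID"]].append(row["ICD9_CODE"])

    visits_by_patient = defaultdict(list)
    for hadm_id, (subject_id, when) in admissions.items():
        visits_by_patient[subject_id].append((when, codes_by_admission[hadm_id]))

    sequences = {}
    for subject_id, visits in visits_by_patient.items():
        if len(visits) < min_visits:
            continue
        ordered = sorted(visits, key=lambda pair: pair[0])
        sequences[subject_id] = [codes for _, codes in ordered]
    return sequences


sequences = load_patient_sequences(
    io.StringIO(ADMISSIONS_CSV), io.StringIO(DIAGNOSES_CSV)
)
```

With the real files, the two `io.StringIO` arguments would be replaced by open file handles; the sorting step matters because rows in the CSV are not guaranteed to be in chronological order.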

Tutorial Highlights

Topics: predicting admission events from patient visit diagnostic codes
Methods: recurrent neural networks (DoctorAI)
Data: MIMIC-III database of demographics, vital signs, test results, medications, etc., for 40K critical care patients
Requirements: credentialed access to the MIMIC-III database, and an Amazon Web Services account to access the AWS Cloud9 development environment

Running the Tutorial

To run this tutorial, you will need access to an Amazon Web Services account and should fill out the [cloud access request form]. You will also need credentialed access to the MIMIC-III database, obtained through the process described in Step 0 of the tutorial. This process can take a while, so please request access as soon as possible.

The tutorial involves reading the raw codes and data from CSV files, processing the EMR codes of interest, and training a neural network to learn which codes follow which over time, so that the trained network can predict future codes. MIMIC-III uses ICD-9 codes, so adapting this project to a more recent dataset that uses ICD-10 codes will take some editing to ensure the codes are translated correctly.

The tutorial should take less than 1 hour to run if no modifications are made.
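One common way to frame "learning which codes follow which in time" is next-visit prediction: each visit becomes a multi-hot vector, and the target for visit t is the vector for visit t+1. The sketch below shows only that framing, under the assumption that codes have already been converted to integer indices; `multi_hot` and `next_visit_pairs` are hypothetical helper names, and the tutorial's scripts may build their training arrays differently.

```python
def multi_hot(visit, num_codes):
    """Encode a visit (a list of integer code indices) as a 0/1 vector."""
    vector = [0] * num_codes
    for index in visit:
        vector[index] = 1
    return vector


def next_visit_pairs(visits, num_codes):
    """Pair every visit except the last with the visit that follows it."""
    inputs = [multi_hot(v, num_codes) for v in visits[:-1]]
    targets = [multi_hot(v, num_codes) for v in visits[1:]]
    return inputs, targets


# One synthetic patient with three visits over a 4-code vocabulary.
patient = [[0, 1], [1, 2, 3], [2]]
inputs, targets = next_visit_pairs(patient, num_codes=4)

for step, (x, y) in enumerate(zip(inputs, targets)):
    print(f"step {step}: input={x} target={y}")
```

A recurrent network reads the inputs in order and is trained so that its output at each step is close to the matching target, which is why a patient needs at least two visits to contribute a training pair.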

Instructions: Full step-by-step instructions for the tutorial can be found here.

Project Extensions and Future Directions

There are many possible directions to take this project. In general, you could update or extend the model to use a different ML technique, which would require more work behind the scenes to develop and test. Alternatively, you could apply this approach to other EMR datasets, as long as you can de-identify the raw information and format it like the CSV files used here. This would require more effort in pre-processing the data.

Ways to extend this project include modifying the model to predict the timing of events, to predict only a specific type of diagnosis, or to predict drug and procedure events in addition to diagnoses. Dozens, if not hundreds, of papers have been published using this dataset (see MIMIC III on Scopus), and you are encouraged to consider them as inspiration for other possible directions.
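For the timing extension, the model needs a numeric target alongside the codes, for example the number of days until the next admission. The following is a small sketch of computing that target from admission timestamps; `days_between_visits` is a hypothetical helper and the timestamps are synthetic.

```python
from datetime import datetime


def days_between_visits(admit_times):
    """Return the gap in days between each pair of consecutive admissions."""
    ordered = sorted(admit_times)
    return [
        (later - earlier).total_seconds() / 86400.0
        for earlier, later in zip(ordered, ordered[1:])
    ]


# Synthetic admission times for one patient, deliberately out of order.
admit_times = [
    datetime(2130, 1, 20, 12, 0),
    datetime(2130, 1, 10, 12, 0),
    datetime(2130, 1, 25, 6, 0),
]
gaps = days_between_visits(admit_times)
```

Each gap lines up with one input visit: the first gap is the time-to-next-visit target for the first admission, and so on, so a patient with n visits yields n-1 timing targets.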

Another possibility would be to adapt the model to look for very specific conditions as an early warning flag instead of predicting all possible future codes. This would require modification of the pre-processing of the data provided, as well as an extension of the interpretation of the model’s prediction output to make it more specific to the desired output.
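The two changes mentioned above (pre-processing and output interpretation) might look roughly like this. The sketch assumes visits are lists of category codes and that the model emits one score per code index; `early_warning_labels`, `raise_flag`, and the 0.5 threshold are hypothetical choices for illustration, not part of the tutorial code.

```python
def early_warning_labels(visits, target_codes):
    """For each visit except the last, label 1 if the NEXT visit contains
    any of the target codes, else 0."""
    targets = set(target_codes)
    return [int(bool(targets & set(next_visit))) for next_visit in visits[1:]]


def raise_flag(scores, target_indices, threshold=0.5):
    """Reduce a per-code score vector to a single yes/no warning by checking
    whether any target code's score reaches the threshold."""
    return max(scores[i] for i in target_indices) >= threshold


# Synthetic patient: the condition of interest (category "108") first
# appears in the third visit.
patient = [["98", "49"], ["98"], ["108", "157"]]
labels = early_warning_labels(patient, target_codes={"108"})

# Synthetic model output over a 4-code vocabulary; index 2 is the target.
flag = raise_flag([0.10, 0.30, 0.72, 0.05], target_indices=[2])
```

Training against binary labels like these turns the task from predicting every future code into a single-condition classifier, and the threshold then controls the trade-off between missed warnings and false alarms.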

Yet another possibility would be to change the focus of the model from diagnostic codes to drug codes or procedure codes, to see if there is any predictive model that might work for drugs or procedures in addition to diagnoses.

This model is relatively fast to train, so feel free to explore these or other options.

Potential Project Resources

Datasets at healthcare.gov

Electronic Medical Records Data Science Project

by the team

Last Updated: June 23, 2025


