Image Analysis Data Science Project

Overview

Medical image processing is a rapidly evolving field with an important impact on clinical research and practice (2019 review). The investigative tasks involved are varied, including detection and registration of cellular entities, image segmentation and classification, and computer-aided diagnosis and outcome prediction. For this project, we have selected a tutorial that mimics a recent feasibility study, which applied the Google Cloud AutoML Vision tool to train a deep neural network on over 100,000 NIH Chest X-Ray (CXR) images to learn to predict from among 14 different pathologies. The main purpose of this tutorial is to gain insight into how to evaluate and interpret the quality of black-box deep learning models. The CXR dataset, Google AutoML tools, and ideas presented in the tutorial may be useful in guiding potential data science projects. While reading the related papers and working through this analysis, you may want to consider the critical questions for data analysis related to the DSP Course Competencies.
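As one illustration of what evaluating a black-box classifier can involve, the sketch below computes per-pathology precision and recall from predicted scores at a chosen threshold. It is a minimal example under stated assumptions, not the tutorial's or AutoML Vision's own evaluation code: the function name per_label_metrics, the 0.5 threshold, and the toy scores are all hypothetical.

```python
from typing import Dict, List, Sequence


def per_label_metrics(
    y_true: Sequence[Sequence[int]],
    y_score: Sequence[Sequence[float]],
    labels: Sequence[str],
    threshold: float = 0.5,  # assumed cutoff; in practice it is tuned per use case
) -> Dict[str, Dict[str, float]]:
    """Precision, recall, and support for each label of a multi-label classifier."""
    results: Dict[str, Dict[str, float]] = {}
    for j, name in enumerate(labels):
        tp = fp = fn = 0
        for truth, scores in zip(y_true, y_score):
            predicted = scores[j] >= threshold
            actual = bool(truth[j])
            if predicted and actual:
                tp += 1
            elif predicted and not actual:
                fp += 1
            elif actual:
                fn += 1
        # Guard against division by zero when a label is never predicted or never present.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        results[name] = {"precision": precision, "recall": recall, "support": tp + fn}
    return results


# Toy data (hypothetical): four images, two pathologies.
labels: List[str] = ["Effusion", "Pneumonia"]
y_true = [[1, 0], [1, 1], [0, 0], [0, 1]]
y_score = [[0.9, 0.2], [0.4, 0.7], [0.6, 0.1], [0.1, 0.3]]

metrics = per_label_metrics(y_true, y_score, labels)
for name, m in metrics.items():
    print(name, m)
```

Reporting metrics per pathology rather than as a single overall accuracy matters here because rare findings can be missed almost entirely while the aggregate number still looks good. Re-running the function with different threshold values shows how precision trades off against recall.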

Tutorial Highlights

Topics: image classification, chest x-rays, convolutional neural networks, model training and evaluation
Methods: AutoML Vision - convolutional neural network with automated architecture search
Data: over 100,000 NIH chest x-ray images labelled with 14 different pathologies
Requirements: Google Cloud Project Account, minimal command line skills

Running the Tutorial

To run this tutorial, you will need access to a Google Cloud Platform account and should fill out the [cloud access request form]. Once your account has been set up, you will be able to create personal Google Cloud Projects and access Google’s AutoML Vision interfaces for automated architecture search and training of convolutional neural network models. During the tutorial, you will create your own copy of the publicly available NIH Chest X-Ray dataset, which has over 100,000 labeled images for model training. Completing the tutorial on the full training set involves several steps that take at least an hour to run, but it results in a model that can predict the pathology for a given chest x-ray image. After the tutorial and any project extensions requiring Google Cloud are complete, you will need to delete your Google Cloud projects to halt recurring costs.

Instructions: Full step-by-step instructions for the tutorial can be found here.

Project Extensions and Future Directions

There are many possibilities to extend beyond the tutorial and create a data science project. Projects might focus on how different ways of defining the training input images, labels, and parameters can influence the accuracy and value of the final classification model. Another possibility is using the Google AutoML APIs to build a mobile app that can apply the learned model to medical images stored on a tablet or phone. Alternatively, the Stanford CheXpert and MIT MIMIC datasets are similar, large chest x-ray collections that can be used to examine the generalizability and difficulties of applying deep learning models across different hospital settings. One could also build models from several other public datasets of labeled medical images, including musculoskeletal radiographs, white blood cells, and cancer. A project may also be designed to investigate building models with higher-dimensional images (3D scans or time-lapse videos), with additional genomic or clinical inputs, or with a focus on tasks other than classification, such as image segmentation.
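To make the idea of "different ways of defining the labels" concrete, the sketch below derives two alternative training targets from the same annotation string: a multi-hot vector with one entry per pathology, and a single binary "any finding" label. It assumes each image's findings are given as a pipe-separated string such as "Cardiomegaly|Effusion"; that format, the three-pathology subset, and the function names are assumptions made for illustration and should be checked against the actual label file.

```python
from typing import List, Sequence

# Hypothetical subset of the 14 pathologies, kept short for illustration.
PATHOLOGIES: List[str] = ["Cardiomegaly", "Effusion", "Pneumonia"]


def encode_labels(finding: str, pathologies: Sequence[str]) -> List[int]:
    """Multi-hot encoding: 1 where a pathology appears in the finding string."""
    # Assumes findings are pipe-separated, e.g. "Cardiomegaly|Effusion".
    present = {part.strip() for part in finding.split("|")}
    return [1 if p in present else 0 for p in pathologies]


def any_finding(finding: str, pathologies: Sequence[str]) -> int:
    """Binary target: 1 if at least one listed pathology is present, else 0."""
    return int(any(encode_labels(finding, pathologies)))


for text in ["Cardiomegaly|Effusion", "No Finding", "Pneumonia"]:
    print(text, encode_labels(text, PATHOLOGIES), any_finding(text, PATHOLOGIES))
```

Training one model on the multi-hot targets and another on the binary target, then comparing their evaluation results, is one simple way to study how label definition affects model quality.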

by the team

Last Updated: June 23, 2025

Potential Project Resources

Tutorial Related

University Related