swilldd/GDM2T2D

FastAPI docs for this project

Gestational Diabetes to Type 2 Diabetes Risk Prediction

Problem Overview

This project explores predicting the risk of Type 2 diabetes in women with a history of gestational diabetes. According to Dennison et al. (2020), 33% of women who have had gestational diabetes go on to develop Type 2 diabetes within 15 years. Being able to predict at-risk patients reliably and explainably allows health services to take proactive action.

The Dataset

This dataset comes from the study by Prashanthan and Prashanthan (2025). The data was subsequently uploaded to Kaggle.

Original dataset: Data for T2DM Risk prediction after GDM

Original study paper: Predicting the future risk of developing type 2 diabetes in women with a history of gestational diabetes mellitus using machine learning and explainable artificial intelligence

Project Approach

To summarise the steps taken in this project:

  • EDA: Explore the dataset, the feature and class distributions, and discover feature importance
  • Model building and fine-tuning: Test Decision Tree, Random Forest and XGBoost models to see which performs best while minimising Type II errors
  • Model selection: Select the best model based on evaluation metrics and use case
  • Serialisation: Serialise the model using pickle so it can be deployed later
  • FastAPI service: Serve predictions using the pipeline produced by the best model
  • Containerisation of the FastAPI service: Allow for easy deployment

Project Structure

A breakdown of how this repository is organised can be found here.

Running this project

This project provides a machine-learning prediction service for estimating the risk of progression from Gestational Diabetes Mellitus (GDM) to Type 2 Diabetes Mellitus (T2DM). It includes:

  • A training pipeline (offline model building)
  • A FastAPI inference service (real-time predictions)
  • A fully containerized deployment using Docker

Prerequisites

Please ensure you have uv installed. Installation instructions can be found here: https://docs.astral.sh/uv/getting-started/installation/

Using Deployed API Service

The API has been deployed using Render, so you can test the endpoint at the following URL:

https://gdm2t2d.onrender.com/docs

As this is a free deployment resource, the service can take around a minute to load.

Local inference (Docker)

Alternatively, you can run the service locally using Docker. This requires Docker Engine to be installed on your computer.

  1. Clone the repository
git clone https://github.com/ShaniceWilliams/GDM2T2D.git
cd GDM2T2D
  2. Set up the local environment. Run the following command in your terminal:
uv sync
  3. Build the Docker image. Run the following command:
docker build -t gdm2t2d-predict .
  4. Run the Docker image. You can then run the image using the following command:
docker run -it --rm -p 9696:9696 gdm2t2d-predict

You can now access the service at http://localhost:9696/docs
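Once the container is running, the service can also be called from a script instead of the docs page. The following is a minimal sketch using only the Python standard library. The `/predict` route and the feature names in the usage note are assumptions made for illustration; the real route and request schema are listed on the `/docs` page.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9696"  # port taken from the docker run command
PREDICT_PATH = "/predict"  # hypothetical route: check /docs for the real one


def build_prediction_request(patient: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for one patient record."""
    body = json.dumps(patient).encode("utf-8")
    return urllib.request.Request(
        url=BASE_URL + PREDICT_PATH,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(patient: dict, timeout: float = 30.0) -> dict:
    """Send the request to the running service and decode the JSON reply."""
    request = build_prediction_request(patient)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())
```

With the container running, something like `predict({"age": 32, "bmi": 27.5})` would send one record. Those two field names are placeholders, not the project's actual schema.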

Local inference (Non Docker)

  1. Clone the repository
git clone https://github.com/ShaniceWilliams/GDM2T2D.git
cd GDM2T2D
  2. Set up the local environment. Run the following command to create a virtual environment and install dependencies:
uv sync
  3. Run the training pipeline. Run the following command in your terminal:
python -m src.pipeline.train
  4. Make predictions. Run the following command in your terminal:
python -m src.pipeline.predict

Deployed Model Summary

  • Algorithm: Random Forest Classifier
  • Vectorization: DictVectorizer
  • Evaluation metrics: AUC, Recall, Precision, F1
  • Prioritizes minimizing false negatives (Type II error) due to clinical risk.
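To make the vectorization step concrete, here is a small pure-Python sketch of the kind of encoding DictVectorizer performs by default: numeric features keep their own column, while each string-valued feature becomes one `name=value` indicator column. This is an illustration of the idea rather than the library's implementation, and the feature names (`age`, `ethnicity`) are hypothetical, not taken from the dataset.

```python
def fit_feature_names(records):
    """Collect sorted column names from a list of feature dicts.

    Numeric features keep their name; string features become one
    'name=value' column per observed category.
    """
    names = set()
    for record in records:
        for key, value in record.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    return sorted(names)


def transform(records, feature_names):
    """Turn feature dicts into dense rows aligned with feature_names."""
    index = {name: i for i, name in enumerate(feature_names)}
    rows = []
    for record in records:
        row = [0.0] * len(feature_names)
        for key, value in record.items():
            if isinstance(value, str):
                column, cell = f"{key}={value}", 1.0
            else:
                column, cell = key, float(value)
            if column in index:  # columns not seen during fitting are ignored
                row[index[column]] = cell
        rows.append(row)
    return rows


# Hypothetical patient records, for illustration only.
patients = [
    {"age": 30, "ethnicity": "A"},
    {"age": 35, "ethnicity": "B"},
]
columns = fit_feature_names(patients)
matrix = transform(patients, columns)
```

The resulting rows are what a tree-based classifier such as a Random Forest would consume; in the actual service this step is handled by the fitted vectorizer stored with the model.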

How did we measure success?

This problem was explored by Prashanthan and Prashanthan (2025), whose study achieved the following metrics using an AdaBoost classifier:

  • Accuracy - 0.89
  • Precision - 0.74
  • Recall - 0.88
  • F1 score - 0.80

This project aimed to match or exceed this performance whilst minimising Type II errors (false negatives), which would lead to support not being directed to the mothers who need it most.
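As a quick reference for how these metrics relate to false negatives, the sketch below computes them from confusion-matrix counts and checks that the reported F1 score follows from the reported precision and recall. The confusion-matrix counts are made up for illustration and do not come from the study or from this project's results.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion counts.

    fn (false negatives) are the Type II errors: at-risk patients
    the model fails to flag. They lower recall, not precision.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


# Consistency check on the published figures (precision 0.74, recall 0.88).
reported_f1 = round(f1_from(0.74, 0.88), 2)

# Illustrative counts only: 50 true positives cases in total, 6 of them missed.
example = classification_metrics(tp=44, fp=15, fn=6, tn=135)
```

Because recall is `tp / (tp + fn)`, every additional false negative reduces it directly, which is why recall is the metric to watch when missed at-risk patients are the costliest error.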

This was achieved, with the final model achieving the following metrics:

Results of Random Forest Model Applied to test data

More detail can be found here.

Future Feature Enhancements

Testing other explainable ML models

The original paper found AdaBoost to be the best model, which I have yet to test due to time constraints. In the future I would like to test this model and compare it with my current best model.

Visualise feature importances

Feature importances were explored to some extent, but given how important it is that the model's decisions are explainable, it would be beneficial to visualise them.

Streamlit Dashboard Development

For real-life application, a dashboard would need to be implemented that allows the user to take the following actions:

  • Complete a form to enter the features for a new patient
  • Create new predictions based on the content of the form
  • Show the prediction and the next steps required
  • Visualise the features that contributed most to the individual prediction
