Gestational Diabetes to Type 2 Diabetes Risk Prediction

Problem Overview

This project explores predicting the risk of Type 2 diabetes in women with a history of gestational diabetes. According to Dennison et al. (2020), 33% of women who have had gestational diabetes go on to develop Type 2 diabetes within 15 years. Predicting at-risk patients reliably and explainably allows health services to take proactive action.

The Dataset

This dataset comes from a study by Prashanthan and Prashanthan (2025). Following the study, the data was uploaded to Kaggle.

Original dataset: Data for T2DM Risk prediction after GDM
Original study paper: Predicting the future risk of developing type 2 diabetes in women with a history of gestational diabetes mellitus using machine learning and explainable artificial intelligence

Project Approach

To summarise the steps taken in this project:

EDA: Explore the dataset, including feature and class distributions, and assess feature importance
Model building and fine-tuning: Test Decision Tree, Random Forest and XGBoost models to see which performs best while minimising Type II errors
Model selection: Select the best model based on evaluation metrics and the use case
Serialisation: Serialise the model using pickle for later deployment
FastAPI Service: Serve predictions using the pipeline produced for the best model
Containerisation of FastAPI Service: To allow for easy deployment

Project Structure

To understand how this project is organised, please find a breakdown of how this repository is structured here.

Running this project

This project provides a machine-learning prediction service for estimating the risk of progression from Gestational Diabetes Mellitus (GDM) to Type 2 Diabetes Mellitus (T2DM). It includes:

A training pipeline (offline model building)
A FastAPI inference service (real-time predictions)
A fully containerized deployment using Docker

Prerequisites

Please ensure you have uv installed. Installation instructions can be found here: https://docs.astral.sh/uv/getting-started/installation/

Using Deployed API Service

The endpoint has been deployed using Render, so you can visit the following URL to test it:

https://gdm2t2d.onrender.com/docs

As this is a free deployment resource, it will take around 1 minute for the service to load.
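Because of this start-up delay, a script that calls the service may want to wait until it responds. The following is a minimal sketch using only the Python standard library; the function names, the number of attempts and the delay are assumptions chosen for illustration and are not part of this repository.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

# Interactive docs page of the deployed service (URL taken from this README).
DOCS_URL = "https://gdm2t2d.onrender.com/docs"


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status.
        return err.code


def wait_until_ready(
    url: str = DOCS_URL,
    attempts: int = 12,            # assumed value, adjust as needed
    delay_seconds: float = 10.0,   # assumed value, adjust as needed
    get_status: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until it returns HTTP 200 or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            if get_status(url) == 200:
                return True
        except OSError:
            # Connection refused, DNS failure or timeout: not ready yet.
            pass
        if attempt < attempts:
            sleep(delay_seconds)
    return False
```

Calling `wait_until_ready()` with its defaults polls the docs page for up to roughly two minutes and returns `True` as soon as the page loads.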

Local inference (Docker)

Alternatively, you can run the service using Docker on your own machine. This requires Docker Engine to be installed.

Clone the repository

git clone https://github.com/ShaniceWilliams/GDM2T2D.git
cd GDM2T2D

Set up local environment

Run the following command in your terminal:

uv sync

Build the Docker image

Run the following command to build the Docker image:

docker build -t gdm2t2d-predict .

Run the Docker image

Then run the image using the following command:

docker run -it --rm -p 9696:9696 gdm2t2d-predict

You can now access the service at http://0.0.0.0:9696/docs. If your browser cannot open 0.0.0.0, use http://localhost:9696/docs instead.

Local inference (Non Docker)

Clone the repository

git clone https://github.com/ShaniceWilliams/GDM2T2D.git
cd GDM2T2D

Set up local environment

Run the following command to create a virtual environment and install dependencies:

uv sync

Run training pipeline

Run the following command in your terminal:

python -m src.pipeline.train

Make predictions

Run the following command in your terminal:

python -m src.pipeline.predict

Deployed Model Summary

Algorithm: Random Forest Classifier
Vectorization: DictVectorizer
Evaluation metrics: AUC, Recall, Precision, F1
Priority: minimising false negatives (Type II errors), because of the clinical risk of missing at-risk patients
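To illustrate how these pieces fit together, here is a minimal sketch that chains a DictVectorizer and a Random Forest classifier with scikit-learn, then round-trips the fitted pipeline through pickle. It is not the project's training code: the feature names, the records, the labels and the hyperparameters are all made up for illustration.

```python
import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical patient records: these feature names and values are
# illustrative only and are not taken from the project's dataset.
records = [
    {"age": 29, "bmi": 22.5, "family_history": "no"},
    {"age": 41, "bmi": 33.1, "family_history": "yes"},
    {"age": 35, "bmi": 27.8, "family_history": "no"},
    {"age": 38, "bmi": 31.4, "family_history": "yes"},
    {"age": 26, "bmi": 21.0, "family_history": "no"},
    {"age": 44, "bmi": 35.2, "family_history": "yes"},
]
labels = [0, 1, 0, 1, 0, 1]  # 1 = progressed to T2DM (made-up labels)

# DictVectorizer one-hot encodes string values and passes numbers through;
# the Random Forest is then trained on the resulting feature matrix.
pipeline = make_pipeline(
    DictVectorizer(sparse=False),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
pipeline.fit(records, labels)

# Serialise and restore the whole pipeline, as a deployment step would.
restored = pickle.loads(pickle.dumps(pipeline))

# Score a new (hypothetical) patient with the restored pipeline.
new_patient = {"age": 39, "bmi": 30.0, "family_history": "yes"}
risk = restored.predict_proba([new_patient])[0, 1]
print(f"Estimated probability of T2DM: {risk:.2f}")
```

Pickling the vectorizer and the classifier together as one pipeline means the service can accept a plain dictionary per patient and apply the same encoding that was used at training time.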

How did we measure success?

This problem was explored by Prashanthan and Prashanthan (2025), whose study achieved the following metrics using an AdaBoost classifier:

Accuracy - 0.89
Precision - 0.74
Recall - 0.88
F1 score - 0.80

This project aimed to match or exceed this performance while minimising Type II errors (false negatives), which would result in support not reaching the mothers who need it most.

This was achieved, with the final model achieving the following metrics:

More detail can be found here.

Future Feature Enhancements

Testing other explainable ML models

The original paper found AdaBoost to be the best model, which I have yet to test due to time constraints. In the future I would like to test this model and compare it with my current best model.

Visualise feature importances

Feature importances were explored to some extent, but given the importance of explainable model decisions, it would be beneficial to visualise them.

Streamlit Dashboard Development

For real-life application, a dashboard would need to be implemented that allows the user to take the following actions:

Complete a form to indicate the features for new patients
Create new predictions based on the content of the form
Show the prediction and the next steps required
Visualise the features that contributed most to the individual prediction
