anshcpatel11/stroke-risk-predictor

Stroke Risk Predictor

Ansh Patel · github.com/anshcpatel11

End-to-end machine learning project predicting stroke risk from patient health records, with a deployed interactive Streamlit application. The model accepts patient demographics and clinical measurements and returns a calibrated stroke probability with a Low / Moderate / High risk tier.


Live App

Try it here →

Or run locally:

pip install -r requirements.txt
streamlit run app.py

Problem

Stroke is a leading cause of death and long-term disability worldwide. Early identification of high-risk patients enables preventive intervention — medication, lifestyle changes, and closer monitoring. This project builds a stroke risk classifier using the UCI Stroke Prediction dataset (5,109 records) where only 4.9% of patients experienced a stroke, making this a severely imbalanced binary classification problem.
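To make the imbalance concrete, the short calculation below uses the class counts reported in the Dataset section (4,860 no stroke, 249 stroke) and shows what a trivial "always predict no stroke" classifier would score. It is an illustrative sketch only and is not taken from the project code.

```python
# Class counts as reported in the Dataset section of this README.
n_no_stroke = 4860
n_stroke = 249
n_total = n_no_stroke + n_stroke

# A trivial baseline that predicts "no stroke" for every patient.
baseline_accuracy = n_no_stroke / n_total  # right on every negative case
baseline_recall = 0 / n_stroke             # finds none of the stroke cases
imbalance_ratio = n_no_stroke / n_stroke   # negatives per positive

print(f"total records:     {n_total}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"baseline recall:   {baseline_recall:.3f}")
print(f"imbalance ratio:   {imbalance_ratio:.1f}:1")
```

Because this baseline reaches roughly 95% accuracy while detecting no strokes at all, accuracy alone says little here; recall, precision, F1, and ROC-AUC are the more informative metrics.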


Key Techniques

  • Clinical feature engineering: ADA glucose risk tiers (Normal / Prediabetic / Diabetic), WHO BMI categories, stroke-relevant age bands (Under 40 / 40-55 / 55-65 / Over 65), cardiovascular comorbidity score, and age × hypertension / age × heart disease interaction features
  • Random oversampling: training set resampled to a 4:1 majority-to-minority ratio to handle the roughly 20:1 class imbalance; applied to training data only to prevent leakage
  • Stacking ensemble: Logistic Regression + Random Forest + HistGradientBoosting with a Logistic Regression meta-learner trained via 5-fold stratified cross-validation
  • Threshold optimization: precision-recall sweep across 200 thresholds; optimal threshold found at 0.434
  • Permutation importance: model-agnostic feature interpretation that avoids the high-cardinality bias of tree-based importance
  • Streamlit deployment: interactive app with gauge chart, risk tier badge, and per-patient risk factor visualization

Results

| Model | Accuracy | Precision | Recall | F1 | ROC-AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.7182 | 0.1281 | 0.82 | 0.2216 | 0.8275 |
| HistGradientBoosting | 0.8836 | 0.1947 | 0.44 | 0.2699 | 0.8186 |
| Stacking Ensemble | 0.8982 | 0.1786 | 0.30 | 0.2239 | 0.7864 |
| Stacking (opt. threshold=0.434) | 0.8963 | 0.2083 | 0.40 | 0.2740 | 0.7864 |
| Random Forest | 0.8102 | 0.1786 | 0.80 | 0.2920 | 0.8343 |

Random Forest achieved the best F1 (0.2920) and ROC-AUC (0.8343). Its recall of 0.80 means it correctly identifies 80% of true stroke cases, second only to Logistic Regression (0.82), which reaches that recall at lower precision (0.1281 vs. 0.1786). Recall is the most clinically important metric in this context, where missing a high-risk patient is more costly than a false alarm.
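The thresholded stacking row in the table comes from a precision-recall sweep over 200 thresholds. The sketch below shows one way such a sweep could work, assuming F1 is the selection criterion (the README does not state which metric was optimized). The function names, the evenly spaced threshold grid, and the toy scores are all hypothetical; only the F1 formula and the Random Forest precision and recall values come from the results above.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(y_true, y_prob, threshold):
    """Metrics when a case is flagged positive at probability >= threshold."""
    tp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 0)
    fn = sum(1 for t, p in zip(y_true, y_prob) if p < threshold and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


def best_f1_threshold(y_true, y_prob, n_thresholds=200):
    """Sweep evenly spaced thresholds in (0, 1); keep the first with top F1."""
    best_t, best_f1 = 0.5, -1.0
    for i in range(1, n_thresholds + 1):
        t = i / (n_thresholds + 1)
        _, _, f1 = precision_recall_f1(y_true, y_prob, t)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy labels and predicted probabilities (made up for illustration).
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.80]

threshold, f1 = best_f1_threshold(y_true, y_prob)
print(f"best threshold: {threshold:.3f}, F1: {f1:.3f}")

# Sanity check against the table: Random Forest precision and recall.
rf_f1 = f1_from(0.1786, 0.80)
print(f"Random Forest F1 from table values: {rf_f1:.4f}")
```

Lowering the threshold trades precision for recall, which is why the optimized stacking row gains recall (0.30 to 0.40) relative to the default 0.5 cutoff.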


Repo Structure

├── data/
│   └── healthcare-dataset-stroke-data.csv
├── notebooks/
│   └── stroke_risk_analysis.ipynb    ← EDA, feature engineering, model development
├── plots/                            ← All visualizations (generated by notebook)
│   ├── eda_overview.png
│   ├── eda_distributions.png
│   ├── eda_categorical.png
│   ├── engineered_feature_rates.png
│   ├── roc_pr_curves.png
│   ├── threshold_optimization.png
│   ├── confusion_matrices.png
│   └── permutation_importance.png
├── app.py                            ← Streamlit inference app
├── train.py                          ← Standalone training script (saves pkl)
└── requirements.txt

Setup

pip install -r requirements.txt

# Launch the app directly — model trains at startup
streamlit run app.py

To train and save the model as a standalone pipeline file instead:

python train.py

Dataset

UCI Stroke Prediction dataset — included in data/, no download needed.

Source: Kaggle — Stroke Prediction Dataset

Features: age, gender, hypertension, heart disease, marital status, work type, residence type, average glucose level, BMI, smoking status.
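From these raw columns, the engineered features named under Key Techniques could be derived roughly as follows. This is a hedged sketch, not the notebook's code: the cutoffs are assumptions (ADA fasting-glucose thresholds of 100 and 126 mg/dL, WHO adult BMI thresholds of 18.5, 25, and 30), the handling of band edges such as age 55 or 65 is a guess, the comorbidity score is assumed to be the sum of the two binary flags, and the field names (`avg_glucose_level`, `heart_disease`, and so on) are assumed to match the CSV headers.

```python
def glucose_tier(avg_glucose_level):
    """Glucose tier using ADA fasting cutoffs in mg/dL (assumed cutoffs)."""
    if avg_glucose_level < 100:
        return "Normal"
    if avg_glucose_level < 126:
        return "Prediabetic"
    return "Diabetic"


def bmi_category(bmi):
    """WHO adult BMI category; None marks a missing measurement."""
    if bmi is None:
        return "Unknown"
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Normal"
    if bmi < 30:
        return "Overweight"
    return "Obese"


def age_band(age):
    """Age bands from the README; edge handling (e.g. age 65) is assumed."""
    if age < 40:
        return "Under 40"
    if age < 55:
        return "40-55"
    if age < 65:
        return "55-65"
    return "Over 65"


def engineer_features(record):
    """Return a copy of the record with the derived features added."""
    features = dict(record)
    features["glucose_tier"] = glucose_tier(record["avg_glucose_level"])
    features["bmi_category"] = bmi_category(record.get("bmi"))
    features["age_band"] = age_band(record["age"])
    # Assumed definition: number of cardiovascular conditions present (0-2).
    features["comorbidity_score"] = record["hypertension"] + record["heart_disease"]
    # Interaction terms: age counts only when the condition flag is 1.
    features["age_x_hypertension"] = record["age"] * record["hypertension"]
    features["age_x_heart_disease"] = record["age"] * record["heart_disease"]
    return features


# Hypothetical patient record for illustration.
patient = {
    "age": 67, "hypertension": 0, "heart_disease": 1,
    "avg_glucose_level": 228.69, "bmi": 36.6,
}
engineered = engineer_features(patient)
print(engineered)
```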

Class distribution: 4,860 no stroke (95.1%) / 249 stroke (4.9%)
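Given this distribution, random oversampling to the 4:1 ratio mentioned under Key Techniques amounts to duplicating minority rows until there is one stroke case per four non-stroke cases. The sketch below is a standard-library illustration with a hypothetical `random_oversample` helper. It uses the full-dataset counts purely for the arithmetic, whereas the project resamples the training split only, and the project's own resampling code may differ.

```python
import random


def random_oversample(rows, labels, ratio=4, seed=0):
    """Duplicate random minority rows until majority:minority is about ratio:1."""
    rng = random.Random(seed)
    majority = [r for r, y in zip(rows, labels) if y == 0]
    minority = [r for r, y in zip(rows, labels) if y == 1]
    if not minority:
        raise ValueError("no minority rows to resample")
    target_minority = len(majority) // ratio
    n_extra = max(0, target_minority - len(minority))
    # Sampling with replacement: the same row may be duplicated several times.
    extra = [rng.choice(minority) for _ in range(n_extra)]
    new_rows = majority + minority + extra
    new_labels = [0] * len(majority) + [1] * (len(minority) + n_extra)
    return new_rows, new_labels


# Row ids stand in for patient records; labels follow the counts above.
rows = list(range(4860 + 249))
labels = [0] * 4860 + [1] * 249

new_rows, new_labels = random_oversample(rows, labels, ratio=4)
n_majority = new_labels.count(0)
n_minority = new_labels.count(1)
print(f"before: 4860 vs 249, after: {n_majority} vs {n_minority}")
```

Resampling after the train/test split matters because duplicated rows that land in both splits would let the model be scored on patients it has already seen.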


References

  • WHO. (2022). Cardiovascular diseases fact sheet.
  • American Diabetes Association. (2023). Standards of Medical Care in Diabetes.
  • Pedregosa et al. (2011). Scikit-learn: Machine Learning in Python. JMLR 12.

About

End-to-end stroke risk prediction with a live Streamlit app. Stacking ensemble with clinical feature engineering and oversampling on the UCI Stroke dataset. Best model: F1 = 0.2920, AUC = 0.8343.
