Stroke Risk Predictor

Ansh Patel — github.com/anshcpatel11

End-to-end machine learning project predicting stroke risk from patient health records, with a deployed interactive Streamlit application. The model accepts patient demographics and clinical measurements and returns a calibrated stroke probability with a Low / Moderate / High risk tier.

Live App

Try it here →

Or run locally:

pip install -r requirements.txt
streamlit run app.py

Problem

Stroke is a leading cause of death and long-term disability worldwide. Early identification of high-risk patients enables preventive intervention: medication, lifestyle changes, and closer monitoring. This project builds a stroke risk classifier using the Kaggle Stroke Prediction dataset (5,109 records). Only 4.9% of patients in it experienced a stroke, which makes this a severely imbalanced binary classification problem.

Key Techniques

Clinical feature engineering — ADA glucose risk tiers (Normal / Prediabetic / Diabetic), WHO BMI categories, stroke-relevant age bands (Under 40 / 40-55 / 55-65 / Over 65), cardiovascular comorbidity score, and age × hypertension / age × heart disease interaction features
Random oversampling — training set resampled to 4:1 majority-to-minority ratio to handle extreme 20:1 class imbalance; applied to training data only to prevent leakage
Stacking ensemble — Logistic Regression + Random Forest + HistGradientBoosting with a Logistic Regression meta-learner trained via 5-fold stratified cross-validation
Threshold optimization — precision-recall sweep across 200 thresholds; optimal found at 0.434
Permutation importance — model-agnostic feature interpretation that avoids the high-cardinality bias of tree-based importance
Streamlit deployment — interactive app with gauge chart, risk tier badge, and per-patient risk factor visualization
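The clinical feature engineering step can be sketched in plain Python as follows. This is a minimal illustration, not the code from the notebook. The glucose cutoffs (100 and 126 mg/dL) are the ADA fasting-glucose thresholds and the BMI cutoffs are the standard WHO adult categories. How boundary ages are binned, the definition of the comorbidity score as hypertension + heart disease, and the output key names are all assumptions. Missing values are assumed to have been imputed beforehand.

```python
def glucose_tier(avg_glucose):
    """ADA-style tier. Cutoffs are fasting-glucose thresholds (assumed to apply here)."""
    if avg_glucose < 100:
        return "Normal"
    if avg_glucose < 126:
        return "Prediabetic"
    return "Diabetic"


def bmi_category(bmi):
    """WHO adult BMI categories."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Normal"
    if bmi < 30:
        return "Overweight"
    return "Obese"


def age_band(age):
    """Age bands from the README; boundary handling (40, 55, 65) is an assumption."""
    if age < 40:
        return "Under 40"
    if age < 55:
        return "40-55"
    if age <= 65:
        return "55-65"
    return "Over 65"


def engineer_features(record):
    """Return a copy of one patient record with the derived features added.

    Input keys are assumed to match the CSV column names:
    age, hypertension, heart_disease, avg_glucose_level, bmi.
    """
    out = dict(record)
    out["glucose_tier"] = glucose_tier(record["avg_glucose_level"])
    out["bmi_category"] = bmi_category(record["bmi"])
    out["age_band"] = age_band(record["age"])
    # Hypothetical comorbidity score: count of the two binary cardiovascular flags.
    out["comorbidity_score"] = record["hypertension"] + record["heart_disease"]
    # Interaction features: age multiplied by each binary flag.
    out["age_x_hypertension"] = record["age"] * record["hypertension"]
    out["age_x_heart_disease"] = record["age"] * record["heart_disease"]
    return out


# Example patient (made-up values).
patient = {"age": 67, "hypertension": 1, "heart_disease": 0,
           "avg_glucose_level": 130.0, "bmi": 27.5}
print(engineer_features(patient))
```

The categorical outputs would still need encoding (for example one-hot) before being passed to the models.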

Results

Model	Accuracy	Precision	Recall	F1	ROC-AUC
Logistic Regression	0.7182	0.1281	0.82	0.2216	0.8275
HistGradient Boosting	0.8836	0.1947	0.44	0.2699	0.8186
Stacking Ensemble	0.8982	0.1786	0.30	0.2239	0.7864
Stacking (opt. threshold=0.434)	0.8963	0.2083	0.40	0.2740	0.7864
Random Forest	0.8102	0.1786	0.80	0.2920	0.8343

Random Forest achieved the best F1 (0.2920) and ROC-AUC (0.8343). Its recall of 0.80 means it correctly identifies 80% of true stroke cases, second only to Logistic Regression (0.82), which has lower precision and accuracy. Recall is the most clinically important metric in this context, where missing a high-risk patient is more costly than a false alarm.
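For readers who want to see how the F1 column and the optimized threshold relate, here is a small dependency-free sketch. It assumes the threshold sweep maximizes F1, which the README does not state explicitly. The function names and the toy labels and probabilities are illustrative only.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(y_true, y_prob, threshold):
    """Classify as positive when probability >= threshold, then score."""
    tp = fp = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = p >= threshold
        if predicted and y == 1:
            tp += 1
        elif predicted and y == 0:
            fp += 1
        elif not predicted and y == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = f1_from(precision, recall) if precision + recall else 0.0
    return precision, recall, f1


def best_threshold(y_true, y_prob, n_thresholds=200):
    """Sweep evenly spaced thresholds in (0, 1); keep the first one with the highest F1."""
    best_t, best_f1 = 0.5, -1.0
    for i in range(1, n_thresholds + 1):
        t = i / (n_thresholds + 1)
        _, _, f1 = precision_recall_f1(y_true, y_prob, t)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Random Forest row of the table: precision 0.1786, recall 0.80.
print(round(f1_from(0.1786, 0.80), 4))

# Toy data (made up) to exercise the sweep.
y_true = [0, 0, 0, 1, 0, 1, 1]
y_prob = [0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.90]
print(best_threshold(y_true, y_prob))
```

Because the table reports rounded precision and recall, recomputing F1 from those columns may differ from the reported F1 in the fourth decimal place.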

Repo Structure

├── data/
│   └── healthcare-dataset-stroke-data.csv
├── notebooks/
│   └── stroke_risk_analysis.ipynb    ← EDA, feature engineering, model development
├── plots/                            ← All visualizations (generated by notebook)
│   ├── eda_overview.png
│   ├── eda_distributions.png
│   ├── eda_categorical.png
│   ├── engineered_feature_rates.png
│   ├── roc_pr_curves.png
│   ├── threshold_optimization.png
│   ├── confusion_matrices.png
│   └── permutation_importance.png
├── app.py                            ← Streamlit inference app
├── train.py                          ← Standalone training script (saves pkl)
└── requirements.txt

Setup

pip install -r requirements.txt

# Launch the app directly — model trains at startup
streamlit run app.py

To train and save the model as a standalone pipeline file instead:

python train.py

Dataset

Stroke Prediction dataset, included in data/; no download needed.

Source: Kaggle — Stroke Prediction Dataset

Features: age, gender, hypertension, heart disease, marital status, work type, residence type, average glucose level, BMI, smoking status.

Class distribution: 4,860 no stroke (95.1%) / 249 stroke (4.9%)
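This imbalance is what the random oversampling step listed under Key Techniques addresses. The sketch below duplicates minority rows at random until the majority-to-minority ratio reaches 4:1. It uses only the standard library and is not the project's implementation. The function name, the seed, and the convention that label 1 is the minority (stroke) class are assumptions. It should be called on the training split only, after the train/test split.

```python
import math
import random


def random_oversample(rows, labels, ratio=4.0, seed=42):
    """Duplicate random minority rows until majority:minority is at most `ratio`.

    Assumes binary labels with 1 as the minority class and at least one minority row.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == 1]
    majority_idx = [i for i, y in enumerate(labels) if y == 0]

    # Number of minority rows needed to hit the target ratio.
    target_minority = math.ceil(len(majority_idx) / ratio)
    n_extra = max(0, target_minority - len(minority_idx))

    # Sample with replacement from the existing minority rows.
    extra_idx = [rng.choice(minority_idx) for _ in range(n_extra)]

    keep = list(range(len(labels))) + extra_idx
    rng.shuffle(keep)
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy training set with a 20:1 imbalance (40 negatives, 2 positives).
toy_rows = list(range(42))
toy_labels = [0] * 40 + [1] * 2
new_rows, new_labels = random_oversample(toy_rows, toy_labels, ratio=4.0)
print(new_labels.count(0), new_labels.count(1))
```

Only copies of existing minority rows are added, so no new information enters the training set; the resampling only changes how much weight the minority class receives during fitting.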

References

WHO. (2022). Cardiovascular diseases fact sheet.
American Diabetes Association. (2023). Standards of Medical Care in Diabetes.
Pedregosa et al. (2011). Scikit-learn: Machine Learning in Python. JMLR 12.
