Ansh Patel — github.com/anshcpatel11
End-to-end machine learning project predicting stroke risk from patient health records, with a deployed interactive Streamlit application. The model accepts patient demographics and clinical measurements and returns a calibrated stroke probability with a Low / Moderate / High risk tier.
Or run locally:
pip install -r requirements.txt
streamlit run app.py

Stroke is a leading cause of death and long-term disability worldwide. Early identification of high-risk patients enables preventive intervention such as medication, lifestyle changes, and closer monitoring. This project builds a stroke risk classifier on the UCI Stroke Prediction dataset (5,109 records), in which only 4.9% of patients experienced a stroke, making this a severely imbalanced binary classification problem.
- Clinical feature engineering — ADA glucose risk tiers (Normal / Prediabetic / Diabetic), WHO BMI categories, stroke-relevant age bands (Under 40 / 40-55 / 55-65 / Over 65), cardiovascular comorbidity score, and age × hypertension / age × heart disease interaction features
- Random oversampling — training set resampled to 4:1 majority-to-minority ratio to handle extreme 20:1 class imbalance; applied to training data only to prevent leakage
- Stacking ensemble — Logistic Regression + Random Forest + HistGradientBoosting with a Logistic Regression meta-learner trained via 5-fold stratified cross-validation
- Threshold optimization — precision-recall sweep across 200 thresholds; optimal found at 0.434
- Permutation importance — model-agnostic feature interpretation that avoids the high-cardinality bias of tree-based importance
- Streamlit deployment — interactive app with gauge chart, risk tier badge, and per-patient risk factor visualization
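The clinical feature engineering step can be sketched in plain Python as follows. This is an illustration, not the project's actual code: the `Patient` field names, the glucose cutoffs (ADA fasting plasma glucose criteria of 100 and 126 mg/dL), the WHO adult BMI cutoffs, the handling of age band boundaries, and the definition of the comorbidity score as a simple count are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # Hypothetical field names chosen for this sketch.
    age: float
    avg_glucose_level: float  # mg/dL
    bmi: float
    hypertension: int  # 0 or 1
    heart_disease: int  # 0 or 1


def glucose_tier(glucose: float) -> str:
    # Assumed cutoffs: ADA fasting plasma glucose criteria (100 and 126 mg/dL).
    if glucose < 100:
        return "Normal"
    if glucose < 126:
        return "Prediabetic"
    return "Diabetic"


def bmi_category(bmi: float) -> str:
    # Assumed cutoffs: standard WHO adult BMI classes.
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Normal"
    if bmi < 30:
        return "Overweight"
    return "Obese"


def age_band(age: float) -> str:
    # Band labels come from the list above; boundary handling is an assumption.
    if age < 40:
        return "Under 40"
    if age < 55:
        return "40-55"
    if age <= 65:
        return "55-65"
    return "Over 65"


def engineer_features(p: Patient) -> dict:
    return {
        "glucose_tier": glucose_tier(p.avg_glucose_level),
        "bmi_category": bmi_category(p.bmi),
        "age_band": age_band(p.age),
        # Assumed definition: number of cardiovascular conditions present.
        "comorbidity_score": p.hypertension + p.heart_disease,
        # Interaction terms: age counts only when the condition is present.
        "age_x_hypertension": p.age * p.hypertension,
        "age_x_heart_disease": p.age * p.heart_disease,
    }


example = engineer_features(
    Patient(age=67, avg_glucose_level=130.0, bmi=31.2, hypertension=1, heart_disease=0)
)
```

The interaction terms let a linear model give hypertension or heart disease a different weight at different ages, which a plain additive model cannot express.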
| Model | Accuracy | Precision | Recall | F1 | ROC-AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.7182 | 0.1281 | 0.82 | 0.2216 | 0.8275 |
| HistGradient Boosting | 0.8836 | 0.1947 | 0.44 | 0.2699 | 0.8186 |
| Stacking Ensemble | 0.8982 | 0.1786 | 0.30 | 0.2239 | 0.7864 |
| Stacking (opt. threshold=0.434) | 0.8963 | 0.2083 | 0.40 | 0.2740 | 0.7864 |
| Random Forest | 0.8102 | 0.1786 | 0.80 | 0.2920 | 0.8343 |
Random Forest achieved the best F1 (0.2920) and ROC-AUC (0.8343). Its recall of 0.80 means it identifies 80% of true stroke cases, second only to Logistic Regression (0.82), which has lower precision (0.1281 vs. 0.1786). Recall is the most clinically important metric in this context, where missing a high-risk patient is more costly than a false alarm.
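For readers who want to reproduce the F1 column or the threshold search, here is a minimal pure-Python sketch. It assumes the sweep selects the threshold that maximizes F1 (the selection criterion is not stated here), uses 200 evenly spaced candidates, and runs on made-up toy probabilities; the function names are hypothetical.

```python
def precision_recall_f1(y_true, y_prob, threshold):
    """Return (precision, recall, F1) when predicting 1 for prob >= threshold."""
    tp = fp = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = prob >= threshold
        if predicted and label == 1:
            tp += 1
        elif predicted and label == 0:
            fp += 1
        elif not predicted and label == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = f1_from(precision, recall) if precision + recall else 0.0
    return precision, recall, f1


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, y_prob, n_thresholds=200):
    """Sweep evenly spaced thresholds in (0, 1); keep the first F1 maximizer."""
    candidates = [(i + 1) / (n_thresholds + 1) for i in range(n_thresholds)]
    return max(candidates, key=lambda t: precision_recall_f1(y_true, y_prob, t)[2])


# Toy labels and predicted probabilities, invented for illustration only.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.45, 0.40, 0.30, 0.70, 0.90]
threshold = best_threshold(y_true, y_prob)

# Cross-check against the Random Forest row: precision 0.1786, recall 0.80.
rf_f1 = f1_from(0.1786, 0.80)  # about 0.292
```

In practice the sweep should be run on validation data rather than the test set, so that the reported test metrics are not tuned to the chosen threshold.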
├── data/
│ └── healthcare-dataset-stroke-data.csv
├── notebooks/
│ └── stroke_risk_analysis.ipynb ← EDA, feature engineering, model development
├── plots/ ← All visualizations (generated by notebook)
│ ├── eda_overview.png
│ ├── eda_distributions.png
│ ├── eda_categorical.png
│ ├── engineered_feature_rates.png
│ ├── roc_pr_curves.png
│ ├── threshold_optimization.png
│ ├── confusion_matrices.png
│ └── permutation_importance.png
├── app.py ← Streamlit inference app
├── train.py ← Standalone training script (saves pkl)
└── requirements.txt
pip install -r requirements.txt
# Launch the app directly (the model trains at startup)
streamlit run app.py

To train and save the model as a standalone pipeline file instead:
python train.py

UCI Stroke Prediction dataset: included in data/, no download needed.
Source: Kaggle — Stroke Prediction Dataset
Features: age, gender, hypertension, heart disease, marital status, work type, residence type, average glucose level, BMI, smoking status.
Class distribution: 4,860 no stroke (95.1%) / 249 stroke (4.9%)
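With 4,860 negatives and 249 positives, the raw imbalance is about 19.5:1. The random oversampling used in this project brings the training split to a 4:1 ratio; the sketch below shows one way to do that in plain Python. The function name, the toy data, and the assumption that the minority class is labeled `1` are illustrative, and in a real workflow this would run on the training split only.

```python
import random


def oversample_minority(rows, labels, ratio=4.0, seed=0):
    """Duplicate random minority rows until majority:minority is about ratio:1."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == 1]
    majority_idx = [i for i, y in enumerate(labels) if y == 0]

    # Number of minority rows needed to reach the requested ratio.
    target_minority = round(len(majority_idx) / ratio)
    n_extra = max(0, target_minority - len(minority_idx))

    # Sample with replacement from the existing minority rows.
    extra_idx = rng.choices(minority_idx, k=n_extra)

    keep = list(range(len(labels))) + extra_idx
    rng.shuffle(keep)
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy training split with a 20:1 imbalance, invented for illustration.
train_labels = [0] * 400 + [1] * 20
train_rows = [{"id": i} for i in range(len(train_labels))]

res_rows, res_labels = oversample_minority(train_rows, train_labels)
```

Because the extra rows are exact duplicates, oversampling before the train/test split would place copies of the same patient on both sides, which is why it is restricted to training data.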
- WHO. (2022). Cardiovascular diseases fact sheet.
- American Diabetes Association. (2023). Standards of Medical Care in Diabetes.
- Pedregosa et al. (2011). Scikit-learn: Machine Learning in Python. JMLR 12.