# 🧠 Stroke Prediction Analysis

A machine learning project to predict the likelihood of stroke using health data. This project showcases end-to-end data science skills from EDA to deployment, with a focus on real-world healthcare applications.
Enter health information such as age, glucose level, BMI, etc., and get an instant stroke risk prediction using a machine learning model trained on real data.
## 📌 Table of Contents

- Dataset
- Exploratory Data Analysis
- Smoking Status Analysis
- Residence & Work Type Analysis
- Data Preprocessing
- Model Building
- Deployment
- Folder Structure
- Author
## 📊 Dataset

- **Source:** Kaggle - Stroke Prediction Dataset
- **Records:** 5,110
- **Target variable:** `stroke` (1 = stroke occurred, 0 = no stroke)

⚠️ **Class imbalance:** only ~4.9% of patients had a stroke.
## 🔍 Exploratory Data Analysis

Initial steps:
- Checked column types, missing values, and value counts
- Identified 201 missing values in the `bmi` column
- Detected significant class imbalance (~95% no stroke)
- Created visualizations:
  - Stroke count distribution
  - Stroke distribution by gender
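These checks can be sketched with pandas. The tiny frame below is illustrative stand-in data, not the real Kaggle file (which the project would load with `pd.read_csv`):

```python
import pandas as pd

# A few illustrative rows standing in for the Kaggle CSV
df = pd.DataFrame({
    "age": [67, 61, 80, 49],
    "avg_glucose_level": [228.69, 202.21, 105.92, 171.23],
    "bmi": [36.6, None, 32.5, 34.4],  # missing values appear in `bmi`
    "stroke": [1, 1, 0, 0],
})

df.info()                     # column types and non-null counts
print(df.isna().sum())        # per-column missing values
print(df["stroke"].value_counts(normalize=True))  # class balance
```

On the full dataset, `df.isna().sum()` reveals the 201 missing `bmi` values and `value_counts(normalize=True)` the ~95/5 class split.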
## 🚬 Smoking Status Analysis

Analyzed stroke prevalence across four categories: "formerly smoked", "never smoked", "smokes", and "Unknown".

- "Never smoked" is the largest group.
- "Formerly smoked" shows a slightly higher stroke proportion.
- "Smokes" has fewer stroke cases than expected.
- "Unknown" still includes stroke cases.

💡 **Insight:** stroke risk spans all smoking categories, confirming that stroke is multifactorial.
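Per-category stroke rates like these come from a simple groupby; the rows below are illustrative, not the real data:

```python
import pandas as pd

# Illustrative rows; the real data has thousands per category
df = pd.DataFrame({
    "smoking_status": ["never smoked", "never smoked", "formerly smoked",
                       "smokes", "Unknown", "formerly smoked"],
    "stroke": [0, 0, 1, 0, 0, 0],
})

# Mean of the 0/1 target within each category = stroke rate
rates = df.groupby("smoking_status")["stroke"].mean()
print(rates)
```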
## 🏙️ Residence & Work Type Analysis

**Residence type**
- Urban stroke rate: ~5.2%
- Rural stroke rate: ~4.5%
- 💡 Slightly higher in urban areas, possibly due to stress or better access to diagnosis.

**Work type**
- Self-employed: highest stroke proportion (~7.9%)
- Private/Govt jobs: ~5%
- Children: minimal strokes
- 💡 Self-employed individuals may face more health risks due to irregular schedules or limited access to care.
## ⚙️ Data Preprocessing

Steps completed:
- ✅ Filled missing `bmi` values using median imputation
- ✅ Encoded categorical variables using `pd.get_dummies`
- ✅ Normalized numeric features with `StandardScaler`
- ✅ Split the dataset into train/test sets (80/20)
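A minimal sketch of these four steps, using a stand-in frame (the real frame has all dataset columns):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in for the full dataset
df = pd.DataFrame({
    "age": [67, 61, 80, 49, 79, 81],
    "bmi": [36.6, None, 32.5, 34.4, 24.0, None],
    "work_type": ["Private", "Self-employed", "Private",
                  "Govt_job", "Self-employed", "Private"],
    "stroke": [1, 1, 0, 0, 1, 0],
})

# 1. Median imputation for `bmi`
df["bmi"] = df["bmi"].fillna(df["bmi"].median())

# 2. One-hot encode categoricals
df = pd.get_dummies(df, columns=["work_type"])

# 3. 80/20 train/test split
X = df.drop(columns="stroke")
y = df["stroke"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_test = X_train.copy(), X_test.copy()

# 4. Scale numeric features -- fit on train only, reuse on test
scaler = StandardScaler()
X_train[["age", "bmi"]] = scaler.fit_transform(X_train[["age", "bmi"]])
X_test[["age", "bmi"]] = scaler.transform(X_test[["age", "bmi"]])
```

Fitting the scaler on the training split only (and reusing it on the test split) avoids leaking test-set statistics into training.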
## 🤖 Model Building

**Challenges:**
- Severe class imbalance (~95% no-stroke vs ~5% stroke)
- Baseline models had high accuracy but failed to detect stroke cases

**Handling imbalance with SMOTE:** used SMOTE to synthetically create more stroke samples in the training set.
- Balanced training set: 50% stroke / 50% no-stroke
- Enabled models to learn patterns in the minority class
### Logistic Regression (with SMOTE)

| Metric | Value |
|---|---|
| Accuracy | 74.85% |
| Recall (Stroke=1) | 0.80 ✅ |
| Precision | 0.14 |
| F1-Score | 0.24 |

🎯 Best for real-world screening, where false negatives are costly.
✅ Recommended for healthcare settings.
### Random Forest (with SMOTE)

| Metric | Value |
|---|---|
| Accuracy | 91.78% ✅ |
| Recall (Stroke=1) | 0.14 ⚠️ |
| Precision | 0.15 |
| F1-Score | 0.14 |
**Model comparison:**
- Random Forest achieves higher accuracy.
- Logistic Regression is better at identifying stroke patients, which is critical in healthcare.
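Why accuracy misleads here can be seen with a trivial baseline: on data with ~5% stroke prevalence, a model that predicts "no stroke" for everyone looks accurate but finds zero stroke cases (illustrative arrays):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 5 stroke cases out of 100, matching ~5% prevalence
y_true = np.array([1] * 5 + [0] * 95)
y_pred = np.zeros(100, dtype=int)  # always predict "no stroke"

print(accuracy_score(y_true, y_pred))  # 0.95 -- looks great
print(recall_score(y_true, y_pred))    # 0.0  -- misses every stroke
```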
Final models were saved using `joblib`:

```python
joblib.dump(rf_smote, 'models/random_forest_stroke_model.joblib')
joblib.dump(scaler, 'models/standard_scaler.joblib')
```
## 🚀 Deployment
### 🌐 Live Demo
The stroke prediction web app is live and can be accessed here:
🔗 [https://stroke-prediction-analysis.onrender.com](https://stroke-prediction-analysis.onrender.com)
### 🛠️ How to Use
1. Go to the [Live App](https://stroke-prediction-analysis.onrender.com/)
2. Enter health details like age, BMI, glucose levels, etc.
3. Click **"Predict Stroke Risk"**
4. See the result: 🚨 **"Stroke Risk"** or **"No Stroke"**
### ⚙️ Custom Threshold
Lowered the prediction threshold to `0.3` (from default `0.5`) to **prioritize recall** and catch more true stroke cases β important in healthcare where false negatives are riskier than false positives.
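Thresholding amounts to cutting predicted probabilities at 0.3 instead of 0.5; here `proba` stands in for the model's `predict_proba(X)[:, 1]` output:

```python
import numpy as np

THRESHOLD = 0.3  # lowered from the default 0.5 to favor recall

# Stand-in for model.predict_proba(X)[:, 1]: probability of stroke
proba = np.array([0.05, 0.35, 0.62, 0.28])

# model.predict(X) would apply a 0.5 cutoff; 0.35 would be "no stroke".
# At 0.3 it is flagged, trading precision for recall.
preds = (proba >= THRESHOLD).astype(int)
print(preds)  # [0 1 1 0]
```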
### 📈 ROC-AUC Evaluation
To further assess the model's ability to distinguish between stroke and non-stroke cases, we plotted the **Receiver Operating Characteristic (ROC) Curve** and calculated the **Area Under the Curve (AUC)**.
#### ✅ Cross-Validated AUC (on SMOTE-resampled training data)

```
[0.9840, 0.9981, 0.9966, 0.9972, 0.9967]
Mean AUC Score: 0.9945
```
This high score indicates that the model is **very effective** at separating stroke from non-stroke classes in training, thanks to balanced resampling using **SMOTE**.
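Scores like these can be produced with `cross_val_score`; the sketch below substitutes synthetic balanced data for the SMOTE-resampled training set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic balanced data standing in for the resampled training set
X, y = make_classification(n_samples=200, weights=[0.5, 0.5],
                           random_state=42)

# 5-fold cross-validation scored by ROC-AUC
scores = cross_val_score(RandomForestClassifier(random_state=42),
                         X, y, cv=5, scoring="roc_auc")
print(scores, scores.mean())
```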
#### 🧪 ROC Curve on Test Set
- **AUC = 0.76**
- Shows solid performance on unseen data.
- While lower than the training score (as expected), it still reflects **meaningful predictive power** in identifying stroke risks.
> ℹ️ **Why AUC drops on test data**: SMOTE balances only the training data. The test set retains the original class imbalance, making it harder to predict stroke cases. This drop is expected and reflects real-world conditions.
ROC curve:
<img src="notebooks/roc_curve.png" width="500"/>
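The curve and its area come from scikit-learn's `roc_curve` and `auc`; the arrays below are stand-ins for the real test labels and predicted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Stand-ins for the test labels and predicted stroke probabilities
y_test = np.array([0, 0, 1, 0, 1, 0, 0, 1])
proba = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.15, 0.3, 0.7])

# False-positive rate, true-positive rate at every threshold
fpr, tpr, thresholds = roc_curve(y_test, proba)
print(auc(fpr, tpr))  # area under the curve, here ~0.93
```

Plotting `fpr` against `tpr` with matplotlib reproduces the figure above.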
## 📁 Folder Structure

```
project/
│
├── data/              # Raw dataset
├── notebooks/         # EDA and modeling notebooks
├── scripts/           # Preprocessing and training scripts
├── templates/         # HTML templates for Flask app
├── models/            # Saved .joblib models
├── app.py             # Flask backend
├── README.md          # Project summary and documentation
└── requirements.txt   # Python dependencies
```
---
## 🧰 Tech Stack
- Python 3.11
- Flask
- Scikit-learn
- SMOTE (Imbalanced-learn)
- Pandas, NumPy, Matplotlib
- Hosted on Render
## 📚 What I Learned
This project deepened my understanding of real-world data science workflows, particularly in healthcare prediction problems. Key takeaways:
- **End-to-End Pipeline:** I built a full ML pipeline from EDA → preprocessing → model training → deployment, reinforcing the importance of structure and iteration.
- **Class Imbalance & SMOTE:** I learned how imbalanced datasets can lead to misleading model performance, and how SMOTE can help improve recall, a critical metric in healthcare applications.
- **Evaluation Trade-offs:** I saw that accuracy isn't always the best metric. In this case, **recall** was prioritized because missing a stroke case is more dangerous than a false positive.
- **Model Deployment:** I implemented my first Flask app and deployed it using **Render**, gaining confidence in converting notebooks into usable applications.
- **Git & Version Control:** Managed my project with Git and GitHub, learning to track experiments, sync changes, and maintain a clean structure.
- **Communication:** Writing markdown summaries helped me practice translating technical insights into business-relevant language.

This was more than just a machine learning task: it was a full-stack learning experience.
## 🤝 Author

**Vivek Raghunathan**
- 🔗 LinkedIn
- 💻 GitHub: @vivekr25