vivekr25/stroke-prediction-analysis

🧠 Stroke Prediction Analysis

A machine learning project to predict the likelihood of stroke using health data. This project showcases end-to-end data science skills from EDA to deployment, with a focus on real-world healthcare applications.

Live Demo GitHub Repo

Enter health information such as age, glucose level, BMI, etc., and get an instant stroke risk prediction using a machine learning model trained on real data.

📁 Dataset

  • Source: Kaggle - Stroke Prediction Dataset
  • Records: 5,110
  • Target Variable: stroke (1 = stroke occurred, 0 = no stroke)
  • ⚠️ Class Imbalance: Only ~4.9% of patients had a stroke

📊 Exploratory Data Analysis (EDA)

Initial steps involved:

  • Checked column types, missing values, and value counts
  • Identified 201 missing values in the bmi column
  • Detected significant class imbalance (~95% no stroke)
  • Created visualizations:
    • Stroke Count Distribution
    • Stroke Distribution by Gender
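A minimal sketch of these checks with pandas. The six-row frame is made up and stands in for the Kaggle CSV; only the `bmi` and `stroke` column names come from the description above, and everything else is an illustrative assumption.

```python
import pandas as pd

# Toy stand-in for the dataset; all values are invented for illustration.
df = pd.DataFrame({
    "age": [67, 45, 80, 23, 51, 38],
    "bmi": [36.6, None, 32.5, 24.0, None, 28.1],
    "gender": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "stroke": [1, 0, 0, 0, 0, 0],
})

# Column types and number of missing values per column
dtypes = df.dtypes
missing_per_column = df.isna().sum()

# Share of each class in the target; a heavy skew signals class imbalance
class_share = df["stroke"].value_counts(normalize=True)

print(dtypes)
print(missing_per_column)
print(class_share)
```

On the real data, the same two calls would surface the missing `bmi` values and the roughly 95/5 class split noted above.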

🚬 Stroke Distribution by Smoking Status

Analyzed stroke prevalence across:
"formerly smoked", "never smoked", "smokes", and "unknown".

🔍 Observations:

  • "Never smoked" is the largest group.
  • "Formerly smoked" shows slightly higher stroke proportion.
  • "Smokes" has fewer stroke cases than expected.
  • "Unknown" still includes stroke cases.

💡 Insight: Stroke cases appear in every smoking category, which is consistent with stroke being multifactorial.


🏘️ Residence & Work Type Analysis

🏑 Residence Type:

  • Urban stroke rate: ~5.2%
  • Rural stroke rate: ~4.5%
  • 💡 Slightly higher in urban areas, possibly due to stress or better access to diagnosis.

💼 Work Type:

  • Self-employed: Highest stroke proportion (~7.9%)
  • Private/Govt jobs: ~5%
  • Children: Minimal strokes
  • 💡 Self-employed individuals may experience more health risks due to irregular schedules or lack of access to care.
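Per-group stroke rates like these can be computed with a single `groupby`. The sketch below uses invented rows, and the column name `work_type` and its category labels are assumptions about the dataset's schema.

```python
import pandas as pd

# Made-up rows for illustration only; the real analysis uses the Kaggle dataset.
df = pd.DataFrame({
    "work_type": ["Private", "Private", "Self-employed", "Self-employed",
                  "Govt_job", "children", "Private", "Self-employed"],
    "stroke":    [0, 1, 1, 0, 0, 0, 0, 1],
})

# Because stroke is coded 0/1, the group mean equals the stroke rate in that group.
stroke_rate = (
    df.groupby("work_type")["stroke"]
      .mean()
      .sort_values(ascending=False)
)

print((stroke_rate * 100).round(1))
```

The same pattern applies to residence type or smoking status by swapping the grouping column.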

🧼 Data Preprocessing

Steps completed:

  • ✅ Filled missing bmi values using median imputation
  • ✅ Encoded categorical variables using pd.get_dummies
  • ✅ Standardized numeric features with StandardScaler
  • ✅ Split dataset into train/test (80/20)
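The steps above can be sketched as follows. This is an illustration on invented data, not the project's actual script. It splits before fitting the imputer and scaler, an ordering assumed here so that no test-set statistics leak into training; the list above does not say which order the notebooks used. Column names other than `bmi` and `stroke` are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data; values and the column subset are illustrative assumptions.
df = pd.DataFrame({
    "age": [67.0, 45.0, 80.0, 23.0, 51.0, 38.0, 72.0, 59.0, 33.0, 64.0],
    "avg_glucose_level": [228.7, 95.1, 105.9, 80.4, 171.2,
                          88.0, 190.3, 110.5, 76.2, 140.8],
    "bmi": [36.6, None, 32.5, 24.0, None, 28.1, 30.2, 27.4, 22.9, 31.0],
    "smoking_status": ["formerly smoked", "never smoked", "never smoked",
                       "Unknown", "smokes", "never smoked", "formerly smoked",
                       "smokes", "Unknown", "never smoked"],
    "stroke": [1, 0, 1, 0, 0, 0, 1, 0, 0, 0],
})
numeric_cols = ["age", "avg_glucose_level", "bmi"]

# One-hot encode the categorical column
X = pd.get_dummies(df.drop(columns="stroke"), columns=["smoking_status"])
y = df["stroke"]

# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# Median imputation: learn the median on the training rows only
bmi_median = X_train["bmi"].median()
X_train["bmi"] = X_train["bmi"].fillna(bmi_median)
X_test["bmi"] = X_test["bmi"].fillna(bmi_median)

# Standardize numeric features: fit on train, reuse the same scaler on test
scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])
```

Keeping the fitted `scaler` object around matters later: the same transformation has to be applied to any input the deployed model sees.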

🤖 Model Building

⚠️ Problem:

  • Severe class imbalance (~95% no-stroke vs ~5% stroke)
  • Baseline models had high accuracy but failed to detect stroke cases
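The effect is easy to reproduce with plain arithmetic. In this sketch, a made-up sample of 100 patients with 5 strokes (close to the dataset's ratio) is scored against a "model" that always predicts no stroke.

```python
# Illustrative labels: 5 stroke cases among 100 patients.
y_true = [1] * 5 + [0] * 95

# A trivial baseline that always predicts the majority class (no stroke).
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the label
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of actual stroke cases that were detected
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")  # accuracy=0.95, recall=0.00
```

A 95% accuracy here says nothing about stroke detection, which is why recall on the stroke class is the metric tracked in the comparison.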

🔄 SMOTE: Synthetic Minority Over-sampling Technique

Used SMOTE to synthetically create more stroke samples in the training set.

✅ Result:

  • Balanced training set: 50% stroke / 50% no-stroke
  • Enabled models to learn patterns in the minority class
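The core idea of SMOTE is interpolation: each synthetic sample lies on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified, dependency-free illustration of that idea, not imbalanced-learn's implementation; the function name `smote_like_oversample` and the toy points are hypothetical.

```python
import random

def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic

# Toy minority class: (age, avg_glucose_level) pairs, invented for illustration
minority = [(67.0, 228.0), (80.0, 105.0), (74.0, 170.0), (59.0, 190.0)]
majority_count = 10

new_points = smote_like_oversample(minority, n_new=majority_count - len(minority))
balanced_minority = minority + new_points
print(len(balanced_minority), majority_count)  # 10 10
```

Because every synthetic point is derived from existing training rows, resampling should be applied to training data only, never to the data used for final evaluation.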

📈 Model Comparison

🔹 Logistic Regression (with SMOTE)

| Metric | Value |
| --- | --- |
| Accuracy | 74.85% |
| Recall (Stroke=1) | 0.80 ✅ |
| Precision | 0.14 |
| F1-Score | 0.24 |

🎯 Best for real-world screening where false negatives are costly
✅ Recommended for healthcare settings


🌲 Random Forest (with SMOTE)

| Metric | Value |
| --- | --- |
| Accuracy | 91.78% ✅ |
| Recall (Stroke=1) | 0.14 ❌ |
| Precision | 0.15 |
| F1-Score | 0.14 |

⚠️ Prioritizes overall accuracy, but misses most stroke cases


🧠 Takeaway:

  • Random Forest has better accuracy
  • But Logistic Regression is better for identifying stroke patients, which is critical in healthcare
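The F1 scores reported for both models follow from their precision and recall, since F1 is the harmonic mean of the two. A short check, using the reported figures as inputs:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported for each model
logistic_f1 = f1_score(precision=0.14, recall=0.80)
forest_f1 = f1_score(precision=0.15, recall=0.14)

print(round(logistic_f1, 2), round(forest_f1, 2))  # 0.24 0.14
```

The low precision is the price of the high recall: at a precision of 0.14, roughly one in seven patients flagged by the logistic regression model actually had a stroke.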

💾 Model Saving

Final models were saved using joblib:

import joblib

joblib.dump(rf_smote, 'models/random_forest_stroke_model.joblib')  # SMOTE-trained Random Forest
joblib.dump(scaler, 'models/standard_scaler.joblib')  # fitted StandardScaler

🚀 Deployment

## 🚀 Live Demo

The stroke prediction web app is live and can be accessed here:  
🔗 [https://stroke-prediction-analysis.onrender.com](https://stroke-prediction-analysis.onrender.com)

### 🛠️ How to Use
1. Go to the [Live App](https://stroke-prediction-analysis.onrender.com/)
2. Enter health details like age, BMI, glucose levels, etc.
3. Click **"Predict Stroke Risk"**
4. See the result: 🚨 "Stroke Risk" or 😊 "No Stroke"

⚙️ **Custom Threshold**
Lowered the prediction threshold to `0.3` (from the default `0.5`) to **prioritize recall** and catch more true stroke cases. This matters in healthcare, where false negatives are riskier than false positives.
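A minimal sketch of what a custom threshold does. The probabilities are hypothetical, and treating a score exactly equal to the threshold as positive is an assumption, not something the app is known to do.

```python
def classify(probabilities, threshold=0.5):
    """Label a case 1 (stroke risk) when its predicted probability
    is at or above the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Hypothetical predicted stroke probabilities for five patients
probs = [0.10, 0.28, 0.35, 0.45, 0.62]

default_labels = classify(probs)                 # [0, 0, 0, 0, 1]
lowered_labels = classify(probs, threshold=0.3)  # [0, 0, 1, 1, 1]

print(default_labels)
print(lowered_labels)
```

With a scikit-learn classifier, the equivalent would be comparing `model.predict_proba(X)[:, 1]` against `0.3` instead of calling `model.predict(X)`. Lowering the threshold flags two additional patients here, which raises recall at the cost of more false positives.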

### 📈 ROC-AUC Evaluation

To further assess the model's ability to distinguish between stroke and non-stroke cases, we plotted the **Receiver Operating Characteristic (ROC) Curve** and calculated the **Area Under the Curve (AUC)**.

#### ✅ Cross-validated AUC scores (on SMOTE-resampled training data):
[0.9840, 0.9981, 0.9966, 0.9972, 0.9967]
Mean AUC Score: 0.9945

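This high score shows that the model separates stroke from non-stroke cases very well on the resampled training data. Because the folds were drawn after **SMOTE** resampling, validation folds can contain synthetic points interpolated from samples in the training folds, so this figure is likely optimistic.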

#### 🧪 ROC Curve on Test Set:
- **AUC = 0.76**
- Shows moderate discrimination on unseen data.
- While lower than the cross-validated training score (as expected), it still reflects **meaningful predictive power** in identifying stroke risk.

> ℹ️ **Why AUC drops on test data**: SMOTE is applied only to the training data, so the cross-validated score is measured on resampled data that includes synthetic stroke cases, while the test set contains only real patients at the original class ratio. The test AUC is therefore the more realistic estimate of real-world performance.

ROC Curve:

<img src="notebooks/roc_curve.png" width="500"/>
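AUC can be read as the probability that a randomly chosen stroke case receives a higher score than a randomly chosen non-stroke case. The sketch below computes it directly from that definition on hypothetical labels and scores; it is a plain-Python illustration, not the evaluation code used in the notebooks.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive case gets the higher score; ties count as half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.70, 0.30, 0.60, 0.25, 0.20, 0.15, 0.10, 0.05]

# 11 of the 12 positive/negative pairs are ordered correctly
auc = roc_auc(y_true, scores)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading the test-set value of 0.76.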


project/
│
├── data/                  # Raw dataset
├── notebooks/             # EDA and modeling notebooks
├── scripts/               # Preprocessing and training scripts
├── templates/             # HTML templates for Flask app
├── models/                # Saved .joblib models
├── app.py                 # Flask backend
├── README.md              # Project summary and documentation
└── requirements.txt       # Python dependencies

---

## 🧪 How to Use

1. Visit the [Live App](https://stroke-prediction-analysis.onrender.com/)
2. Enter patient info into the form
3. Submit to receive real-time stroke risk prediction (based on trained ML model)

---

## ⚙️ Tech Stack

- Python 3.11
- Flask
- Scikit-learn
- SMOTE (Imbalanced-learn)
- Pandas, NumPy, Matplotlib
- Hosted on Render

## 🎓 What I Learned

This project deepened my understanding of real-world data science workflows, particularly in healthcare prediction problems. Key takeaways:

- **End-to-End Pipeline:** I built a full ML pipeline from EDA → preprocessing → model training → deployment, reinforcing the importance of structure and iteration.

- **Class Imbalance & SMOTE:** I learned how imbalanced datasets can lead to misleading model performance and how SMOTE can help improve recall, a critical metric in healthcare applications.

- **Evaluation Trade-offs:** I learned that accuracy isn't always the best metric. In this case, **recall** was prioritized because missing a stroke case is more dangerous than a false positive.

- **Model Deployment:** I implemented my first Flask app and deployed it using **Render**, gaining confidence in converting notebooks into usable applications.

- **Git & Version Control:** Managed my project with Git and GitHub, learning to track experiments, sync changes, and maintain a clean structure.

- **Communication:** Writing markdown summaries helped me practice translating technical insights into business-relevant language.

This was more than just a machine learning task; it was a full-stack learning experience.

👤 Author

Vivek Raghunathan
🔗 LinkedIn
💻 GitHub: @vivekr25
