This repository contains a stroke risk prediction project built in R using decision tree models. It includes:
- A cleaned R Markdown file with full code and explanations
- A polished case study summarizing the project for non-technical audiences
The goal is to demonstrate how interpretable machine learning models like decision trees can help healthcare professionals identify at-risk patients early.
- Decision Tree Modeling: Compares a full model with all predictors to a simplified model using only the top variables
- Variable Importance Analysis: Identifies key health and demographic factors linked to stroke
- ROC & AUC Evaluation: Shows model performance visually and numerically
- Imbalance Handling: Uses upsampling to better detect stroke cases
- Interpretability: Extracts and explains model rules for clinical application
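
As a rough illustration of how the modeling, importance, and AUC steps fit together, the sketch below fits a full tree, keeps the top-ranked predictors, refits a simplified tree, and compares test AUC. It runs on synthetic data: the column names, the simulated risk coefficients, the 70/30 split, the cp value, and the choice of three top variables are all assumptions made for illustration, not the project's actual settings.

```r
library(rpart)
library(pROC)

set.seed(42)
n <- 2000

# Synthetic stand-in data (assumed columns, NOT the project's dataset).
toy <- data.frame(
  age          = round(runif(n, 18, 90)),
  avg_glucose  = round(rnorm(n, 105, 40), 1),
  hypertension = rbinom(n, 1, 0.1),
  bmi          = round(rnorm(n, 28, 6), 1),
  noise        = rnorm(n)
)
risk <- plogis(-9 + 0.1 * toy$age + 0.01 * toy$avg_glucose + 0.8 * toy$hypertension)
toy$stroke <- factor(ifelse(runif(n) < risk, "yes", "no"), levels = c("no", "yes"))

# Hold out 30% of rows for evaluation.
idx   <- sample(n, round(0.7 * n))
train <- toy[idx, ]
test  <- toy[-idx, ]

# Full model: every predictor.
ctrl     <- rpart.control(cp = 0.005)
full_fit <- rpart(stroke ~ ., data = train, method = "class", control = ctrl)

# Rank predictors by rpart's built-in importance and keep the top few.
imp      <- full_fit$variable.importance
top_vars <- names(imp)[seq_len(min(3, length(imp)))]

# Simplified model: top predictors only.
simple_fit <- rpart(reformulate(top_vars, response = "stroke"),
                    data = train, method = "class", control = ctrl)

# Test-set AUC from the predicted probability of the "yes" class.
auc_of <- function(fit) {
  p <- predict(fit, newdata = test, type = "prob")[, "yes"]
  as.numeric(auc(roc(test$stroke, p, levels = c("no", "yes"),
                     direction = "<", quiet = TRUE)))
}

print(imp)
print(c(full = auc_of(full_fit), simplified = auc_of(simple_fit)))
print(simple_fit)  # text view of the split rules
```

Printing the fitted tree gives a plain-text listing of its splits, which is one simple way to read off rules; rpart.plot offers a graphical view of the same tree.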
- stroke_model_comparison.Rmd: Full R Markdown with data prep, modeling, evaluation
- Stroke_Risk_Prediction_Case_Study.docx: Word case study summary
- stroke dataset.xlsx: Sample dataset
- Download the files.
- Open the stroke_model_comparison.Rmd in RStudio.
- Ensure required packages are installed:
  install.packages(c("readxl", "dplyr", "caret", "rpart", "rpart.plot", "pROC", "corrplot"))
- Load the dataset (update path in the R Markdown if needed).
- Knit to HTML or Word to view full analysis.
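
Before knitting, it can help to confirm that the packages and the data file are in place. The following is a minimal sketch, assuming the workbook sits in the working directory under the file name listed in this repository; the variable names (required, missing_pkgs, data_path, stroke_raw) are illustrative and do not come from the R Markdown.

```r
# Packages named in the install step.
required <- c("readxl", "dplyr", "caret", "rpart", "rpart.plot", "pROC", "corrplot")

# Report any that are not installed yet.
missing_pkgs <- setdiff(required, rownames(installed.packages()))
if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install only what is missing
}

# Assumed location of the workbook; adjust to match your setup.
data_path <- "stroke dataset.xlsx"

if (!file.exists(data_path)) {
  message("Dataset not found at '", data_path, "'; update the path.")
} else if (!requireNamespace("readxl", quietly = TRUE)) {
  message("readxl is not installed; cannot read the workbook.")
} else {
  stroke_raw <- readxl::read_excel(data_path)
  str(stroke_raw)  # quick look at column names and types
}
```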
- Age, glucose level, and hypertension are the strongest predictors of stroke.
- The simplified decision tree (top 6 predictors) slightly improved AUC (0.83) and made interpretation easier for clinicians.
- Upsampling improved the model's ability to detect stroke cases in an imbalanced dataset.
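
The upsampling finding can be explored with a sketch like the one below, which uses caret's upSample on the training split only and compares sensitivity on an untouched test split. The data are synthetic, and the column names, simulated risk formula, and 70/30 split are assumptions for illustration; the numbers it prints are not the project's results.

```r
library(caret)
library(rpart)

set.seed(1)
n <- 3000

# Synthetic, imbalanced stand-in data (assumed columns, NOT the project's dataset).
toy <- data.frame(
  age          = round(runif(n, 18, 90)),
  hypertension = rbinom(n, 1, 0.1)
)
risk <- plogis(-8 + 0.08 * toy$age + 0.6 * toy$hypertension)
toy$stroke <- factor(ifelse(runif(n) < risk, "yes", "no"), levels = c("no", "yes"))

idx   <- sample(n, round(0.7 * n))
train <- toy[idx, ]
test  <- toy[-idx, ]

# Upsample the minority class in the TRAINING data only;
# the test set keeps its original class balance.
train_up <- upSample(
  x     = train[, setdiff(names(train), "stroke")],
  y     = train$stroke,
  yname = "stroke"
)

print(table(train$stroke))     # original class counts
print(table(train_up$stroke))  # balanced class counts

fit_raw <- rpart(stroke ~ ., data = train,    method = "class")
fit_up  <- rpart(stroke ~ ., data = train_up, method = "class")

# Sensitivity: share of true stroke cases the model flags as "yes".
sensitivity_of <- function(fit) {
  pred <- predict(fit, newdata = test, type = "class")
  sum(pred == "yes" & test$stroke == "yes") / sum(test$stroke == "yes")
}

print(c(raw = sensitivity_of(fit_raw), upsampled = sensitivity_of(fit_up)))
```

Resampling before the train/test split would leak duplicated rows into the test set, which is why the sketch balances only the training data.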
- Add Random Forest or Gradient Boosting models for comparison.
- Integrate SMOTE or cost-sensitive methods for even better sensitivity.
- Package best model into a Shiny app for hospitals and clinics.
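
For the cost-sensitive idea, one possible starting point is rpart's loss matrix, sketched below on synthetic data. The 5:1 penalty for missed stroke cases, the column names, and the simulated risk formula are assumptions chosen for illustration; a real cost ratio would need clinical input.

```r
library(rpart)

set.seed(7)
n <- 3000

# Synthetic, imbalanced stand-in data (assumed columns, NOT the project's dataset).
toy <- data.frame(
  age          = round(runif(n, 18, 90)),
  hypertension = rbinom(n, 1, 0.1)
)
risk <- plogis(-8 + 0.08 * toy$age + 0.6 * toy$hypertension)
toy$stroke <- factor(ifelse(runif(n) < risk, "yes", "no"), levels = c("no", "yes"))

# Loss matrix: rows = actual class, columns = predicted class (order: no, yes).
# A missed stroke (actual yes, predicted no) is assumed to cost 5x a false alarm.
loss <- matrix(c(0, 1,
                 5, 0), nrow = 2, byrow = TRUE)

fit_plain <- rpart(stroke ~ ., data = toy, method = "class")
fit_cost  <- rpart(stroke ~ ., data = toy, method = "class",
                   parms = list(loss = loss))

# Fitted classes on the training data, for a quick side-by-side look.
pred_plain <- predict(fit_plain, type = "class")
pred_cost  <- predict(fit_cost,  type = "class")

print(table(actual = toy$stroke, plain = pred_plain))
print(table(actual = toy$stroke, cost_sensitive = pred_cost))
```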
- R: Core language for analysis
- caret: Modeling & cross-validation
- rpart & rpart.plot: Decision tree modeling & visualization
- pROC: ROC curve and AUC calculation
- corrplot: Correlation heatmap
The dataset used in this project comes from Kaggle:
Stroke Prediction Dataset by fedesoriano
License: The dataset is publicly available on Kaggle for educational and research use. Credit: Created by Kaggle user fedesoriano.
This project was completed by Tori Green as part of a portfolio in data science and healthcare analytics. Contributions and feedback are welcome.