A comprehensive machine learning project that predicts diabetes risk using health indicators and symptoms. This project follows a complete ML engineering workflow from data exploration to deployment-ready models.
- Objective: Develop an accurate binary classification model to predict diabetes risk
- Dataset: Early stage diabetes risk prediction dataset from UCI ML Repository
- Models: 8 different classification algorithms compared systematically
- Approach: Complete ML engineering workflow with proper train/validation/test splits
- Comprehensive EDA: Deep data exploration with visualizations
- Multiple Models: Decision Tree, Naive Bayes, Logistic Regression, KNN, SVM, Neural Network, Random Forest, Gradient Boosting
- Proper Validation: 60/20/20 train/validation/test split, so models are selected on the validation set and the test set stays untouched for a final check against overfitting
- Medical Metrics: Focus on sensitivity/specificity for healthcare applications
- Production Ready: Model serialization and deployment guidelines
- Clone the repository: `git clone https://github.com/ralphcajipe/diabetes-prediction.git`, then `cd diabetes-prediction`
- Install dependencies: `pip install pandas numpy scikit-learn matplotlib seaborn jupyter`
- Run the notebook: `jupyter notebook diabaetes_risk_prediction.ipynb`
diabetes-prediction/
├── diabaetes_risk_prediction.ipynb # Main analysis notebook
├── dataset/
│ └── diabetes_data_upload.csv # Dataset file
├── models/ # Saved model artifacts (generated)
│ ├── best_diabetes_model.pkl
│ ├── feature_scaler.pkl
│ └── feature_names.pkl
└── README.md # Project documentation
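Once the notebook has generated the files under `models/`, they could be loaded for inference roughly as shown below. This is only a sketch: it assumes the artifacts were written with Python's `pickle` module, that `feature_names.pkl` holds a plain list of column names, and that the scaler and model follow the scikit-learn `transform`/`predict` interface. The helper names `load_artifacts` and `predict_risk` are hypothetical and do not come from the notebook.

```python
import pickle
from pathlib import Path

import pandas as pd


def load_artifacts(model_dir="models"):
    """Load the model, scaler and feature names (assumes pickle format)."""
    model_dir = Path(model_dir)
    artifacts = {}
    for key, filename in [
        ("model", "best_diabetes_model.pkl"),
        ("scaler", "feature_scaler.pkl"),
        ("feature_names", "feature_names.pkl"),
    ]:
        with open(model_dir / filename, "rb") as fh:
            artifacts[key] = pickle.load(fh)
    return artifacts


def predict_risk(artifacts, record):
    """Return the predicted class for one already-encoded record (a dict)."""
    # Put the columns in the same order that was used during training.
    row = pd.DataFrame([record])[artifacts["feature_names"]]
    scaled = artifacts["scaler"].transform(row)
    return int(artifacts["model"].predict(scaled)[0])
```

Only unpickle files you created yourself or otherwise trust, since `pickle.load` can execute arbitrary code.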
- Define clear success criteria (>90% accuracy, balanced precision/recall)
- Establish baseline performance metrics
- Exploratory Data Analysis (EDA) with comprehensive visualizations
- Data preprocessing and feature encoding
- Strategic train/validation/test splitting (60/20/20)
- Systematic evaluation of 8 classification algorithms
- Model comparison with validation metrics
- Error analysis focusing on medical implications
- Model serialization for production use
- Deployment strategy and monitoring guidelines
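The 60/20/20 split in the workflow above can be sketched with two calls to scikit-learn's `train_test_split`. The stratification, the seed, the function name `split_60_20_20` and the synthetic data are assumptions made for illustration; the notebook may split differently.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_60_20_20(X, y, seed=42):
    """Split into 60% train, 20% validation, 20% test, stratified on y."""
    # First hold out 20% of the data as the final test set.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=seed
    )
    # 25% of the remaining 80% equals 20% of the full data.
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=seed
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic stand-in data: 100 samples, 5 features, balanced labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = np.array([0, 1] * 50)

X_train, X_val, X_test, y_train, y_val, y_test = split_60_20_20(X_demo, y_demo)
print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Models would be compared on the validation portion, with the test portion used once at the end.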
The project evaluates multiple algorithms:
- Decision Tree Classifier
- Gaussian Naive Bayes
- Logistic Regression
- K-Nearest Neighbors
- Support Vector Machine
- Neural Network (MLP)
- Random Forest
- Gradient Boosting
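A comparison loop over these eight algorithms might look like the following sketch. The hyperparameters are scikit-learn defaults (plus a higher `max_iter` and fixed seeds), and the data is synthetic from `make_classification`; none of this is taken from the notebook, so the printed scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Gaussian Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Machine": SVC(random_state=42),
    "Neural Network (MLP)": MLPClassifier(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Synthetic placeholder data with 16 features (an assumed feature count).
X, y = make_classification(n_samples=300, n_features=16, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit every model on the training data and score it on the validation data.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    results[name] = {
        "accuracy": accuracy_score(y_val, pred),
        "recall": recall_score(y_val, pred),
    }

# Print the models from highest to lowest validation accuracy.
for name, scores in sorted(
    results.items(), key=lambda kv: kv[1]["accuracy"], reverse=True
):
    print(f"{name:<24} accuracy={scores['accuracy']:.3f} recall={scores['recall']:.3f}")
```

The sketch skips feature scaling for brevity. KNN, SVM and the MLP are sensitive to feature scale, so in practice they would usually be fitted on scaled inputs.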
Results available after running the notebook
- Low False Negative Rate: Minimizes missed diabetes cases
- Interpretable Predictions: Clear confidence scores for clinical decisions
- Feature Importance: Identifies key health indicators
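The false negative rate, along with sensitivity and specificity, can be derived from a confusion matrix. The sketch below uses scikit-learn's `confusion_matrix` on made-up labels; the function name `medical_metrics` and the example labels are illustrative and are not results from this project.

```python
from sklearn.metrics import confusion_matrix


def medical_metrics(y_true, y_pred):
    """Compute sensitivity, specificity and false negative rate (1 = diabetic)."""
    # With labels=[0, 1], ravel() yields the counts in the order tn, fp, fn, tp.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "sensitivity": float(tp / (tp + fn)),  # share of diabetic cases caught
        "specificity": float(tn / (tn + fp)),  # share of healthy cases cleared
        "false_negative_rate": float(fn / (fn + tp)),  # share of cases missed
    }


# Made-up labels: 4 positive and 6 negative cases, one error of each kind.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = medical_metrics(y_true, y_pred)
print(metrics)
```

In this toy example three of the four positive cases are detected, so sensitivity is 0.75 and the false negative rate is 0.25. The function assumes both classes are present in `y_true`; otherwise a denominator would be zero.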
Source: UCI Machine Learning Repository
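Features: Age, Gender, and 14 symptom indicators (Polyuria, Polydipsia, Sudden Weight Loss, Weakness, Polyphagia, Genital Thrush, Visual Blurring, Itching, Irritability, Delayed Healing, Partial Paresis, Muscle Stiffness, Alopecia, Obesity)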
Target: Binary classification (Diabetes: Yes/No)
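Because most of these features are categorical, they need to be turned into numbers before modelling. The sketch below shows one possible encoding. It assumes the CSV stores symptoms as `Yes`/`No`, gender as `Male`/`Female`, and the target in a column named `class` with `Positive`/`Negative` values. The small DataFrame is hand-written stand-in data, and the `encode` helper is hypothetical.

```python
import pandas as pd

# Hand-written sample standing in for dataset/diabetes_data_upload.csv.
df = pd.DataFrame(
    {
        "Age": [40, 58, 41],
        "Gender": ["Male", "Female", "Male"],
        "Polyuria": ["No", "Yes", "Yes"],
        "Polydipsia": ["Yes", "No", "No"],
        "class": ["Positive", "Negative", "Positive"],
    }
)


def encode(frame, target="class"):
    """Map the assumed categorical values to 0/1 and split off the target."""
    mapping = {
        "Yes": 1, "No": 0,
        "Male": 1, "Female": 0,
        "Positive": 1, "Negative": 0,
    }
    encoded = frame.copy()
    for col in encoded.columns:
        # Numeric columns such as Age are left unchanged.
        if not pd.api.types.is_numeric_dtype(encoded[col]):
            encoded[col] = encoded[col].map(mapping)
    X = encoded.drop(columns=[target])
    y = encoded[target]
    return X, y


X, y = encode(df)
print(X)
print(y.tolist())
```

Any value missing from the mapping becomes `NaN`, which makes unexpected categories easy to spot before training.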
- Python 3.x
- Pandas - Data manipulation
- NumPy - Numerical computations
- Scikit-learn - Machine learning algorithms
- Matplotlib/Seaborn - Data visualization
- Jupyter Notebook - Interactive development
This project is open source and available under the MIT License.
Contributions are welcome! Please feel free to submit a Pull Request.