Built with
- Python 3.x
- Anaconda
- JupyterLab
- Scikit-learn
- TensorFlow/Keras
import pandas as pd
df = pd.read_csv('diabetes.csv')
df.head()
Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
---|---|---|---|---|---|---|---|---|
6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
df.describe().T.style.bar(subset=['mean'], color='#5fba7d')
Feature | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
Glucose | 768 | 120.89 | 31.97 | 0 | 99 | 117 | 140.25 | 199 |
BMI | 768 | 31.99 | 7.88 | 0 | 27.3 | 32 | 36.6 | 67.1 |
Pair plot showing pairwise relationships among the features; red points are diabetic patients (Outcome=1).
- Strongest predictor: Glucose levels
- Missing values: recorded as 0 in Insulin and SkinThickness → impute with the median.
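The zero-to-median rule above can be spelled out in plain Python. This is a minimal sketch on made-up values, not the project's preprocessing code, and the helper name `impute_zeros_with_median` is hypothetical. It takes the median from the non-zero readings only, so the placeholder zeros do not pull the fill value down.

```python
from statistics import median


def impute_zeros_with_median(values):
    """Replace 0 placeholders with the median of the non-zero readings.

    Hypothetical helper for illustration; `values` is one feature column.
    """
    observed = [v for v in values if v != 0]
    if not observed:
        # Every reading is a placeholder, so there is no median to learn.
        return list(values)
    fill = median(observed)
    return [fill if v == 0 else v for v in values]


# Made-up insulin readings; 0 marks a missing measurement.
insulin = [0, 94, 168, 0, 88]
imputed = impute_zeros_with_median(insulin)
print(imputed)
```

In pandas the same idea is usually written by replacing 0 with NaN and then calling `fillna` with the column median, since `median()` skips NaN by default.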
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = df.drop('Outcome', axis=1)
y = df['Outcome']
# Impute 0s → median
cols = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
X[cols] = X[cols].replace(0, X[cols].median())
# Scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split 70/30
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.30, random_state=42, stratify=y)
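The preprocessing code here fits the scaler on the full dataset before splitting, so the test rows influence the mean and standard deviation. A common alternative is to split first and learn those statistics from the training rows only. The sketch below shows that arithmetic in plain Python on made-up glucose values. `fit_standardizer` and `standardize` are hypothetical helpers, and the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values):
    """Learn the mean and population std from training values only."""
    mu = fmean(train_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        # Constant column: avoid dividing by zero.
        sigma = 1.0
    return mu, sigma


def standardize(values, mu, sigma):
    """Apply previously learned statistics to any set of values."""
    return [(v - mu) / sigma for v in values]


# Made-up glucose readings, already split into train and test.
train = [85.0, 148.0, 183.0, 89.0]
test = [116.0]

mu, sigma = fit_standardizer(train)           # statistics come from train only
train_scaled = standardize(train, mu, sigma)
test_scaled = standardize(test, mu, sigma)    # test reuses the train statistics
print(mu, test_scaled)
```

With scikit-learn the same pattern is `scaler.fit(X_train)` followed by `scaler.transform(X_test)`, both called after `train_test_split`.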
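from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input

# Small feed-forward binary classifier: 16 -> 8 -> 1 units
model = Sequential([
    Input(shape=(X_train.shape[1],)),   # one input per feature column
    Dense(16, activation='relu'),
    Dropout(0.2),                       # regularization after the first hidden layer
    Dense(8, activation='relu'),
    Dense(1, activation='sigmoid')      # outputs P(Outcome=1)
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
# validation_split holds out the last 15% of the training rows
history = model.fit(X_train, y_train,
                    validation_split=0.15,
                    epochs=100, batch_size=16, verbose=0)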
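As a sanity check on this architecture, the number of trainable parameters can be worked out by hand: each `Dense` layer has `inputs × units` weights plus `units` biases, and `Dropout` has none. The sketch assumes 8 input features, that is, every column of the `df.head()` table except `Outcome`; `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units


n_features = 8            # assumption: the 8 non-Outcome columns
layer_units = [16, 8, 1]  # units of the three Dense layers, in order

total = 0
n_in = n_features
for units in layer_units:
    count = dense_params(n_in, units)
    print(f"Dense({units}): {count} parameters")
    total += count
    n_in = units          # this layer's output feeds the next layer

print(f"Total trainable parameters: {total}")
```

Under that assumption the layers contribute 144, 136, and 9 parameters, 289 in total, which is the figure `model.summary()` would be expected to report.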
Model | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
Logistic Regression | 0.794 | 0.75 | 0.64 | 0.69 |
Decision Tree | 0.739 | 0.70 | 0.60 | 0.65 |
SVM (RBF) | 0.792 | 0.76 | 0.62 | 0.68 |
Metric | Neural Net | Logistic | Decision Tree | SVM |
---|---|---|---|---|
Accuracy | 0.844 | 0.794 | 0.739 | 0.792 |
Precision | 0.81 | 0.75 | 0.70 | 0.76 |
Recall | 0.73 | 0.64 | 0.60 | 0.62 |
F1-Score | 0.77 | 0.69 | 0.65 | 0.68 |
The neural network scores highest on every metric, reaching 84.4% accuracy.
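The F1 row can be cross-checked against the other two, since F1 is the harmonic mean of precision and recall: F1 = 2 × P × R / (P + R). The short sketch below recomputes it from the rounded precision and recall values in the comparison table; the helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the comparison table.
reported = {
    "Neural Net": (0.81, 0.73, 0.77),
    "Logistic Regression": (0.75, 0.64, 0.69),
    "Decision Tree": (0.70, 0.60, 0.65),
    "SVM (RBF)": (0.76, 0.62, 0.68),
}

for name, (p, r, f1_reported) in reported.items():
    f1 = f1_from(p, r)
    print(f"{name}: computed {f1:.2f}, reported {f1_reported:.2f}")
```

All four computed values round to the reported F1 scores, so the table is internally consistent.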
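# Keras model
model.save('diabetes_nn.h5')
# scikit-learn models
import joblib
joblib.dump(lr, 'diabetes_lr.pkl')
joblib.dump(dt, 'diabetes_dt.pkl')
joblib.dump(svm, 'diabetes_svm.pkl')
# Fitted scaler, needed again at prediction time (file name is illustrative)
joblib.dump(scaler, 'diabetes_scaler.pkl')
# Load & predict
from tensorflow.keras.models import load_model
model = load_model('diabetes_nn.h5')
scaler = joblib.load('diabetes_scaler.pkl')
# Feature order: Pregnancies, Glucose, BloodPressure, SkinThickness,
# Insulin, BMI, DiabetesPedigreeFunction, Age
patient = [[6, 148, 72, 35, 0, 33.6, 0.627, 50]]
patient_scaled = scaler.transform(patient)
pred = model.predict(patient_scaled)[0][0]
print("Risk of diabetes: {:.1%}".format(pred))
# → Risk of diabetes: 91.4%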
Diabetes-Prediction/
├─ data/
│  └─ diabetes.csv
├─ notebooks/
│  └─ EDA.ipynb
├─ models/
│  ├─ diabetes_nn.h5
│  └─ *.pkl
├─ src/
│  ├─ train.py
│  └─ predict.py
├─ requirements.txt
└─ README.md
pandas==2.2.2
numpy==1.26.4
matplotlib==3.9.0
seaborn==0.13.2
scikit-learn==1.5.0
tensorflow==2.17.0
joblib==1.4.2
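To confirm that an environment matches these pins, the installed versions can be compared against the list. This is a minimal sketch using only the standard library. It assumes the pins are written in `name==version` form, and `parse_pins` and `check_pins` are hypothetical helpers rather than part of the project.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_pins(text):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, pinned = line.split("==", 1)
        pins[name.strip()] = pinned.strip()
    return pins


def check_pins(pins):
    """Return {package: (pinned, installed or None)} for every mismatch."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# Two of the pins from the list above, inlined to keep the example self-contained.
example = "pandas==2.2.2\nnumpy==1.26.4\n"
pins = parse_pins(example)
for name, (pinned, installed) in check_pins(pins).items():
    print(f"{name}: pinned {pinned}, installed {installed}")
```

In practice the text would come from reading `requirements.txt`; an empty result from `check_pins` means every pinned package is installed at the pinned version.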
Feel free to open issues or PRs to improve the model or add new features (e.g., SHAP explainability, Streamlit GUI).
MIT © 2025 Diabetes-Prediction-Team