GitHub - Manav13254/Data-Science-Project

Diabetes Prediction Project

This project performs an end-to-end Exploratory Data Analysis (EDA) and develops a predictive model to determine whether a patient has diabetes based on diagnostic measurements. The analysis uses the Pima Indians Diabetes Database.

📋 Table of Contents

Project Overview

Dataset Description

Dependencies

Key Features of the Analysis

Model Performance

Visualizations

🚀 Project Overview

The primary goal of this notebook is to clean the diagnostic data, explore relationships between health metrics (such as Glucose, BMI, and Age), and train a Machine Learning model (Logistic Regression) to predict diabetes outcomes.

📊 Dataset Description

The dataset used is diabetes.csv. It contains 768 observations with the following features:

Glucose: Plasma glucose concentration.

BloodPressure: Diastolic blood pressure (mm Hg).

SkinThickness: Triceps skin fold thickness (mm).

Insulin: 2-hour serum insulin (mu U/ml).

BMI: Body mass index (weight in kg/(height in m)^2).

DiabetesPedigreeFunction: A function that scores the likelihood of diabetes based on family history.

Age: Age in years.

Outcome: Class variable (0 if non-diabetic, 1 if diabetic).
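As a minimal sketch of how the columns above might be loaded and checked, the snippet below reads a CSV and verifies that the documented column names are present. The helper name `load_diabetes` and the two sample rows are hypothetical and exist only so the example runs without the real file.

```python
import io

import pandas as pd

# Column names taken from the dataset description above.
EXPECTED_COLUMNS = [
    "Glucose", "BloodPressure", "SkinThickness", "Insulin",
    "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]


def load_diabetes(source):
    """Read the CSV and verify that the documented columns are present.

    Hypothetical helper, not part of the notebook.
    """
    df = pd.read_csv(source)
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df


# Tiny in-memory stand-in so the sketch runs without diabetes.csv.
# These rows are invented for illustration, not real records.
_sample = io.StringIO(
    "Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "120,70,20,80,25.0,0.5,30,0\n"
    "160,85,32,150,34.5,0.9,52,1\n"
)
df = load_diabetes(_sample)
print(df.shape)
```

With the real file in the working directory, the call would presumably be `load_diabetes("diabetes.csv")`.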

🛠 Dependencies

To run this notebook, you will need the following Python libraries:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
```

🔍 Key Features of the Analysis

Descriptive Statistics: Summarizing central tendency and dispersion to identify potential outliers and implausible values (e.g., zero values in BMI or Blood Pressure, which are not physiologically plausible and likely stand in for missing measurements).
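The descriptive-statistics step can be sketched roughly as follows. The small DataFrame is invented for illustration; in the notebook the same calls would presumably run on the frame loaded from diabetes.csv.

```python
import pandas as pd

# Illustrative values only, not taken from diabetes.csv.
df = pd.DataFrame({
    "Glucose": [120, 95, 0, 150],
    "BloodPressure": [70, 0, 64, 80],
    "BMI": [25.0, 31.2, 0.0, 28.4],
})

# describe() reports count, mean, std, min, quartiles, and max per column.
summary = df.describe()

# Count zero entries per column; a minimum of 0 in describe() is the first hint.
zero_counts = (df == 0).sum()

print(summary.loc[["mean", "std", "min", "max"]])
print(zero_counts)
```

A minimum of 0 in a column such as BMI is a quick signal that the zeros deserve a closer look before modeling.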

Exploratory Data Analysis (EDA):

Correlation analysis using heatmaps to find which features most influence the Outcome.

Distribution analysis of health metrics.
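One possible way to rank features by their correlation with Outcome is sketched below. The data is synthetic (generated so that the outcome depends on Glucose), and the variable names are assumptions rather than the notebook's own. Note that correlation measures linear association, not causal influence.

```python
import numpy as np
import pandas as pd

# Synthetic data: Outcome is constructed to depend on Glucose but not on Age.
rng = np.random.default_rng(0)
n = 200
glucose = rng.normal(120, 30, n)
age = rng.normal(35, 10, n)
outcome = (glucose + rng.normal(0, 15, n) > 125).astype(int)
df = pd.DataFrame({"Glucose": glucose, "Age": age, "Outcome": outcome})

# Pairwise Pearson correlations between all columns.
corr = df.corr()

# Rank features by absolute correlation with the target.
ranking = corr["Outcome"].drop("Outcome").abs().sort_values(ascending=False)
print(ranking)
```

The `corr` matrix is what a heatmap would display, for example via `sns.heatmap(corr, annot=True)`.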

Data Cleaning: Checking for null values and handling inconsistencies in the dataset.
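The text does not say how inconsistencies are handled, so the following is only a sketch under one common assumption: zeros in certain measurement columns are treated as missing and filled with the column median. The column list and the imputation strategy are assumptions, and the data is invented.

```python
import numpy as np
import pandas as pd

# Illustrative values only, not taken from diabetes.csv.
df = pd.DataFrame({
    "Glucose": [120, 0, 95, 150],
    "BMI": [25.0, 31.0, 0.0, 28.0],
    "Outcome": [0, 1, 0, 1],
})

# Step 1: check for explicit nulls.
print(df.isnull().sum())

# Step 2 (assumption): treat zeros in these columns as missing values.
ZERO_AS_MISSING = ["Glucose", "BMI"]
cleaned = df.copy()
cleaned[ZERO_AS_MISSING] = cleaned[ZERO_AS_MISSING].replace(0, np.nan)

# Step 3 (assumption): fill the gaps with each column's median.
medians = cleaned[ZERO_AS_MISSING].median()
cleaned[ZERO_AS_MISSING] = cleaned[ZERO_AS_MISSING].fillna(medians)

print(cleaned)
```

Median imputation is used here because it is less sensitive to extreme values than the mean; other strategies, such as dropping the affected rows, would be equally valid choices.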

Machine Learning:

Implementation of Logistic Regression.

Hyperparameter tuning using RandomizedSearchCV to find the best model estimators.
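The tuning step might look roughly like the sketch below. It runs on synthetic data from `make_classification`, and the scaling step, the search space for `C`, the number of iterations, and the scoring metric are all assumptions rather than the notebook's actual settings.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes features and labels.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale inside the pipeline so each CV fold is scaled on its own training part.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Assumed search space: regularization strength sampled on a log scale.
param_distributions = {"logisticregression__C": loguniform(1e-3, 1e2)}

search = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=10,
    cv=5,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X_train, y_train)

best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print(search.best_params_)
print(test_accuracy)
```

`search.best_estimator_` is the pipeline refit on the full training split with the best sampled parameters, so it can be used directly for prediction.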

📈 Model Performance

The model's effectiveness is evaluated using the Receiver Operating Characteristic (ROC) Curve.

Metric: Area Under the Curve (AUC).

Result: The notebook includes code to calculate the roc_auc score and plot the curve, demonstrating the model's ability to distinguish between diabetic and non-diabetic patients.
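For reference, the AUC calculation can be illustrated with `roc_curve` and `auc` on a toy example. The labels and scores below are made up; in the notebook the scores would presumably come from the model's predicted probabilities.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Toy labels and predicted scores, for illustration only.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# False positive rate and true positive rate at each score threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve via the trapezoidal rule.
roc_auc = auc(fpr, tpr)
print(roc_auc)  # 0.75
```

An AUC of 0.75 here means that in three of the four positive/negative pairs the positive example received the higher score. Plotting `fpr` against `tpr`, for example with `plt.plot(fpr, tpr)`, gives the ROC curve itself.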

🖼 Visualizations

The project generates several key plots:

Heatmaps: To visualize the correlation matrix.

ROC Curve: To visualize the trade-off between the true positive rate and false positive rate at various threshold settings.

