This project presents a complete end-to-end machine learning pipeline for predicting smoking behavior based on bio-signal and physiological health data. The solution leverages statistical analysis, feature engineering, and ensemble learning to classify individuals as smokers or non-smokers using real-world health metrics.
The objective of this project was to develop an intelligent classification system that can infer the smoking status of an individual based solely on their biological signals. This can assist healthcare professionals in non-invasive diagnosis and risk assessment without requiring a direct declaration of smoking habits.
The dataset comprises over 55,000 samples and 27 health-related features, including demographic details (age, gender), anthropometric measures (height, weight, waist), vital signs (blood pressure), vision, hearing, and multiple laboratory test results such as cholesterol levels, blood sugar, liver enzymes, and hemoglobin.
age,gender,height(cm),weight(kg),waist(cm)systolic,relaxation(blood pressure),fasting blood sugar,Cholesterol,HDL,LDL,triglyceridehemoglobin,AST,ALT,Gtp,serum creatinineoral,tartar,dental caries,eyesight(left/right),hearing(left/right)- Target variable:
smoking(binary classification - smoker or non-smoker)
- Removal of irrelevant features (
ID) - Handling of categorical variables (
gender,oral,tartar, etc.) - Conversion of binary string features (
Y/N) to numeric - One-hot encoding for categorical attributes
- Missing value analysis and treatment
- Univariate and bivariate distributions
- Visual insights on gender-wise and age-wise smoking patterns
- Boxplots for identifying outliers (true outliers retained due to natural variation)
- Employed
ExtraTreesClassifierto compute feature importances - Selected top 15 most relevant features based on importance scores to reduce dimensionality and enhance model performance
Implemented and evaluated the following models:
- Logistic Regression
- Decision Tree Classifier
- Bagging Classifier
- Random Forest Classifier
- Extra Trees Classifier
- Accuracy, Precision, Recall, and F1-Score computed for each model
- Bagging Classifier demonstrated the best performance with approximately 82.6% accuracy and strong precision/recall balance for the smoker class
- Final Bagging Classifier and StandardScaler objects were serialized using
joblibfor future inference and deployment
- Deployed a fully functional Gradio-based web interface that allows users to input health parameters and receive real-time classification output (Smoker or Non-Smoker)
- Language: Python 3
- Libraries: pandas, numpy, scikit-learn, seaborn, matplotlib, gradio, joblib
- Modeling: Ensemble Learning (Bagging, Random Forest, Extra Trees)
- Deployment: Gradio UI for interactive web-based inference
-
Clone this repository:
git clone https://github.com/surendiran-20cl/bio-signal-smoking-classification.git cd bio-signal-smoking-classification -
Install dependencies:
pip install -r requirements.txt
-
Train the model (if not already saved):
python train_model.py
-
Launch the Gradio app:
python gradio_app.py
- Health risk stratification tools
- Early warning systems in preventive healthcare
- Assistive diagnosis in digital health solutions
- Population-level health behavior studies
- Achieved over 82% accuracy with ensemble methods
- Created a compact feature space with only 15 top-ranked attributes
- Built a reusable and interactive Gradio interface for model inference
- Prepared for extension into deployment with API endpoints or cloud hosting
This project was developed as part of an Artificial Intelligence course capstone project to demonstrate applied knowledge in health informatics using machine learning.
This project is released under the Apache 2.0 License.