This project focuses on classifying individuals based on income levels using the Census Income dataset from the UCI Machine Learning Repository. The goal is to predict whether an individual's income exceeds $50,000 per year based on demographic and employment-related features.
The dataset used in this project is the Census Income Dataset (also known as the Adult dataset), which contains 48,842 records collected from the 1994 U.S. Census. It includes various attributes such as age, education, occupation, and hours worked per week.
- Dataset Source: UCI Machine Learning Repository
- Features: 14 categorical and numerical attributes
- Target Variable: Binary classification (`<=50K` or `>50K`)
- Data Preprocessing
  - Handling missing values
  - Encoding categorical variables
  - Feature scaling
- Exploratory Data Analysis (EDA)
  - Visualizing distributions of numerical and categorical variables
  - Identifying correlations between features
- Model Training & Evaluation
  - Models used: Random Forest, XGBoost, SVM, and LightGBM
  - Performance metrics: Accuracy, Precision, Recall, F1-score
  - Hyperparameter tuning for best performance
- Deployment
  - The final model (LightGBM) is saved as a pickle file (`lightgbm_income_classifier.pkl`).
  - A Flask API is built for real-time income classification.
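The four performance metrics listed above can be computed directly from confusion-matrix counts. The following is a minimal sketch in plain Python, not the notebook's actual evaluation code; the function name `binary_metrics` and the sample labels are hypothetical and chosen only for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 1 means income >50K, 0 means income <=50K.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Because most records in this dataset fall in the `<=50K` class, accuracy alone can look good even when few `>50K` cases are found, which is why precision, recall, and F1 are reported alongside it.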
The deployment is handled using Flask, providing a REST API for income classification.
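The logic behind such a prediction endpoint can be sketched independently of Flask. This is an illustrative outline, not the contents of `app.py`: the field list is taken from the sample JSON payload in this README, while `StubModel`, `predict_income`, and the error format are assumptions. The stub stands in for the pickled LightGBM classifier so the example runs without the model file.

```python
import json

# Field names from the README's sample payload; the real API may expect
# more or different fields (assumption).
REQUIRED_FIELDS = (
    "age",
    "education-num",
    "hours-per-week",
    "occupation",
    "marital-status",
)


class StubModel:
    """Hypothetical stand-in for the model in lightgbm_income_classifier.pkl."""

    def predict(self, rows):
        # Toy rule for illustration only; it does not reflect the trained model.
        return [
            1 if row["education-num"] >= 13 and row["hours-per-week"] >= 40 else 0
            for row in rows
        ]


def predict_income(model, payload):
    """Validate a JSON-like dict and return (response_body, http_status)."""
    missing = [field for field in REQUIRED_FIELDS if field not in payload]
    if missing:
        return {"error": "missing fields: " + ", ".join(missing)}, 400
    row = {field: payload[field] for field in REQUIRED_FIELDS}
    label = int(model.predict([row])[0])
    return {"prediction": label}, 200


payload = {
    "age": 39,
    "education-num": 13,
    "hours-per-week": 40,
    "occupation": "Exec-managerial",
    "marital-status": "Married-civ-spouse",
}
body, status = predict_income(StubModel(), payload)
print(status, json.dumps(body))  # 200 {"prediction": 1}
```

In a Flask view, the payload would typically come from `request.get_json()` and the result would be returned as `jsonify(body), status`. A real LightGBM model would also need the row encoded and scaled the same way as the training data.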
- Install dependencies: `pip install flask pandas numpy lightgbm`
- Run the Flask application: `python app.py`
- API Endpoints:
  - Home Route: `http://127.0.0.1:5000/` (returns a welcome message)
  - Prediction Route: `http://127.0.0.1:5000/predict` (accepts JSON input and returns a prediction)
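Once the server is running, the prediction route can be called from any HTTP client. The sketch below uses only the Python standard library and the sample payload from this README. It assumes the route accepts a `POST` request with a JSON body, which the README implies but does not state; the helper names are hypothetical.

```python
import json
import urllib.request

PREDICT_URL = "http://127.0.0.1:5000/predict"  # prediction route from this README


def build_predict_request(features, url=PREDICT_URL):
    """Build a JSON POST request (assumed method) without sending it."""
    data = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_predict_request(features, url=PREDICT_URL, timeout=5):
    """Send the request to a running server and decode the JSON reply."""
    with urllib.request.urlopen(build_predict_request(features, url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


sample = {
    "age": 39,
    "education-num": 13,
    "hours-per-week": 40,
    "occupation": "Exec-managerial",
    "marital-status": "Married-civ-spouse",
}
request = build_predict_request(sample)
print(request.get_method(), request.full_url)

# With the Flask app running locally, uncomment to send the request:
# print(send_predict_request(sample))
```

Building the request separately from sending it makes the payload easy to inspect before the server is started.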
Example request:
{
  "age": 39,
  "education-num": 13,
  "hours-per-week": 40,
  "occupation": "Exec-managerial",
  "marital-status": "Married-civ-spouse"
}
Example response (`1` indicates income >50K, `0` indicates income <=50K):
{
  "prediction": 1
}
Project structure:
|-- data/
|   |-- census-income.csv                  # Processed dataset
|
|-- models/
|   |-- lightgbm_income_classifier.pkl     # Trained model
|
|-- notebooks/
|   |-- EDA_and_Modeling.ipynb             # Exploratory Data Analysis & Model Training
|
|-- app.py                                 # Flask API for deployment
|-- README.md                              # Project documentation
This project demonstrates a complete pipeline from data preprocessing to model deployment. LightGBM performed best among the models compared, and the Flask API serves its predictions in real time. Future improvements may include serving the model with FastAPI or containerizing the application with Docker for easier scaling.
This project was developed as part of the Intellipaat Advanced Data Science & AI Program.
For any queries or feedback, feel free to reach out via GitHub issues.