A machine learning project that predicts student math scores from demographic and academic features. It implements an end-to-end pipeline, from data ingestion through model training to deployment behind a Flask web interface.
- End-to-End ML Pipeline: Complete workflow from data ingestion to model deployment
- Multiple Algorithm Comparison: Compares seven regression algorithms, each with hyperparameter tuning
- Real-time Predictions: Flask web application for instant score predictions
- Automated Model Selection: Picks the best-performing model by R² score
- Data Preprocessing: Handles categorical encoding and feature scaling
- Modular Architecture: Well-structured codebase with separate components for each pipeline stage
student_performance/
├── artifacts/                    # Stored models and preprocessors
│   ├── model.pkl                # Trained ML model
│   ├── preprocessor.pkl         # Data preprocessing pipeline
│   ├── train.csv               # Training dataset
│   ├── test.csv                # Testing dataset
│   └── data.csv                # Raw dataset
├── notebook/
│   └── data/
│       └── stud.csv            # Original dataset
├── src/
│   ├── components/             # Core ML pipeline components
│   │   ├── __init__.py
│   │   ├── data_ingestion.py   # Data loading and splitting
│   │   ├── data_transformation.py  # Data preprocessing
│   │   └── model_trainer.py    # Model training and selection
│   ├── pipeline/               # Prediction pipeline
│   │   ├── __init__.py
│   │   └── predict_pipeline.py # Inference pipeline
│   ├── __init__.py
│   ├── exception.py            # Custom exception handling
│   ├── logger.py              # Logging configuration
│   └── utils.py               # Utility functions
├── templates/                  # HTML templates for web app
│   ├── index.html             # Homepage template
│   └── home.html              # Prediction form template
├── app.py                     # Flask web application
├── requirements.txt           # Project dependencies
└── README.md                  # Project documentation
- Loads student performance dataset
- Splits data into training (80%) and testing (20%) sets
- Saves processed datasets to artifacts folder
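
The ingestion steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual `data_ingestion.py`: the function name `split_dataset`, the fixed seed, and the demo column names are assumptions, and a tiny in-memory frame stands in for the real CSV.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_dataset(df: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
    """Return (train_df, test_df); the default gives an 80/20 split."""
    train_df, test_df = train_test_split(df, test_size=test_size, random_state=seed)
    return train_df, test_df


# Stand-in data with hypothetical column names; the real project would
# load the student dataset from disk instead.
demo = pd.DataFrame({"reading_score": range(10), "math_score": range(10, 20)})
train_df, test_df = split_dataset(demo)
print(len(train_df), len(test_df))  # 8 2
```

In the real pipeline, each frame would then presumably be written out with something like `train_df.to_csv("artifacts/train.csv", index=False)`.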
- Encodes categorical variables (gender, ethnicity, education level, etc.)
- Applies feature scaling using StandardScaler
- Creates preprocessing pipeline for consistent data transformation
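
One way such a preprocessing pipeline could look is sketched below. The README only states that categorical variables are encoded and that `StandardScaler` is used, so the choice of `OneHotEncoder`, the column names, and the helper `build_preprocessor` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; the real dataset may name them differently.
NUMERIC = ["reading_score", "writing_score"]
CATEGORICAL = ["gender", "lunch"]


def build_preprocessor() -> ColumnTransformer:
    """Scale numeric columns and one-hot encode categorical ones."""
    numeric_pipe = Pipeline([("scaler", StandardScaler())])
    # One-hot encoding is an assumed choice; unknown categories are ignored
    # at prediction time instead of raising an error.
    categorical_pipe = Pipeline([("onehot", OneHotEncoder(handle_unknown="ignore"))])
    return ColumnTransformer([
        ("num", numeric_pipe, NUMERIC),
        ("cat", categorical_pipe, CATEGORICAL),
    ])


demo = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "lunch": ["standard", "free/reduced", "standard", "standard"],
    "reading_score": [72, 65, 90, 58],
    "writing_score": [74, 60, 88, 55],
})
preprocessor = build_preprocessor()
features = preprocessor.fit_transform(demo)
print(features.shape)  # (4, 6): 2 scaled columns + 2 gender + 2 lunch
```

Fitting the transformer once and reusing the fitted object at prediction time is what keeps training and inference transformations consistent.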
The system evaluates multiple regression algorithms:
- Random Forest Regressor
- Decision Tree Regressor
- Gradient Boosting Regressor
- Linear Regression
- XGBoost Regressor
- CatBoost Regressor
- AdaBoost Regressor
Each model undergoes hyperparameter tuning using GridSearchCV to find optimal parameters.
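
As a rough sketch of this tuning loop, the example below runs `GridSearchCV` over a subset of the listed models on synthetic data. The parameter grids, the helper name `evaluate_models`, and the use of `make_regression` in place of the student dataset are all assumptions; XGBoost, CatBoost, and the other boosting models are omitted to keep the example dependent on scikit-learn only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the preprocessed features.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Illustrative grids only; an empty grid means "fit with default parameters".
candidates = {
    "Linear Regression": (LinearRegression(), {}),
    "Decision Tree": (DecisionTreeRegressor(random_state=0), {"max_depth": [3, 5, None]}),
    "Random Forest": (RandomForestRegressor(random_state=0), {"n_estimators": [16, 32]}),
}


def evaluate_models(candidates, X_train, y_train, X_test, y_test):
    """Tune each candidate with GridSearchCV and return its test-set R²."""
    report = {}
    for name, (model, grid) in candidates.items():
        search = GridSearchCV(model, grid, cv=3, scoring="r2")
        search.fit(X_train, y_train)
        predictions = search.best_estimator_.predict(X_test)
        report[name] = r2_score(y_test, predictions)
    return report


report = evaluate_models(candidates, X_train, y_train, X_test, y_test)
for name, score in report.items():
    print(f"{name}: R² = {score:.3f}")
```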
- Automatically selects the best-performing model based on R² score
- Requires a minimum R² score of 0.6 for a model to be accepted
- Saves the best model for production use
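
The selection rule above could be expressed along these lines. The function names `select_best_model` and `save_object` and the scores in `report` are placeholders for illustration, not results from this project; only the 0.6 threshold comes from the description above.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Dict

MIN_R2 = 0.6  # acceptance threshold described above


def select_best_model(report: Dict[str, float], min_r2: float = MIN_R2) -> str:
    """Return the name with the highest R², or raise if it is below min_r2."""
    best_name = max(report, key=report.get)
    if report[best_name] < min_r2:
        raise ValueError(f"No model reached R² >= {min_r2}")
    return best_name


def save_object(obj, path: Path) -> None:
    """Pickle obj to path, creating parent directories if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)


# Placeholder scores, chosen only to exercise the function.
report = {"Linear Regression": 0.88, "Decision Tree": 0.74, "Random Forest": 0.85}
best = select_best_model(report)
print(best)  # Linear Regression

# Write to a temporary directory here; the project stores artifacts/model.pkl.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "artifacts" / "model.pkl"
    save_object({"name": best}, target)
    saved_ok = target.exists()
```

Raising when no model clears the threshold keeps a weak model from being written to disk and served by the web app.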
- Gender: Male/Female
- Race/Ethnicity: Student's ethnic background
- Parental Level of Education: Education level of parents
- Lunch: Standard or free/reduced lunch
- Test Preparation Course: Completed or not completed
- Reading Score: Student's reading test score
- Writing Score: Student's writing test score
- Math Score Prediction: Predicted math test score (0-100)
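
To illustrate how the inputs listed above might be packaged for the model, here is a small sketch that turns one student's form values into a single-row DataFrame. The class name `StudentRecord`, the snake_case column names, and the example category values are assumptions; the actual `predict_pipeline.py` may structure this differently.

```python
from dataclasses import asdict, dataclass

import pandas as pd


@dataclass
class StudentRecord:
    """One student's inputs; field names are hypothetical column names."""
    gender: str
    race_ethnicity: str
    parental_level_of_education: str
    lunch: str
    test_preparation_course: str
    reading_score: float
    writing_score: float

    def to_frame(self) -> pd.DataFrame:
        """Return the record as a one-row DataFrame, one column per field."""
        return pd.DataFrame([asdict(self)])


# Example values are illustrative and may not match the dataset's categories.
record = StudentRecord(
    gender="female",
    race_ethnicity="group B",
    parental_level_of_education="bachelor's degree",
    lunch="standard",
    test_preparation_course="none",
    reading_score=72,
    writing_score=74,
)
frame = record.to_frame()
print(frame.shape)  # (1, 7)
```

Assuming the saved preprocessor and model have been loaded from the artifacts folder, the prediction would then be something like `model.predict(preprocessor.transform(frame))`.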
- Python 3.8
- Scikit-learn: Machine learning algorithms and preprocessing
- XGBoost & CatBoost: Advanced boosting algorithms
- Flask: Web application framework
- Pandas & NumPy: Data manipulation and analysis
- HTML/CSS: Frontend interface
The best model is chosen by its R² score on the held-out test data, and only that model is used to serve student math score predictions.