This project predicts the fuel efficiency (MPG, miles per gallon) of automobiles from their technical specifications. It applies data preprocessing techniques, regression algorithms, and ensemble methods to build accurate and robust predictive models.
- Source: UCI Machine Learning Repository, Auto MPG Dataset
- Total Instances: 398
- Features Used:
  - Cylinders
  - Displacement
  - Horsepower
  - Weight
  - Acceleration
  - Model Year
  - Origin
- Target Variable: MPG (Miles Per Gallon)
- Data Cleaning
  - Handled 12 missing values in the `Horsepower` column using group-wise median imputation (based on cylinder count).
- Exploratory Data Analysis (EDA)
  - Correlation matrix, clustermap, skewness analysis, and distribution plots.
- Outlier Removal
  - Used the IQR method on `Horsepower` and `Acceleration`.
- Feature Engineering
  - Log transformation applied to the target (`MPG`) due to skewness.
  - One-hot encoding on `Cylinders` and `Origin`.
- Data Splitting & Scaling
  - 10% training / 90% testing split to challenge the models.
  - Applied `RobustScaler` to minimize the influence of outliers.
- Modeling
  - Applied and tuned:
    - Linear Regression
    - Ridge Regression (L2)
    - Lasso Regression (L1)
    - ElasticNet (L1 + L2)
    - XGBoost Regressor
    - Averaging Ensemble (Lasso + XGBoost)
- Performance Evaluation
  - Mean Squared Error (MSE) used as the evaluation metric.
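The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the synthetic `make_toy_frame` data, the `log1p` variant of the log transform, the 1.5 × IQR fence, the random seed, and fitting the scaler on the training split only are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler


def make_toy_frame(n=200, seed=0):
    """Synthetic stand-in for the Auto MPG table (hypothetical values)."""
    rng = np.random.default_rng(seed)
    cyl = rng.choice([4, 6, 8], size=n)
    df = pd.DataFrame({
        "Cylinders": cyl,
        "Displacement": 30 * cyl + rng.normal(0, 20, n),
        "Horsepower": 40 + 15 * cyl + rng.normal(0, 10, n),
        "Weight": 1500 + 300 * cyl + rng.normal(0, 200, n),
        "Acceleration": rng.normal(15.5, 2.5, n),
        "Model Year": rng.integers(70, 83, n),
        "Origin": rng.choice([1, 2, 3], size=n),
    })
    df["MPG"] = 45 - 3 * cyl + rng.normal(0, 2, n)
    # Blank out a few Horsepower values so there is something to impute.
    df.loc[df.sample(6, random_state=seed).index, "Horsepower"] = np.nan
    return df


def impute_by_group_median(df, column="Horsepower", group="Cylinders"):
    """Fill NaNs in `column` with the median of rows sharing the same `group`."""
    out = df.copy()
    medians = out.groupby(group)[column].transform("median")
    out[column] = out[column].fillna(medians)
    return out


def drop_iqr_outliers(df, columns, k=1.5):
    """Keep rows inside [Q1 - k*IQR, Q3 + k*IQR] for every listed column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]


def prepare(df, test_size=0.9, seed=42):
    """Impute, trim outliers, transform the target, encode, split, and scale."""
    df = impute_by_group_median(df)
    df = drop_iqr_outliers(df, ["Horsepower", "Acceleration"])
    y = np.log1p(df["MPG"])  # assumption: log1p; the README only says "log"
    X = pd.get_dummies(df.drop(columns="MPG"),
                       columns=["Cylinders", "Origin"], dtype=float)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed)
    # Assumption: the scaler is fitted on the training rows only.
    scaler = RobustScaler().fit(X_train)
    X_train = pd.DataFrame(scaler.transform(X_train),
                           columns=X.columns, index=X_train.index)
    X_test = pd.DataFrame(scaler.transform(X_test),
                          columns=X.columns, index=X_test.index)
    return X_train, X_test, y_train, y_test
```

Calling `prepare(make_toy_frame())` returns scaled train and test feature frames plus the log-transformed targets. With `test_size=0.9`, only about a tenth of the rows are used for fitting, which mirrors the deliberately small training split described above.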
| Model | MSE | Notes |
|---|---|---|
| Linear Regression | 0.01363 | Baseline model |
| Ridge Regression | 0.01340 | Slightly improved with L2 regularization |
| Lasso Regression | 0.01331 | Feature selection + low error |
| ElasticNet | 0.01330 | Best performance, balanced model |
| XGBoost | 0.01810 | Overcomplicated for this small dataset |
| Averaged (Lasso + XGB) | 0.01365 | Balanced, but does not outperform Lasso or ElasticNet |
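A comparison like the table above could be produced along these lines. This is a hedged sketch rather than the project's tuning code: the `ALPHAS` grid, the `l1_ratio` values, the 5-fold cross-validation, and the helper names `build_candidates` and `compare_models` are assumptions, and the demo runs on synthetic data from `make_regression`, so its numbers will not match the table.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical search grid for the regularization strength.
ALPHAS = np.logspace(-4, 1, 12)


def build_candidates(cv=5):
    """Return the linear models, wrapping the regularized ones in a grid search."""
    def tuned(estimator, grid):
        return GridSearchCV(estimator, grid,
                            scoring="neg_mean_squared_error", cv=cv)

    return {
        "Linear Regression": LinearRegression(),
        "Ridge Regression": tuned(Ridge(), {"alpha": ALPHAS}),
        "Lasso Regression": tuned(Lasso(max_iter=10_000), {"alpha": ALPHAS}),
        "ElasticNet": tuned(ElasticNet(max_iter=10_000),
                            {"alpha": ALPHAS, "l1_ratio": [0.2, 0.5, 0.8]}),
    }


def compare_models(X_train, y_train, X_test, y_test):
    """Fit every candidate and return a {model name: test MSE} mapping."""
    scores = {}
    for name, model in build_candidates().items():
        model.fit(X_train, y_train)
        scores[name] = mean_squared_error(y_test, model.predict(X_test))
    return scores


if __name__ == "__main__":
    # Synthetic data only; swap in the prepared Auto MPG splits in practice.
    X, y = make_regression(n_samples=300, n_features=10, noise=10.0,
                           random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.9,
                                              random_state=0)
    for name, mse in compare_models(X_tr, y_tr, X_te, y_te).items():
        print(f"{name}: {mse:.5f}")
```

Because `GridSearchCV` refits the best configuration on the full training split by default, calling `predict` on it evaluates the tuned model directly.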
- ElasticNet and Lasso were the most effective models for this dataset, with the lowest MSE (0.01330 and 0.01331).
- RobustScaler proved useful in stabilizing model performance.
- XGBoost, while powerful, had the highest error (MSE 0.01810), likely because the dataset is small and simple.
- Averaging models yielded consistent but not superior results.
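The averaging ensemble mentioned above amounts to taking the mean of each model's predictions. The sketch below illustrates the idea under stated assumptions: `average_predictions` and `fit_and_average` are hypothetical helper names, the hyperparameters are placeholders, and scikit-learn's `GradientBoostingRegressor` is swapped in for XGBoost so the example needs no extra dependency. An `XGBRegressor` could be passed in its place, since it follows the same `fit`/`predict` interface.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def average_predictions(models, X, weights=None):
    """Return the (optionally weighted) mean of each fitted model's predictions."""
    preds = np.column_stack([m.predict(X) for m in models])
    return np.average(preds, axis=1, weights=weights)


def fit_and_average(X_train, y_train, X_test):
    """Fit a linear and a boosted-tree model, then average their predictions."""
    models = [
        Lasso(alpha=0.01, max_iter=10_000),         # placeholder alpha
        GradientBoostingRegressor(random_state=0),  # stand-in for XGBoost
    ]
    for model in models:
        model.fit(X_train, y_train)
    return average_predictions(models, X_test)


if __name__ == "__main__":
    # Synthetic data only, so the printed MSE is not comparable to the README.
    X, y = make_regression(n_samples=300, n_features=10, noise=10.0,
                           random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.9,
                                              random_state=0)
    blended = fit_and_average(X_tr, y_tr, X_te)
    print(f"Averaged ensemble MSE: {mean_squared_error(y_te, blended):.5f}")
```

An unweighted mean gives both models equal say. Passing `weights` lets the stronger model dominate, and scikit-learn's `VotingRegressor` offers the same behavior as a single estimator.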
- Python 3.10+
- NumPy, pandas
- Seaborn, Matplotlib
- scikit-learn
- XGBoost
- Jupyter Notebook / VS Code