Machine Learning Project: Predicting insurance claim probability using advanced classification techniques
- Overview
- Problem Statement
- Dataset
- Project Structure
- Models Implemented
- Results
- Installation
- Usage
- Key Findings
- Contributing
- Author
This project tackles the Enhanced Safe Driver Prediction Challenge, focusing on predicting the probability that an auto insurance policyholder will file a claim. Built on an improved version of the Porto Seguro dataset, this project emphasizes:
- Smart Feature Engineering
- Handling Severely Imbalanced Data (94.9% vs 5.1%)
- Maximizing AUROC Performance
- Robust Cross-Validation
Best Model: CatBoost with Hyperparameter Tuning
- Kaggle Public Score: 0.64138
- Training Time: 12.5 hours
- Key Success: Perfect generalization (CV score matched Kaggle score)
Insurance companies must assess risk to determine premiums and minimize financial losses. This project develops a machine learning classifier to:
- Predict claim probability based on policyholder and vehicle features
- Achieve high AUROC for effective risk discrimination
- Enable personalized premium pricing
- Reduce fraudulent claims
| Metric | Value |
|---|---|
| Training Samples | 296,209 |
| Test Samples | 126,948 |
| Features | 67 |
| Numeric Variables | 37 |
| Categorical Variables | 30 |
| Class Imbalance | 18.5:1 |
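The 18.5:1 imbalance can be verified directly from the training file. A quick sketch, assuming the label column is named target (as in the original Porto Seguro data):

import pandas as pd

train = pd.read_csv('train1.csv')
counts = train['target'].value_counts()           # label column name is an assumption

print(counts / len(train))                        # ~94.9% non-claims vs ~5.1% claims
print(f"Imbalance ratio: {counts.max() / counts.min():.1f} : 1")   # ~18.5 : 1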
- Individual Variables (ps_ind_*)
- Car-Related Variables (ps_car_*)
- Regional Variables (ps_reg_*)
- Calculated Variables (ps_calc_*)
- Engineered Features (feature1-8)
- Target Variable (binary: 0/1)
- Missing Data: Up to 69% in some features
- Severe Class Imbalance: 94.9% non-claims
- High Correlation: 21 variables flagged
- Zero-Inflation: Multiple variables
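A minimal EDA sketch for quantifying the missing-data and correlation issues, assuming missing values appear as NaN in train1.csv (the original Porto Seguro files encode them as -1, hence the optional replace step):

import numpy as np
import pandas as pd

train = pd.read_csv('train1.csv')
# train = train.replace(-1, np.nan)   # only needed if missing values are coded as -1

# Percentage of missing values per feature (the worst columns reach ~69%)
missing_pct = (train.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct.head(10))

# Highly correlated numeric feature pairs (|r| > 0.5), each pair listed once
corr = train.select_dtypes('number').corr().abs()
pairs = corr.mask(np.triu(np.ones(corr.shape, dtype=bool))).stack()
print(pairs[pairs > 0.5].sort_values(ascending=False).head(10))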
enhanced-safe-driver-prediction/
│
├── kaggle.ipynb                     # Main training notebook
├── kaggle_report.pdf                # Comprehensive project report
│
├── Data/
│   ├── train1.csv                   # Training dataset
│   └── test.csv                     # Test dataset
│
├── Models/
│   ├── submission_CatBoost.csv      # Winner!
│   ├── submission_RandomForest.csv
│   ├── submission_AdaBoost.csv
│   ├── submission_DecisionTree.csv
│   ├── submission_KNN.csv
│   └── submission_NaiveBayes.csv
│
├── Visualizations/
│   └── model_training_comparison.png
│
└── README.md                        # This file
Naive Bayes
- Training Time: 1.86s
- Train AUROC: 0.6423
- Kaggle Score: Not submitted
- Note: Fast baseline built on the feature-independence assumption
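For illustration, a baseline along these lines could be reproduced as below; the exact Naive Bayes variant is not stated in the report, so GaussianNB on the numeric columns is an assumption, as is the target column name:

import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

train = pd.read_csv('train1.csv')
y = train['target']                                          # assumed label column
X = train.drop(columns=['target']).select_dtypes('number').fillna(0)

nb = GaussianNB().fit(X, y)
print(f"Train AUROC: {roc_auc_score(y, nb.predict_proba(X)[:, 1]):.4f}")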
K-Nearest Neighbors (KNN)
- Training Time: 4.21s
- Train AUROC: 0.9240 (Highest!)
- Kaggle Score: 0.50623 (Worst - overfitting!)
- Warning: Memorized the training data
Decision Tree
- Training Time: 12.50s
- Train AUROC: 0.6743
- Kaggle Score: 0.57333
- Nodes: 1,023 | Leaves: 512
Random Forest
- Training Time: 48.85s
- Train AUROC: 0.9116
- Kaggle Score: 0.59801
- Issue: 34% performance drop (overfitting)
AdaBoost
- Training Time: 341.44s
- Train AUROC: 0.6438
- Kaggle Score: 0.63016 (3rd place)
- Strength: Good with imbalanced data
CatBoost (Hyperparameter-Tuned)
- Training Time: 45,005s (12.5 hours)
- Train AUROC: 0.6383 (CV)
- Kaggle Score: 0.64138 (BEST!)
- Grid search: 243 combinations × 3 folds = 729 fits
- Key: Perfect generalization (CV matched Kaggle)
Best hyperparameters found:

{
    'iterations': 500,
    'learning_rate': 0.03,
    'depth': 6,
    'l2_leaf_reg': 5,
    'border_count': 32,
    'class_weights': [1, 5]
}

| Rank | Model | Kaggle Score | Training AUROC | Gap |
|---|---|---|---|---|
| 1 | CatBoost | 0.64138 | 0.6383 | +0.5% |
| 2 | CatBoost v2 | 0.63825 | 0.6383 | ±0.0% |
| 3 | AdaBoost | 0.63016 | 0.6438 | -2.1% |
| 4 | Random Forest | 0.59801 | 0.9116 | -34.4% |
| 5 | Decision Tree | 0.57333 | 0.6743 | -15.0% |
| 6 | KNN | 0.50623 | 0.9240 | -45.2% |
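A plot along the lines of Visualizations/model_training_comparison.png can be recreated approximately from this table; the styling of the original figure is unknown, and the scores below are simply copied from the rows above:

import matplotlib.pyplot as plt

models      = ['CatBoost', 'CatBoost v2', 'AdaBoost', 'Random Forest', 'Decision Tree', 'KNN']
train_auroc = [0.6383, 0.6383, 0.6438, 0.9116, 0.6743, 0.9240]
kaggle      = [0.64138, 0.63825, 0.63016, 0.59801, 0.57333, 0.50623]

x = range(len(models))
plt.figure(figsize=(9, 4))
plt.bar([i - 0.2 for i in x], train_auroc, width=0.4, label='Training AUROC')
plt.bar([i + 0.2 for i in x], kaggle, width=0.4, label='Kaggle score')
plt.xticks(list(x), models, rotation=20)
plt.ylabel('AUROC')
plt.legend()
plt.tight_layout()
plt.savefig('model_training_comparison.png')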
| Rank | Feature | Importance | Category |
|---|---|---|---|
| 1 | ps_ind_03 | 9.3370 | Individual |
| 2 | ps_car_13 | 7.1361 | Car |
| 3 | ps_reg_01 | 4.9687 | Regional |
| 4 | ps_ind_15 | 4.6452 | Individual |
| 5 | ps_reg_02 | 3.6249 | Regional |
| 6 | ps_ind_05_cat_0.0 | 3.5811 | Categorical |
| 7 | ps_ind_17_bin | 3.3785 | Binary |
| 8 | ps_reg_03 | 3.1532 | Regional |
| 9 | feature4 | 2.5495 | Engineered |
| 10 | ps_car_14 | 2.4677 | Car |
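For reference, these values can be pulled from a fitted CatBoost model; a minimal sketch, where model and X_final are placeholder names for the tuned classifier and the training matrix it was fitted on:

import pandas as pd

importances = pd.Series(model.get_feature_importance(), index=X_final.columns)
print(importances.sort_values(ascending=False).head(10))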
Python 3.10+
Jupyter Notebook

pip install numpy pandas scikit-learn matplotlib seaborn
pip install catboost xgboost lightgbm
pip install jupyter notebook

git clone https://github.com/yourusername/safe-driver-prediction.git
cd safe-driver-prediction

# Load data
import pandas as pd
train = pd.read_csv('train1.csv')
test = pd.read_csv('test.csv')
# Check dimensions
print(f"Training: {train.shape}")
print(f"Testing: {test.shape}")jupyter notebook kaggle.ipynbAll models generate submission files:
submission_CatBoost.csv # Best model
submission_RandomForest.csv
submission_AdaBoost.csv
submission_DecisionTree.csv
submission_KNN.csv
Submit to Kaggle:

kaggle competitions submit -c [competition-name] -f submission_CatBoost.csv -m "CatBoost submission"

Key Findings

KNN: 0.924 training → 0.506 Kaggle (-45% drop!)
CatBoost: 0.638 CV → 0.641 Kaggle (+0.5% gain!)
Lesson: Never trust training metrics without proper cross-validation.
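A sketch of the kind of check that exposes this, using stratified folds so every split keeps the 5.1% positive rate (model, X_final, and y are placeholders for whatever estimator and data are being evaluated):

from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(model, X_final, y, cv=cv, scoring='roc_auc', n_jobs=-1)
print(f"CV AUROC: {scores.mean():.4f} ± {scores.std():.4f}")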
Settings that were initially criticized proved optimal:
- l2_leaf_reg=5 (max regularization)
- learning_rate=0.03 (slow learning)
- Together they prevented the overfitting that destroyed Random Forest
- 12.5 hours of training → 1st place
- 729 model fits prevented overfitting
- Thoroughness beats speed in competitions
class_weights = [1, 5]  # 5x weight for the minority class

Essential for AUROC performance on imbalanced data.
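In CatBoost this is a single constructor argument; a minimal sketch of wiring in the weighting above (learning_rate and depth mirror the best configuration reported earlier, the rest is illustrative):

from catboost import CatBoostClassifier

model = CatBoostClassifier(
    class_weights=[1, 5],    # 5x weight on the minority (claim) class
    learning_rate=0.03,
    depth=6,
    verbose=0,
)
# model.fit(X_final, y) then trains with the reweighted loss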
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 1. Handle Missing Values
cat_imputer = SimpleImputer(strategy='most_frequent')
num_imputer = SimpleImputer(strategy='mean')

# 2. Drop High-Missing Features
drop_cols = ['ps_car_03_cat', 'ps_car_05_cat']  # 69%, 45% missing

# 3. One-Hot Encoding (made dense so it can be concatenated below)
encoder = OneHotEncoder(handle_unknown='ignore')
X_encoded = pd.DataFrame(encoder.fit_transform(X[cat_cols]).toarray(),
                         columns=encoder.get_feature_names_out(cat_cols),
                         index=X.index)

# 4. Feature Scaling (for KNN)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_numeric)

# 5. Final Dataset
X_final = pd.concat([X_encoded, X_numeric, X_binary], axis=1)

from sklearn.metrics import roc_auc_score
auroc = roc_auc_score(y_true, y_pred_proba)

Why AUROC as the metric:
- Handles class imbalance
- Threshold-independent
- Measures discrimination ability
- Accuracy is misleading (94% by predicting all zeros!)
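A tiny synthetic illustration of the last point: with a 94.9% / 5.1% split, always predicting "no claim" looks excellent on accuracy yet has zero discriminative value:

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.051).astype(int)    # ~5.1% positives
y_pred = np.zeros_like(y_true)                        # always predict "no claim"

print(f"Accuracy: {accuracy_score(y_true, y_pred):.3f}")                # ~0.949
print(f"AUROC:    {roc_auc_score(y_true, np.zeros(len(y_true))):.3f}")  # 0.500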
from catboost import CatBoostClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'iterations': [300, 500, 700],
    'learning_rate': [0.03, 0.05, 0.1],
    'depth': [4, 6, 8],
    'l2_leaf_reg': [1, 3, 5],
    'border_count': [32, 64, 128]
}
# 3 × 3 × 3 × 3 × 3 = 243 combinations
# 243 × 3 folds = 729 model fits

# Base estimator; class_weights=[1, 5] as in the winning configuration
catboost_model = CatBoostClassifier(class_weights=[1, 5], verbose=0)

grid_search = GridSearchCV(
    estimator=catboost_model,
    param_grid=param_grid,
    cv=3,                # 3-fold CV
    scoring='roc_auc',
    n_jobs=-1
)
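Fitting the grid and exporting predictions might look like the sketch below; X_final, y, and X_test are the preprocessed matrices from above, and the 'id'/'target' column names follow the usual Porto Seguro submission format (an assumption):

# Fit all 729 candidates and keep the best estimator by CV AUROC
grid_search.fit(X_final, y)
print(grid_search.best_params_, grid_search.best_score_)

best_model = grid_search.best_estimator_

# Predicted claim probabilities for the test set, written in submission format
test_proba = best_model.predict_proba(X_test)[:, 1]
pd.DataFrame({'id': test['id'], 'target': test_proba}).to_csv('submission_CatBoost.csv', index=False)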
Contributions welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/improvement`)
- Commit your changes (`git commit -am 'Add improvement'`)
- Push to the branch (`git push origin feature/improvement`)
- Open a Pull Request
Sagar Lekhraj
- ERP: 29325
- Institution: IBA Karachi
- Email: s.sagar.29325@khi.iba.edu.pk
- LinkedIn: [Your LinkedIn Profile]
- GitHub: @yourusername
Course: CSE 472 - Introduction to Machine Learning
Instructor: Dr. Sajjad Haider, PhD
Department: Computer Science
- Porto Seguro Safe Driver Prediction Dataset
- Scikit-learn Documentation
- CatBoost Official Documentation
- Kaggle Competition Guidelines
This project is licensed under the MIT License - see the LICENSE file for details.
- IBA Karachi Computer Science Department
- Dr. Sajjad Haider for course guidance
- Kaggle community for inspiration
- CatBoost team for excellent documentation
Made with ❤️ and ☕ by Sagar Lekhraj
Week 1: Data Exploration & EDA
Week 2: Preprocessing & Feature Engineering
Week 3: Baseline Models (Naive Bayes, KNN, Decision Tree)
Week 4: Ensemble Methods (Random Forest, AdaBoost)
Week 5: CatBoost Hyperparameter Tuning (12.5 hours!)
Week 6: Final Submission & Report
- Implement SMOTE for better class balance (see the sketch after this list)
- Try XGBoost and LightGBM
- Deep learning approaches (Neural Networks)
- Ensemble stacking of top models
- Feature selection optimization
- Advanced feature engineering
- Bayesian optimization for hyperparameters
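A possible starting point for the SMOTE item, assuming the imbalanced-learn package is installed (pip install imbalanced-learn) and using the X_final/y names from above; oversampling is kept inside the CV loop via a pipeline so validation folds stay untouched:

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from catboost import CatBoostClassifier

pipe = Pipeline([
    ('smote', SMOTE(random_state=42)),        # resamples only the training folds
    ('model', CatBoostClassifier(verbose=0)),
])
scores = cross_val_score(pipe, X_final, y, cv=3, scoring='roc_auc')
print(scores.mean())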
Last Updated: November 2024
