A machine learning coursework repository featuring data preprocessing, model training, hyperparameter tuning, and leaderboard submissions for a Kaggle competition.

🚗 Enhanced Safe Driver Prediction Challenge

Machine Learning Project: Predicting insurance claim probability using advanced classification techniques

Python Jupyter Kaggle License


📋 Table of Contents

  • Overview
  • Competition Achievement
  • Problem Statement
  • Dataset
  • Project Structure
  • Models Implemented
  • Results
  • Installation
  • Usage
  • Key Findings
  • Data Preprocessing Pipeline
  • Model Evaluation Metrics
  • Technical Highlights
  • Contributing
  • Author
  • References
  • License

🎯 Overview

This project tackles the Enhanced Safe Driver Prediction Challenge, focusing on predicting the probability that an auto insurance policyholder will file a claim. Built on an improved version of the Porto Seguro dataset, this project emphasizes:

  • ✨ Smart Feature Engineering
  • ⚖️ Handling Severely Imbalanced Data (94.9% vs 5.1%)
  • 🎯 Maximizing AUROC Performance
  • 🔄 Robust Cross-Validation

๐Ÿ† Competition Achievement

Best Model: CatBoost with Hyperparameter Tuning

  • Kaggle Public Score: 0.64138 ๐Ÿฅ‡
  • Training Time: 12.5 hours
  • Key Success: Perfect generalization (CV score matched Kaggle score)

💡 Problem Statement

Insurance companies must assess risk to determine premiums and minimize financial losses. This project develops a machine learning classifier to:

  • 📊 Predict claim probability based on policyholder and vehicle features
  • 🎯 Achieve high AUROC for effective risk discrimination
  • 💰 Enable personalized premium pricing
  • 🚫 Reduce fraudulent claims

📊 Dataset

Statistics

Metric                 Value
Training Samples       296,209
Test Samples           126,948
Features               67
Numeric Variables      37
Categorical Variables  30
Class Imbalance        18.5:1
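
The table above can be reproduced from the raw CSV files. A minimal sketch, assuming the files sit next to the notebook, the label column is named target, and each file carries an id column in the standard Kaggle format:

import pandas as pd

train = pd.read_csv('train1.csv')
test = pd.read_csv('test.csv')

# Sample counts and feature count (excluding the id and target columns)
print(f"Training samples: {len(train):,}")
print(f"Test samples:     {len(test):,}")
print(f"Features:         {train.shape[1] - 2}")

# Class imbalance: ratio of non-claims (0) to claims (1)
counts = train['target'].value_counts()
print(f"Class imbalance:  {counts[0] / counts[1]:.1f}:1")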

Feature Categories

๐Ÿ“ Individual Variables (ps_ind_*)
๐Ÿš™ Car-Related Variables (ps_car_*)
๐Ÿ—บ๏ธ Regional Variables (ps_reg_*)
๐Ÿงฎ Calculated Variables (ps_calc_*)
โš™๏ธ Engineered Features (feature1-8)
๐ŸŽฏ Target Variable (binary: 0/1)

Data Quality Challenges

  • ⚠️ Missing Data: Up to 69% in some features
  • ⚖️ Severe Class Imbalance: 94.9% non-claims
  • 🔗 High Correlation: 21 variables flagged
  • 0️⃣ Zero-Inflation: Multiple variables
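
A quick EDA pass makes these issues visible. A minimal sketch, assuming missing values are stored as NaN (the original Porto Seguro data encodes them as -1, so convert first if needed) and using an illustrative |r| > 0.5 correlation threshold:

import numpy as np
import pandas as pd

train = pd.read_csv('train1.csv')

# Missing data: percentage of missing values per feature, worst first
missing_pct = train.isna().mean().sort_values(ascending=False) * 100
print(missing_pct.head(10))

# Class imbalance: share of non-claims vs claims
target_share = train['target'].value_counts(normalize=True) * 100
print(f"Non-claims: {target_share[0]:.1f}% | Claims: {target_share[1]:.1f}%")

# High correlation: numeric features with at least one strong pairwise correlation
corr = train.select_dtypes('number').corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
flagged = [col for col in upper.columns if (upper[col] > 0.5).any()]
print(f"{len(flagged)} variables flagged for high correlation")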

๐Ÿ“ Project Structure

enhanced-safe-driver-prediction/
โ”‚
โ”œโ”€โ”€ ๐Ÿ““ kaggle.ipynb                 # Main training notebook
โ”œโ”€โ”€ ๐Ÿ“„ kaggle_report.pdf           # Comprehensive project report
โ”‚
โ”œโ”€โ”€ ๐Ÿ“Š Data/
โ”‚   โ”œโ”€โ”€ train1.csv                 # Training dataset
โ”‚   โ””โ”€โ”€ test.csv                   # Test dataset
โ”‚
โ”œโ”€โ”€ ๐Ÿ’พ Models/
โ”‚   โ”œโ”€โ”€ submission_CatBoost.csv    # Winner! ๐Ÿ†
โ”‚   โ”œโ”€โ”€ submission_RandomForest.csv
โ”‚   โ”œโ”€โ”€ submission_AdaBoost.csv
โ”‚   โ”œโ”€โ”€ submission_DecisionTree.csv
โ”‚   โ”œโ”€โ”€ submission_KNN.csv
โ”‚   โ””โ”€โ”€ submission_NaiveBayes.csv
โ”‚
โ”œโ”€โ”€ ๐Ÿ“ˆ Visualizations/
โ”‚   โ””โ”€โ”€ model_training_comparison.png
โ”‚
โ””โ”€โ”€ ๐Ÿ“– README.md                   # This file

🤖 Models Implemented

1. Categorical Naive Bayes

  • ⚡ Training Time: 1.86s
  • 📊 Train AUROC: 0.6423
  • 🎯 Kaggle Score: Not submitted
  • 💭 Note: Fast baseline with independence assumption

2. K-Nearest Neighbors (k=5)

  • ⚡ Training Time: 4.21s
  • 📊 Train AUROC: 0.9240 (Highest!)
  • 🎯 Kaggle Score: 0.50623 (Worst: overfitting!)
  • ⚠️ Warning: Memorized training data

3. Decision Tree (depth=10)

  • ⚡ Training Time: 12.50s
  • 📊 Train AUROC: 0.6743
  • 🎯 Kaggle Score: 0.57333
  • 📋 Nodes: 1,023 | Leaves: 512

4. Random Forest (100 trees)

  • ⚡ Training Time: 48.85s
  • 📊 Train AUROC: 0.9116
  • 🎯 Kaggle Score: 0.59801
  • ⚠️ Issue: 34% performance drop (overfitting)

5. AdaBoost (100 estimators)

  • ⚡ Training Time: 341.44s
  • 📊 Train AUROC: 0.6438
  • 🎯 Kaggle Score: 0.63016 (3rd place)
  • 💡 Strength: Good with imbalanced data

6. CatBoost (Grid Search) 🏆

  • ⚡ Training Time: 45,005s (12.5 hours)
  • 📊 CV AUROC: 0.6383
  • 🎯 Kaggle Score: 0.64138 (BEST!)
  • 🎨 Parameters: 243 combinations × 3 folds = 729 fits
  • ✨ Key: Strong generalization (CV score closely matched the Kaggle score)

Optimal Hyperparameters

{
    'iterations': 500,
    'learning_rate': 0.03,
    'depth': 6,
    'l2_leaf_reg': 5,
    'border_count': 32,
    'class_weights': [1, 5]
}
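
For reference, a minimal sketch of refitting the tuned model on the full training set and scoring the test set. X_final and y come from the preprocessing pipeline below; X_test_final stands for the test features after the same transformations, so the names are illustrative:

from catboost import CatBoostClassifier

best_params = {
    'iterations': 500,
    'learning_rate': 0.03,
    'depth': 6,
    'l2_leaf_reg': 5,
    'border_count': 32,
    'class_weights': [1, 5]
}

# Refit on all training data with the tuned settings
model = CatBoostClassifier(**best_params, eval_metric='AUC', verbose=100)
model.fit(X_final, y)

# Probability of the positive class (a claim being filed) for each test row
test_pred = model.predict_proba(X_test_final)[:, 1]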

📈 Results

Final Kaggle Leaderboard

Rank  Model          Kaggle Score  Training AUROC  Gap
🥇 1  CatBoost       0.64138       0.6383          +0.5% ✅
🥈 2  CatBoost v2    0.63825       0.6383          ±0.0%
🥉 3  AdaBoost       0.63016       0.6438          -2.1%
4     Random Forest  0.59801       0.9116          -34.4% ⚠️
5     Decision Tree  0.57333       0.6743          -15.0%
6     KNN            0.50623       0.9240          -45.2% 🚫

Performance Visualization

Model comparison chart: Visualizations/model_training_comparison.png

Top 10 Important Features (CatBoost)

Rank  Feature            Importance  Category
1     ps_ind_03          9.3370      Individual
2     ps_car_13          7.1361      Car
3     ps_reg_01          4.9687      Regional
4     ps_ind_15          4.6452      Individual
5     ps_reg_02          3.6249      Regional
6     ps_ind_05_cat_0.0  3.5811      Categorical
7     ps_ind_17_bin      3.3785      Binary
8     ps_reg_03          3.1532      Regional
9     feature4           2.5495      Engineered
10    ps_car_14          2.4677      Car
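
A table like this can be pulled straight from the fitted model. A minimal sketch, assuming the CatBoostClassifier model and feature matrix X_final from the training sketch above:

import pandas as pd

# CatBoost returns importances in the same order as the training columns
importances = pd.DataFrame({
    'Feature': X_final.columns,
    'Importance': model.get_feature_importance()
}).sort_values('Importance', ascending=False)

print(importances.head(10))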

🚀 Installation

Prerequisites

Python 3.10+
Jupyter Notebook

Required Libraries

pip install numpy pandas scikit-learn matplotlib seaborn
pip install catboost xgboost lightgbm
pip install jupyter notebook

Clone Repository

git clone https://github.com/yourusername/safe-driver-prediction.git
cd safe-driver-prediction

💻 Usage

1. Data Preparation

# Load data
import pandas as pd
train = pd.read_csv('train1.csv')
test = pd.read_csv('test.csv')

# Check dimensions
print(f"Training: {train.shape}")
print(f"Testing: {test.shape}")

2. Run Training Pipeline

jupyter notebook kaggle.ipynb

3. Generate Predictions

All models generate submission files:

submission_CatBoost.csv      # Best model
submission_RandomForest.csv
submission_AdaBoost.csv
submission_DecisionTree.csv
submission_KNN.csv
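
Each file follows the standard two-column Kaggle submission format. A minimal sketch of how one is written, assuming the test id column is named id and test_pred holds the predicted claim probabilities from the CatBoost sketch above:

import pandas as pd

submission = pd.DataFrame({
    'id': test['id'],
    'target': test_pred   # predicted claim probabilities, not hard labels
})
submission.to_csv('submission_CatBoost.csv', index=False)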

4. Submit to Kaggle

kaggle competitions submit -c [competition-name] -f submission_CatBoost.csv -m "CatBoost submission"

🔑 Key Findings

🎯 Critical Lessons

1. Training Scores Are Deceptive

KNN: 0.924 training → 0.506 Kaggle (-45% drop!)
CatBoost: 0.638 CV → 0.641 Kaggle (+0.5% gain!)

Lesson: Never trust training metrics without proper cross-validation.
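
The check this lesson implies is simple: score the model on the data it was fit on, then compare against a cross-validated score. A minimal sketch using the KNN configuration from above (variable names as in the preprocessing pipeline):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_final, y)

# Score on the training data itself: looks impressive, means little
train_auroc = roc_auc_score(y, knn.predict_proba(X_final)[:, 1])

# 3-fold cross-validated AUROC: a far more honest estimate
cv_auroc = cross_val_score(knn, X_final, y, cv=3, scoring='roc_auc').mean()

print(f"Train AUROC: {train_auroc:.3f} | CV AUROC: {cv_auroc:.3f}")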

2. Conservative Parameters Win

Settings that initially drew criticism proved optimal:

  • l2_leaf_reg=5 (the strongest regularization in the grid)
  • learning_rate=0.03 (the slowest learning rate tested)
  • Together they prevented the overfitting that destroyed Random Forest and KNN

3. Time Investment Pays Off

  • 12.5 hours training → 1st place
  • 729 model fits prevented overfitting
  • Thoroughness beats speed in competitions

4. Class Imbalance Handling

class_weights = [1, 5]  # 5x weight for minority class

Essential for AUROC performance on imbalanced data.
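
For reference, a minimal sketch of how this weighting is passed to CatBoost, alongside an alternative that derives the minority weight from the observed 18.5:1 imbalance (illustrative; the project used the fixed [1, 5] weighting):

from catboost import CatBoostClassifier

# Fixed weighting used in this project: 5x penalty for misclassifying a claim
model = CatBoostClassifier(class_weights=[1, 5])

# Alternative: set the minority weight to the imbalance ratio (about 18.5)
neg, pos = (y == 0).sum(), (y == 1).sum()
model_ratio = CatBoostClassifier(class_weights=[1, neg / pos])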


🧮 Data Preprocessing Pipeline

# 0. Imports
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 1. Handle Missing Values
cat_imputer = SimpleImputer(strategy='most_frequent')   # categorical features
num_imputer = SimpleImputer(strategy='mean')            # numeric features

# 2. Drop High-Missing Features
drop_cols = ['ps_car_03_cat', 'ps_car_05_cat']  # 69% and 45% missing

# 3. One-Hot Encoding
encoder = OneHotEncoder(handle_unknown='ignore')
X_encoded = pd.DataFrame(
    encoder.fit_transform(X[cat_cols]).toarray(),        # densify for concatenation
    columns=encoder.get_feature_names_out(cat_cols),
    index=X.index,
)

# 4. Feature Scaling (for distance-based models such as KNN)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_numeric)

# 5. Final Dataset: encoded categoricals + raw numerics + binary flags
X_final = pd.concat([X_encoded, X_numeric, X_binary], axis=1)

📊 Model Evaluation Metrics

Primary Metric: AUROC

from sklearn.metrics import roc_auc_score

auroc = roc_auc_score(y_true, y_pred_proba)

Why AUROC?

  • ✅ Handles class imbalance
  • ✅ Threshold-independent
  • ✅ Measures discrimination ability
  • ❌ Accuracy is misleading (94.9% by predicting all zeros!)

🎓 Technical Highlights

Grid Search Configuration

param_grid = {
    'iterations': [300, 500, 700],
    'learning_rate': [0.03, 0.05, 0.1],
    'depth': [4, 6, 8],
    'l2_leaf_reg': [1, 3, 5],
    'border_count': [32, 64, 128]
}

# 3×3×3×3×3 = 243 combinations
# 243 × 3 folds = 729 model fits

Cross-Validation Strategy

from catboost import CatBoostClassifier
from sklearn.model_selection import GridSearchCV

catboost_model = CatBoostClassifier(class_weights=[1, 5], verbose=0)

grid_search = GridSearchCV(
    estimator=catboost_model,
    param_grid=param_grid,
    cv=3,               # 3-fold CV
    scoring='roc_auc',  # competition metric
    n_jobs=-1
)
grid_search.fit(X_final, y)

🤝 Contributing

Contributions welcome! Please follow these steps:

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/improvement)
  3. Commit changes (git commit -am 'Add improvement')
  4. Push to branch (git push origin feature/improvement)
  5. Open Pull Request

👨‍💻 Author

Sagar Lekhraj

Course: CSE 472 - Introduction to Machine Learning
Instructor: Dr. Sajjad Haider, PhD
Department: Computer Science


📚 References

  1. Porto Seguro Safe Driver Prediction Dataset
  2. Scikit-learn Documentation
  3. CatBoost Official Documentation
  4. Kaggle Competition Guidelines

๐Ÿ“ License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

  • IBA Karachi Computer Science Department
  • Dr. Sajjad Haider for course guidance
  • Kaggle community for inspiration
  • CatBoost team for excellent documentation

⭐ If you found this project helpful, please star the repository! ⭐

Made with ❤️ and ☕ by Sagar Lekhraj


📅 Project Timeline

Week 1: Data Exploration & EDA
Week 2: Preprocessing & Feature Engineering
Week 3: Baseline Models (Naive Bayes, KNN, Decision Tree)
Week 4: Ensemble Methods (Random Forest, AdaBoost)
Week 5: CatBoost Hyperparameter Tuning (12.5 hours!)
Week 6: Final Submission & Report

🔮 Future Work

  • Implement SMOTE for better class balance (see the sketch after this list)
  • Try XGBoost and LightGBM
  • Deep learning approaches (Neural Networks)
  • Ensemble stacking of top models
  • Feature selection optimization
  • Advanced feature engineering
  • Bayesian optimization for hyperparameters
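
A minimal sketch of the SMOTE idea from the first item, assuming the imbalanced-learn package (not a current project dependency); oversampling would typically replace the class-weight approach used so far:

import pandas as pd
from imblearn.over_sampling import SMOTE

# Oversample the minority (claim) class on the training split only,
# never on validation or test data, to avoid leakage
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_final, y)

print(pd.Series(y_resampled).value_counts(normalize=True))  # roughly 50/50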

Last Updated: November 2024
