This project develops predictive models to estimate car insurance claim likelihood using customer data. It utilizes logistic regression and random forest algorithms, with the Random Forest model achieving 82% accuracy. Visualizations are included to enhance understanding of model performance and feature relationships.

evdimitriou/Modeling-Car-Insurances

Modeling Car Insurances

Overview

This project aims to develop a predictive model to estimate the likelihood of a customer making a claim on their car insurance during the policy period. Commissioned by a fictional car insurance company, the project seeks to optimize pricing strategies and enhance risk assessment capabilities, crucial for maintaining a competitive edge in the large car insurance market.

Data Contents

The dataset, car_insurance.csv, includes various customer attributes:

Column Description
id Unique client identifier
age Client's age:
  • 0: 16-25
  • 1: 26-39
  • 2: 40-64
  • 3: 65+
gender Client's gender:
  • 0: Female
  • 1: Male
driving_experience Years the client has been driving:
  • 0: 0-9
  • 1: 10-19
  • 2: 20-29
  • 3: 30+
education Client's level of education:
  • 0: No education
  • 1: High school
  • 2: University
income Client's income level:
  • 0: Poverty
  • 1: Working class
  • 2: Middle class
  • 3: Upper class
credit_score Client's credit score (between zero and one)
vehicle_ownership Client's vehicle ownership status:
  • 0: Does not own their vehicle (paying off finance)
  • 1: Owns their vehicle
vehicle_year Year of vehicle registration:
  • 0: Before 2015
  • 1: 2015 or later
married Client's marital status:
  • 0: Not married
  • 1: Married
children Client's number of children
postal_code Client's postal code
annual_mileage Number of miles driven by the client each year
vehicle_type Type of car:
  • 0: Sedan
  • 1: Sports car
speeding_violations Total number of speeding violations received by the client
duis Number of times the client has been caught driving under the influence of alcohol
past_accidents Total number of previous accidents the client has been involved in
outcome Whether the client made a claim on their car insurance (response variable):
  • 0: No claim
  • 1: Made a claim
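
As an illustration of how the coded columns can be read back into labels, here is a minimal sketch. It assumes the columns hold the integer codes listed above; `CODE_LABELS` and `decode_codes` are hypothetical helper names, and the small sample frame is made up to stand in for `pd.read_csv("car_insurance.csv")`.

```python
import pandas as pd

# Code-to-label mappings copied from the table above (only a few columns shown).
CODE_LABELS = {
    "age": {0: "16-25", 1: "26-39", 2: "40-64", 3: "65+"},
    "driving_experience": {0: "0-9", 1: "10-19", 2: "20-29", 3: "30+"},
    "income": {0: "Poverty", 1: "Working class", 2: "Middle class", 3: "Upper class"},
    "outcome": {0: "No claim", 1: "Made a claim"},
}


def decode_codes(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with coded columns replaced by readable labels."""
    decoded = df.copy()
    for column, labels in CODE_LABELS.items():
        if column in decoded.columns:
            decoded[column] = decoded[column].map(labels)
    return decoded


# Made-up rows standing in for the real dataset.
sample = pd.DataFrame(
    {"id": [1, 2], "age": [0, 3], "income": [2, 1], "outcome": [0, 1]}
)
decoded_sample = decode_codes(sample)
print(decoded_sample)
```

Decoding is mainly useful for plots and tables during EDA; the models themselves work on the numeric codes.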

Methodology

  1. Data Cleaning and Transformation: Handled missing values and transformed categorical variables into numerical representations.

  2. Exploratory Data Analysis (EDA): Analyzed correlations and visualized data distributions to understand feature relationships.

  3. Feature Selection: Fitted a logistic regression on each feature individually and compared the resulting accuracies to assess each feature's predictive power.

  4. Modeling:

    • Logistic Regression: Tuned using GridSearchCV, achieving an accuracy of 79%.
    • Random Forest: Tuned, with the selected parameters max_depth=None, min_samples_split=2, n_estimators=50, achieving an accuracy of 82%.
  5. Evaluation: Used confusion matrices and classification reports to evaluate model performance.
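
The steps above can be sketched roughly as follows. This is not the notebook's code. It runs on synthetic data, and the mean imputation, the scaling step, the `C` grid, and the 75/25 split are all assumptions chosen for illustration. Only the Random Forest parameters come from the text above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for car_insurance.csv (made-up values, not the real data).
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "driving_experience": rng.integers(0, 4, n),
    "credit_score": rng.random(n),
    "annual_mileage": rng.integers(2, 22, n) * 1000.0,
})
# Claims are made more likely for inexperienced drivers, purely for illustration.
y = (rng.random(n) < 0.6 - 0.15 * X["driving_experience"]).astype(int)
# Blank out some values to mimic missing data.
X.loc[rng.choice(n, 30, replace=False), "credit_score"] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Logistic regression: impute, scale, then tune C with cross-validated grid search.
log_reg = GridSearchCV(
    make_pipeline(
        SimpleImputer(strategy="mean"),
        StandardScaler(),
        LogisticRegression(max_iter=1000),
    ),
    param_grid={"logisticregression__C": [0.01, 0.1, 1, 10]},
    scoring="accuracy",
    cv=5,
)
log_reg.fit(X_train, y_train)

# Random forest with the parameters reported above.
forest = make_pipeline(
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(
        n_estimators=50, max_depth=None, min_samples_split=2, random_state=0
    ),
)
forest.fit(X_train, y_train)

# Evaluation: confusion matrix and classification report for each model.
for name, model in [("Logistic Regression", log_reg), ("Random Forest", forest)]:
    predictions = model.predict(X_test)
    print(name)
    print(confusion_matrix(y_test, predictions))
    print(classification_report(
        y_test, predictions, target_names=["No Claim", "Claim"], zero_division=0
    ))
```

Keeping the imputer inside each pipeline means the fill values are learned from the training folds only, which avoids leaking test data into the cleaning step.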

Results

  • Best Feature: driving_experience was the most predictive single feature; a logistic regression using it alone reached an accuracy of 77.71%.
  • Model Performance:
    • Logistic Regression: Precision of 0.92 for "No Claim" and 0.63 for "Claim".
    • Random Forest: Precision of 0.85 for "No Claim" and 0.70 for "Claim".
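
One way a "best feature" figure like the one above could be produced is to fit a logistic regression on each column alone and rank the columns by accuracy. The sketch below uses 5-fold cross-validation, which is an assumption (the original may have used a single train/test split). `rank_single_features` is a hypothetical helper name and the data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def rank_single_features(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Cross-validated accuracy of a logistic regression fitted on each column alone."""
    scores = {}
    for column in X.columns:
        model = LogisticRegression(max_iter=1000)
        scores[column] = cross_val_score(
            model, X[[column]], y, cv=5, scoring="accuracy"
        ).mean()
    # Highest accuracy first.
    return pd.Series(scores).sort_values(ascending=False)


# Made-up data: "signal" drives the outcome, "noise" does not.
rng = np.random.default_rng(1)
signal = rng.integers(0, 4, 300)
X_demo = pd.DataFrame({"signal": signal, "noise": rng.integers(0, 4, 300)})
y_demo = pd.Series((signal >= 2).astype(int))

ranking = rank_single_features(X_demo, y_demo)
print(ranking)
```

Single-feature accuracy ignores interactions between features, so it is best read as a quick screening step, not a full importance measure.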

Visualizations

Confusion matrices and correlation heatmaps are included to provide visual insights into model performance and feature relationships.
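
Plots of this kind can be drawn with seaborn's `heatmap`. The sketch below is only an illustration: the feature values are random and the confusion matrix counts are hypothetical, not results from this project.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random stand-in data using a few column names from the dataset.
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "driving_experience": rng.integers(0, 4, 200),
    "speeding_violations": rng.integers(0, 6, 200),
    "past_accidents": rng.integers(0, 4, 200),
})
corr = demo.corr()

# Hypothetical confusion matrix counts (rows: actual, columns: predicted).
cm = np.array([[50, 10], [8, 32]])
class_names = ["No Claim", "Claim"]

fig, (ax_corr, ax_cm) = plt.subplots(1, 2, figsize=(11, 4))

# Correlation heatmap with a fixed -1..1 colour scale.
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax_corr)
ax_corr.set_title("Feature correlations")

# Confusion matrix heatmap with integer annotations.
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=class_names, yticklabels=class_names, ax=ax_cm)
ax_cm.set_xlabel("Predicted")
ax_cm.set_ylabel("Actual")
ax_cm.set_title("Confusion matrix")

fig.tight_layout()
```

In a notebook the figure renders inline; in a script, `fig.savefig(...)` writes it to disk.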

Conclusion

The project identified the most predictive features and produced two models for estimating claim likelihood. The Random Forest model is recommended for deployment because of its higher accuracy (82% versus 79% for logistic regression), offering a balance between simplicity and performance. This approach allows the company to start with a straightforward model in production, minimizing the need for complex infrastructure and expertise.

Getting Started

To run this project, ensure you have the following libraries installed:

  • pandas
  • numpy
  • matplotlib
  • seaborn
  • scikit-learn
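
A quick way to check that these libraries are available before opening the notebook is sketched below. `REQUIRED` and `missing_packages` are hypothetical names used only for this example.

```python
from importlib.util import find_spec

# pip name -> import name (scikit-learn is imported as "sklearn").
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "scikit-learn": "sklearn",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of required packages that cannot be found."""
    return [
        pip_name for pip_name, module in required.items() if find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```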

Clone the repository and run the Jupyter Notebook to explore the analysis and predictions.

License

This project is licensed under the MIT License - see the LICENSE file for details.
