This project aims to develop a predictive model to estimate the likelihood of a customer making a claim on their car insurance during the policy period. Commissioned by a fictional car insurance company, the project seeks to optimize pricing strategies and enhance risk assessment capabilities, crucial for maintaining a competitive edge in the large car insurance market.
The dataset, car_insurance.csv, includes various customer attributes:
| Column | Description |
|---|---|
| `id` | Unique client identifier |
| `age` | Client's age |
| `gender` | Client's gender |
| `driving_experience` | Years the client has been driving |
| `education` | Client's level of education |
| `income` | Client's income level |
| `credit_score` | Client's credit score (between zero and one) |
| `vehicle_ownership` | Client's vehicle ownership status |
| `vehicle_year` | Year of vehicle registration |
| `married` | Client's marital status |
| `children` | Client's number of children |
| `postal_code` | Client's postal code |
| `annual_mileage` | Number of miles driven by the client each year |
| `vehicle_type` | Type of car |
| `speeding_violations` | Total number of speeding violations received by the client |
| `duis` | Number of times the client has been caught driving under the influence of alcohol |
| `past_accidents` | Total number of previous accidents the client has been involved in |
| `outcome` | Whether the client made a claim on their car insurance (response variable) |
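As a minimal sketch of how the file might be loaded and checked for missing values with pandas: the helper name `summarize_missing` is hypothetical, and the inline sample (made-up values, only a few of the columns) stands in for `car_insurance.csv`.

```python
import io

import pandas as pd

# Tiny made-up sample standing in for car_insurance.csv (illustrative values only).
SAMPLE_CSV = """id,credit_score,annual_mileage,outcome
1,0.63,12000,0
2,,16000,1
3,0.49,,0
"""


def summarize_missing(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column, largest first."""
    return df.isna().sum().sort_values(ascending=False)


# With the real file this would be: pd.read_csv("car_insurance.csv")
df = pd.read_csv(io.StringIO(SAMPLE_CSV))
missing = summarize_missing(df)
print(missing)
```

Columns that report missing values here are the ones a cleaning step would need to impute or drop.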
- Data Cleaning and Transformation: Handled missing values and transformed categorical variables into numerical representations.
- Exploratory Data Analysis (EDA): Analyzed correlations and visualized data distributions to understand feature relationships.
- Feature Selection: Applied logistic regression to assess each feature's predictive power.
- Modeling:
  - Logistic Regression: Tuned using GridSearchCV, achieving an accuracy of 79%.
  - Random Forest: Tuned with parameters `max_depth: None`, `min_samples_split: 2`, `n_estimators: 50`, achieving an accuracy of 82%.
- Evaluation: Used confusion matrices and classification reports to evaluate model performance.
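The cleaning and modeling steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's actual code: the data is synthetic, the category labels (`"0-9y"` and so on), the mean imputation, the scaling step, and the parameter grids are all hypothetical choices, and the printed scores say nothing about the real results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (made-up values and category labels).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "driving_experience": rng.choice(["0-9y", "10-19y", "20-29y"], size=n),
    "credit_score": rng.uniform(0, 1, size=n),
    "annual_mileage": rng.normal(12000, 2500, size=n),
})
# Make the outcome depend on experience so the models have some signal to find.
p_claim = df["driving_experience"].map({"0-9y": 0.6, "10-19y": 0.3, "20-29y": 0.1})
df["outcome"] = (rng.uniform(size=n) < p_claim).astype(int)
df.loc[rng.choice(n, size=20, replace=False), "credit_score"] = np.nan

# Cleaning: fill numeric gaps with the column mean (one possible choice).
df["credit_score"] = df["credit_score"].fillna(df["credit_score"].mean())
# Transformation: one-hot encode the categorical column.
X = pd.get_dummies(df.drop(columns="outcome"), drop_first=True, dtype=float)
y = df["outcome"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Hyperparameter search for both model families (grids are assumptions).
searches = {
    "logistic_regression": GridSearchCV(
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        param_grid={"logisticregression__C": [0.1, 1.0, 10.0]},
        cv=5,
        scoring="accuracy",
    ),
    "random_forest": GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid={
            "max_depth": [None, 5],
            "min_samples_split": [2, 5],
            "n_estimators": [50, 100],
        },
        cv=5,
        scoring="accuracy",
    ),
}

test_accuracy = {}
for name, search in searches.items():
    search.fit(X_train, y_train)
    test_accuracy[name] = search.score(X_test, y_test)
    print(name, search.best_params_, round(test_accuracy[name], 3))
```

Holding out a test split before the search keeps the reported accuracy separate from the cross-validation folds used to pick the parameters.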
- Best Feature: `driving_experience` was identified as the most predictive feature, with an accuracy of 77.71%.
- Model Performance:
  - Logistic Regression: Precision of 0.92 for "No Claim" and 0.63 for "Claim".
  - Random Forest: Precision of 0.85 for "No Claim" and 0.70 for "Claim".
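One way a per-feature accuracy like the one reported for the best feature could be obtained is to fit a one-feature logistic regression per column and compare scores. The sketch below assumes that approach; the function name `rank_features_by_accuracy`, the use of 5-fold cross-validation, the ordinal encoding, and the synthetic data are all assumptions rather than the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def rank_features_by_accuracy(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Fit a one-feature logistic regression per column; return mean CV accuracy, best first."""
    scores = {}
    for column in X.columns:
        model = LogisticRegression(max_iter=1000)
        scores[column] = cross_val_score(
            model, X[[column]], y, cv=5, scoring="accuracy"
        ).mean()
    return pd.Series(scores).sort_values(ascending=False)


# Synthetic, already-encoded data (made-up): one informative column, one noise column.
rng = np.random.default_rng(1)
n = 400
X = pd.DataFrame({
    "driving_experience": rng.integers(0, 4, size=n),  # hypothetical ordinal encoding
    "children": rng.integers(0, 2, size=n),
})
# Claim probability falls as experience rises; "children" has no effect by construction.
y = pd.Series((rng.uniform(size=n) < 0.8 - 0.25 * X["driving_experience"]).astype(int))

ranking = rank_features_by_accuracy(X, y)
print(ranking)
```

Because the synthetic outcome depends only on `driving_experience`, that column should rank first here; on the real data the ranking depends on the actual feature relationships.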
Confusion matrices and correlation heatmaps are included to provide visual insights into model performance and feature relationships.
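To show how per-class precision figures relate to a confusion matrix, here is a small sketch using scikit-learn's metrics. The labels and predictions are made up for illustration (0 = "No Claim", 1 = "Claim") and do not reproduce the project's numbers.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, precision_score

# Made-up labels and predictions; not the project's results.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Precision for a class = correct predictions of that class / all predictions of that class.
precision_no_claim = tn / (tn + fn)
precision_claim = tp / (tp + fp)

print(cm)
print(classification_report(y_true, y_pred, target_names=["No Claim", "Claim"]))
```

Precision for "Claim" answers the question "of the customers predicted to claim, how many actually did", which is why it can differ noticeably between two models with similar overall accuracy.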
The project successfully identified key predictive features and developed models that enhance decision-making processes. The Random Forest model, with its higher accuracy (82% versus 79% for Logistic Regression), is recommended for deployment, providing a balance between simplicity and performance. This approach allows the company to start with a straightforward model in production, minimizing the need for complex infrastructure and expertise.
To run this project, ensure you have the following libraries installed:
- pandas
- numpy
- matplotlib
- seaborn
- scikit-learn
Clone the repository and run the Jupyter Notebook to explore the analysis and predictions.
This project is licensed under the MIT License - see the LICENSE file for details.