This project focuses on predicting customer churn for a subscription-based service using machine learning. By analyzing customer demographics, financial information, and behavioral metrics, we aim to identify customers likely to discontinue the service. Early identification allows businesses to take proactive retention measures.
The dataset contains 505,207 records with 12 features. It includes a mix of categorical and numerical data.
- Source: Kaggle - Customer Churn Dataset
- Target Variable: `Churn` (1 = Churned, 0 = Retained)
- Demographics: Age, Gender
- Account Info: Tenure, Subscription Type, Contract Length
- Behavioral: Usage Frequency, Support Calls, Last Interaction
- Financial: Payment Delay, Total Spend
- Data Cleaning: Removed duplicate records and handled missing values using median imputation for numerical features and mode imputation for categorical ones.
- Feature Engineering:
- Normalization: Applied Z-score standardization (StandardScaler) to numerical features so that each has zero mean and unit variance, putting features with very different magnitudes on a comparable scale.
- Encoding: Used One-Hot Encoding for categorical variables (`Subscription Type`, `Contract Length`).
- Feature Selection: Recursive Feature Elimination (RFE) was used to identify the most predictive features. Weak predictors such as `Usage Frequency` and `Tenure` (in some contexts) were flagged for removal to simplify the model.
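The preprocessing steps above can be sketched as follows. This is a minimal illustration on a toy frame (column names taken from the dataset description; the exact pipeline and the RFE estimator are assumptions, not the project's code):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the real dataset (column names from the description).
df = pd.DataFrame({
    "Age": [30, 45, np.nan, 52],
    "Total Spend": [120.0, 80.0, 200.0, np.nan],
    "Subscription Type": ["Basic", "Premium", np.nan, "Basic"],
})
num_cols = ["Age", "Total Spend"]
cat_cols = ["Subscription Type"]

# Median imputation for numerical features, mode for categorical ones.
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])

# Z-score standardization, then one-hot encoding.
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
df = pd.get_dummies(df, columns=cat_cols)

# RFE ranks features by recursively dropping the weakest one;
# the base estimator here is an illustrative choice.
y = [0, 1, 0, 1]
selector = RFE(LogisticRegression(), n_features_to_select=2).fit(df, y)
print(dict(zip(df.columns, selector.support_)))
```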
To ensure efficient training and fair evaluation:
- Stratified Subsampling: A subset of 50,000 records was drawn first, with stratification preserving the original class proportions while keeping training efficient.
- SMOTE (Synthetic Minority Over-sampling Technique): Applied to the training data to perfectly balance the class distribution (50/50 split between Churn and Non-Churn), ensuring models don't prioritize the majority class.
Three supervised learning models were trained and tuned using GridSearchCV with 5-fold cross-validation:
- Logistic Regression:
  - Optimized parameters: `C=0.01`, `penalty='l1'`, `solver='saga'`.
- K-Nearest Neighbors (KNN):
  - Optimized parameters: `n_neighbors=7`, `metric='euclidean'`, `weights='distance'`.
- Support Vector Machine (SVM):
  - Optimized parameters: `C=10`, `gamma='scale'`, `kernel='rbf'`.
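The tuning procedure can be sketched for one model. The grid below is a hypothetical search space around the parameters reported for Logistic Regression; the project's actual grids are not shown in this section:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training split.
X, y = make_classification(n_samples=500, random_state=0)

# GridSearchCV tries every combination with 5-fold cross-validation
# and refits the best one on the full training set.
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l1", "l2"],
    "solver": ["saga"],  # saga supports both l1 and l2 penalties
}
search = GridSearchCV(LogisticRegression(max_iter=5000), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```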
The models were evaluated on a held-out test set. Below is the summary of performance:
| Model | Accuracy | Precision (0/1) | Recall (0/1) | F1-Score (0/1) | ROC-AUC |
|---|---|---|---|---|---|
| Logistic Regression | 84.77% | 0.83 / 0.87 | 0.88 / 0.82 | 0.85 / 0.84 | 0.848 |
| K-Nearest Neighbors | 89.48% | 0.89 / 0.90 | 0.90 / 0.89 | 0.90 / 0.89 | 0.895 |
| Support Vector Machine | 91.05% | 0.95 / 0.88 | 0.87 / 0.95 | 0.91 / 0.91 | 0.910 |
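The metrics in the table are the standard scikit-learn ones; a minimal sketch of how they are computed on tiny illustrative labels (not the project's actual predictions):

```python
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

# Illustrative ground truth and model scores; 0.5 is the decision threshold.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.6, 0.8, 0.4, 0.3, 0.7]
y_pred = [int(s >= 0.5) for s in y_score]

print(accuracy_score(y_true, y_pred))         # overall accuracy
print(classification_report(y_true, y_pred))  # per-class precision/recall/F1
print(roc_auc_score(y_true, y_score))         # threshold-free ranking quality
```

Note that ROC-AUC is computed from the continuous scores, while accuracy, precision, recall, and F1 depend on the chosen threshold.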
- Logistic Regression: Train (84.53%) vs Test (84.77%) - Good generalization, slight underfitting.
- KNN: Train (100.00%) vs Test (89.48%) - Significant overfitting. The model memorized the training data.
- SVM: Train (91.47%) vs Test (91.05%) - Excellent generalization. Minimal gap (0.42%) indicates a robust model.
- Logistic Regression serves as a decent baseline but struggles to capture complex non-linear relationships, resulting in the lowest accuracy.
- KNN performed well but is prone to overfitting this dataset, making it less reliable for unseen data without further tuning or dimensionality reduction.
- SVM outperformed the others, achieving the highest accuracy and ROC-AUC score. It effectively separated the classes with a robust margin.
Based on the comprehensive evaluation, Support Vector Machine (SVM) is the recommended model for deployment.
- Highest Accuracy: 91.05%
- Robustness: Low variance between training and testing scores.
- Reliability: High F1-scores for both churned and retained classes ensure balanced performance.
- Ensure you have the required libraries installed: `pandas`, `numpy`, `matplotlib`, `seaborn`, `scikit-learn`.
- Open the Jupyter Notebook `Customer_Churn_Prediction.ipynb`.
- Run all cells to reproduce the analysis, training, and evaluation steps.
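A typical setup might look like the following (the exact commands are an assumption; `imbalanced-learn` is included on the assumption that the SMOTE step requires it):

```shell
# Install the dependencies listed above, plus imbalanced-learn for SMOTE.
pip install pandas numpy matplotlib seaborn scikit-learn imbalanced-learn

# Launch the notebook and run all cells.
jupyter notebook Customer_Churn_Prediction.ipynb
```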