vicky16898/Customer-Retention-Predictor


Customer Churn Prediction

Project Overview

This project focuses on predicting customer churn for a subscription-based service using machine learning. By analyzing customer demographics, financial information, and behavioral metrics, we aim to identify customers likely to discontinue the service. Early identification allows businesses to take proactive retention measures.

Dataset

The dataset contains 505,207 records with 12 features. It includes a mix of categorical and numerical data.

Key Features

  • Demographics: Age, Gender
  • Account Info: Tenure, Subscription Type, Contract Length
  • Behavioral: Usage Frequency, Support Calls, Last Interaction
  • Financial: Payment Delay, Total Spend

Methodology

1. Data Analysis and Preprocessing

  • Data Cleaning: Removed duplicate records and handled missing values using median imputation for numerical features and mode imputation for categorical ones.
  • Feature Engineering:
    • Normalization: Applied Z-score standardization (StandardScaler) to numerical features, rescaling each to zero mean and unit variance so all features are on a comparable scale.
    • Encoding: Used One-Hot Encoding for categorical variables (Subscription Type, Contract Length).
  • Feature Selection: Recursive Feature Elimination (RFE) was used to identify the most predictive features. Weak predictors like Usage Frequency and Tenure (in some contexts) were flagged for removal to simplify the model.
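The cleaning and encoding steps above can be sketched with scikit-learn pipelines. The column names and the tiny synthetic frame below are stand-ins for the real 505k-row dataset; the actual notebook may organize this differently:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic stand-in for the real dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.integers(18, 70, 20).astype(float),
    "Total Spend": rng.uniform(100.0, 900.0, 20),
    "Subscription Type": rng.choice(["Basic", "Standard", "Premium"], 20),
    "Contract Length": rng.choice(["Monthly", "Quarterly", "Annual"], 20),
})
df.loc[3, "Age"] = np.nan          # simulate a missing value
df = df.drop_duplicates()          # data cleaning: drop exact duplicates

# Median imputation + Z-score scaling for numeric columns,
# mode imputation + one-hot encoding for categorical columns
numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])
categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                        ("encode", OneHotEncoder(handle_unknown="ignore"))])
pre = ColumnTransformer([
    ("num", numeric, ["Age", "Total Spend"]),
    ("cat", categorical, ["Subscription Type", "Contract Length"]),
])
X = pre.fit_transform(df)
print(X.shape)  # rows x (2 scaled numeric + one-hot columns)
```

Feature selection would then run on the transformed matrix, e.g. sklearn.feature_selection.RFE with a LogisticRegression estimator to rank and prune weak predictors.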

2. Handling Class Imbalance

To ensure efficient training and fair evaluation:

  • Stratified Subsampling: A balanced subset of 50,000 records was initially taken.
  • SMOTE (Synthetic Minority Over-sampling Technique): Applied to the training data only, balancing the class distribution to an exact 50/50 split between Churn and Non-Churn so the models do not simply favor the majority class.

3. Model Training and Hyperparameter Tuning

Three supervised learning models were trained and tuned using GridSearchCV with 5-fold cross-validation:

  1. Logistic Regression:
    • Optimized parameters: C=0.01, penalty='l1', solver='saga'.
  2. K-Nearest Neighbors (KNN):
    • Optimized parameters: n_neighbors=7, metric='euclidean', weights='distance'.
  3. Support Vector Machine (SVM):
    • Optimized parameters: C=10, gamma='scale', kernel='rbf'.
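As one example, the logistic regression search might look like the sketch below; the grid values around the reported optimum are assumptions, and the KNN and SVM searches follow the same GridSearchCV pattern:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed, balanced training data
X, y = make_classification(n_samples=500, random_state=0)

# 5-fold grid search over regularization strength and penalty type;
# the 'saga' solver supports both l1 and l2 penalties
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l1", "l2"],
    "solver": ["saga"],
}
search = GridSearchCV(LogisticRegression(max_iter=5000),
                      param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```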

Results & Evaluation Report

Performance Metrics

The models were evaluated on a held-out test set. Below is the summary of performance:

Model                    Accuracy   Precision (0/1)   Recall (0/1)   F1-Score (0/1)   ROC-AUC
Logistic Regression      84.77%     0.83 / 0.87       0.88 / 0.82    0.85 / 0.84      0.848
K-Nearest Neighbors      89.48%     0.89 / 0.90       0.90 / 0.89    0.90 / 0.89      0.895
Support Vector Machine   91.05%     0.95 / 0.88       0.87 / 0.95    0.91 / 0.91      0.910
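The per-class precision/recall, F1, and ROC-AUC figures above can be computed with sklearn.metrics. The sketch below uses the tuned SVM settings on synthetic stand-in data; probability=True is needed so SVC exposes predict_proba for ROC-AUC:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Tuned SVM from the table; probability=True enables predict_proba
svm = SVC(C=10, gamma="scale", kernel="rbf", probability=True,
          random_state=0).fit(X_tr, y_tr)

print(classification_report(y_te, svm.predict(X_te)))  # per-class P/R/F1
auc = roc_auc_score(y_te, svm.predict_proba(X_te)[:, 1])
print("ROC-AUC:", round(auc, 3))
```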

Training vs. Testing Accuracy (Generalization)

  • Logistic Regression: Train (84.53%) vs Test (84.77%) - Good generalization, slight underfitting.
  • KNN: Train (100.00%) vs Test (89.48%) - Significant overfitting. The model memorized the training data.
  • SVM: Train (91.47%) vs Test (91.05%) - Excellent generalization. Minimal gap (0.42%) indicates a robust model.
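The KNN result above is partly structural: with weights='distance', every training point is its own nearest neighbor at distance zero, so training accuracy is pinned at 100% and only the train/test gap is informative. A small sketch of the check on synthetic data (sizes are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

knn = KNeighborsClassifier(n_neighbors=7, weights="distance").fit(X_tr, y_tr)

# Exact matches get full weight, so each training point predicts its own label
train_acc = knn.score(X_tr, y_tr)   # 1.0 by construction
test_acc = knn.score(X_te, y_te)
print(f"train={train_acc:.3f} test={test_acc:.3f} gap={train_acc - test_acc:.3f}")
```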

Detailed Findings

  • Logistic Regression serves as a decent baseline but struggles to capture complex non-linear relationships, resulting in the lowest accuracy.
  • KNN performed well but is prone to overfitting this dataset, making it less reliable for unseen data without further tuning or dimensionality reduction.
  • SVM outperformed the others, achieving the highest accuracy and ROC-AUC score. It effectively separated the classes with a robust margin.

Conclusion and Recommendation

Based on the comprehensive evaluation, Support Vector Machine (SVM) is the recommended model for deployment.

  • Highest Accuracy: 91.05%
  • Robustness: Low variance between training and testing scores.
  • Reliability: High F1-scores for both churned and retained classes ensure balanced performance.

How to Run

  1. Ensure you have the required libraries installed: pandas, numpy, matplotlib, seaborn, scikit-learn, and imbalanced-learn (for SMOTE).
  2. Open the Jupyter Notebook Customer_Churn_Prediction.ipynb.
  3. Run all cells to reproduce the analysis, training, and evaluation steps.

About

An end-to-end machine learning pipeline for customer churn prediction that combines Logistic Regression, KNN, and SVM with SMOTE class balancing and GridSearchCV tuning, reaching 91% test accuracy and delivering reliable, business-ready churn insights.
