This project focuses on predicting customer churn for a subscription-based service using machine learning. By analyzing customer demographics, financial information, and behavioral metrics, we aim to identify customers likely to discontinue the service. Early identification allows businesses to take proactive retention measures.
The dataset contains 505,207 records with 12 features. It includes a mix of categorical and numerical data.
- Source: Kaggle - Customer Churn Dataset
- Target Variable: `Churn` (1 = Churned, 0 = Retained)
- Demographics: Age, Gender
- Account Info: Tenure, Subscription Type, Contract Length
- Behavioral: Usage Frequency, Support Calls, Last Interaction
- Financial: Payment Delay, Total Spend
- Data Cleaning: Removed duplicate records and handled missing values using median imputation for numerical features and mode imputation for categorical ones.
- Feature Engineering:
- Normalization: Applied Z-score standardization (StandardScaler) to numerical features so that each has zero mean and unit variance, putting features with very different magnitudes on a comparable scale.
- Encoding: Used One-Hot Encoding for categorical variables (`Subscription Type`, `Contract Length`).
- Feature Selection: Recursive Feature Elimination (RFE) was used to identify the most predictive features. Weak predictors such as `Usage Frequency` and `Tenure` (in some contexts) were flagged for removal to simplify the model.
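The preprocessing steps above can be sketched as follows. This is a minimal illustration on a toy frame (column names taken from the dataset description; the exact pipeline and the RFE estimator are assumptions, not the project's code):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the real dataset (column names from the description).
df = pd.DataFrame({
    "Age": [30, 45, np.nan, 52],
    "Total Spend": [120.0, 80.0, 200.0, np.nan],
    "Subscription Type": ["Basic", "Premium", np.nan, "Basic"],
})
num_cols = ["Age", "Total Spend"]
cat_cols = ["Subscription Type"]

# Median imputation for numerical features, mode for categorical ones.
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])

# Z-score standardization, then one-hot encoding.
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
df = pd.get_dummies(df, columns=cat_cols)

# RFE ranks features by recursively dropping the weakest one;
# the base estimator here is an illustrative choice.
y = [0, 1, 0, 1]
selector = RFE(LogisticRegression(), n_features_to_select=2).fit(df, y)
print(dict(zip(df.columns, selector.support_)))
```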
To ensure efficient training and fair evaluation:
- Stratified Subsampling: A subset of 50,000 records was drawn first, with stratification preserving the original class proportions while keeping training efficient.
- SMOTE (Synthetic Minority Over-sampling Technique): Applied to the training data to perfectly balance the class distribution (50/50 split between Churn and Non-Churn), ensuring models don't prioritize the majority class.
Three supervised learning models were trained and tuned using GridSearchCV with 5-fold cross-validation:
- Logistic Regression:
  - Optimized parameters: `C=0.01`, `penalty='l1'`, `solver='saga'`.
- K-Nearest Neighbors (KNN):
  - Optimized parameters: `n_neighbors=7`, `metric='euclidean'`, `weights='distance'`.
- Support Vector Machine (SVM):
  - Optimized parameters: `C=10`, `gamma='scale'`, `kernel='rbf'`.
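The tuning procedure can be sketched for one model. The grid below is a hypothetical search space around the parameters reported for Logistic Regression; the project's actual grids are not shown in this section:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training split.
X, y = make_classification(n_samples=500, random_state=0)

# GridSearchCV tries every combination with 5-fold cross-validation
# and refits the best one on the full training set.
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l1", "l2"],
    "solver": ["saga"],  # saga supports both l1 and l2 penalties
}
search = GridSearchCV(LogisticRegression(max_iter=5000), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```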
The models were evaluated on a held-out test set. Below is the summary of performance:
| Model | Accuracy | Precision (0/1) | Recall (0/1) | F1-Score (0/1) | ROC-AUC |
|---|---|---|---|---|---|
| Logistic Regression | 84.77% | 0.83 / 0.87 | 0.88 / 0.82 | 0.85 / 0.84 | 0.848 |
| K-Nearest Neighbors | 89.48% | 0.89 / 0.90 | 0.90 / 0.89 | 0.90 / 0.89 | 0.895 |
| Support Vector Machine | 91.05% | 0.95 / 0.88 | 0.87 / 0.95 | 0.91 / 0.91 | 0.910 |
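The metrics in the table are the standard scikit-learn ones; a minimal sketch of how they are computed on tiny illustrative labels (not the project's actual predictions):

```python
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

# Illustrative ground truth and model scores; 0.5 is the decision threshold.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.6, 0.8, 0.4, 0.3, 0.7]
y_pred = [int(s >= 0.5) for s in y_score]

print(accuracy_score(y_true, y_pred))         # overall accuracy
print(classification_report(y_true, y_pred))  # per-class precision/recall/F1
print(roc_auc_score(y_true, y_score))         # threshold-free ranking quality
```

Note that ROC-AUC is computed from the continuous scores, while accuracy, precision, recall, and F1 depend on the chosen threshold.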
- Logistic Regression: Train (84.53%) vs Test (84.77%) - Good generalization, slight underfitting.
- KNN: Train (100.00%) vs Test (89.48%) - Significant overfitting. The model memorized the training data.
- SVM: Train (91.47%) vs Test (91.05%) - Excellent generalization. Minimal gap (0.42%) indicates a robust model.
- Logistic Regression serves as a decent baseline but struggles to capture complex non-linear relationships, resulting in the lowest accuracy.
- KNN performed well but is prone to overfitting this dataset, making it less reliable for unseen data without further tuning or dimensionality reduction.
- SVM outperformed the others, achieving the highest accuracy and ROC-AUC score. It effectively separated the classes with a robust margin.
Based on the comprehensive evaluation, Support Vector Machine (SVM) is the recommended model for deployment.
- Highest Accuracy: 91.05%
- Robustness: Low variance between training and testing scores.
- Reliability: High F1-scores for both churned and retained classes ensure balanced performance.
- Ensure you have the required libraries installed: `pandas`, `numpy`, `matplotlib`, `seaborn`, `scikit-learn`.
- Open the Jupyter Notebook `Customer_Churn_Prediction.ipynb`.
- Run all cells to reproduce the analysis, training, and evaluation steps.
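A typical setup might look like the following (the exact commands are an assumption; `imbalanced-learn` is included on the assumption that the SMOTE step requires it):

```shell
# Install the dependencies listed above, plus imbalanced-learn for SMOTE.
pip install pandas numpy matplotlib seaborn scikit-learn imbalanced-learn

# Launch the notebook and run all cells.
jupyter notebook Customer_Churn_Prediction.ipynb
```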