This project analyzes customer data from a bank to predict which customers are likely to leave (churn). Customer churn is a critical business metric, as retaining existing customers is typically more cost-effective than acquiring new ones. By identifying customers at risk of leaving, banks can take proactive steps to improve retention.
The analysis uses the "Churn_Modelling.csv" dataset, publicly available on Kaggle, which contains records for 10,000 bank customers with the following features:
- Basic customer information (Customer ID, Surname)
- Demographics (Credit Score, Geography, Gender, Age)
- Banking relationship (Tenure, Balance, Number of Products, Credit Card ownership, Active membership status)
- Financial data (Estimated Salary)
- Target variable: "Exited" (1 = customer left the bank, 0 = customer stayed)
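
As a minimal sketch of loading the dataset and confirming that the features listed above are present, something like the following could be used. The exact column names (`CustomerId`, `NumOfProducts`, `HasCrCard`, and so on) are assumptions inferred from the feature list and should be checked against the actual file; `load_churn_data` is a hypothetical helper, and the two sample rows are made up so the example runs without the real CSV.

```python
import io

import pandas as pd

# Assumed column names, inferred from the feature list; verify against the real file.
EXPECTED_COLUMNS = {
    "CustomerId", "Surname", "CreditScore", "Geography", "Gender", "Age",
    "Tenure", "Balance", "NumOfProducts", "HasCrCard", "IsActiveMember",
    "EstimatedSalary", "Exited",
}


def load_churn_data(source):
    """Read the CSV from a path or file-like object and check the expected columns."""
    df = pd.read_csv(source)
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing expected columns: {sorted(missing)}")
    return df


# Made-up two-row sample standing in for the real file.
sample_csv = io.StringIO(
    "CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,"
    "NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited\n"
    "1,Smith,600,France,Female,40,3,0.0,1,1,1,50000.0,0\n"
    "2,Jones,700,Spain,Male,55,5,80000.0,2,0,0,60000.0,1\n"
)
df = load_churn_data(sample_csv)
print(df.shape)
```

With the real data, the call would instead be `load_churn_data("Churn_Modelling.csv")`.
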
- Python: Primary programming language for analysis
- Jupyter Notebook: Interactive development environment for code execution and documentation
- Pandas: Data manipulation and analysis library
- NumPy: Numerical computing library for mathematical operations
- Matplotlib & Seaborn: Data visualization libraries for creating plots and charts
- Scikit-learn: Machine learning library providing tools for predictive modeling
  - RandomForestClassifier
  - LogisticRegression
  - SVC (Support Vector Classification)
  - KNeighborsClassifier
  - GradientBoostingClassifier
  - StandardScaler for feature standardization
  - Classification metrics (confusion_matrix, classification_report, accuracy_score)
  - LabelEncoder: Tool for converting categorical data to numerical format
- Checked for missing values (none found)
- Verified no duplicate records exist
- Examined data types and structure
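
The checks above can be sketched with a few pandas calls. `summarize_quality` is a hypothetical helper name, and the toy DataFrame is invented to show what a non-clean result looks like; on the real dataset the counts of missing values and duplicates were reported as zero.

```python
import pandas as pd


def summarize_quality(df):
    """Count missing values and duplicate rows, and list each column's dtype."""
    return {
        "missing_values": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


# Invented toy data: one missing Balance and one fully duplicated row.
toy = pd.DataFrame(
    {
        "Age": [40, 55, 55, 30],
        "Balance": [0.0, 100.0, 100.0, None],
        "Geography": ["France", "Spain", "Spain", "Germany"],
    }
)
report = summarize_quality(toy)
print(report["missing_values"], report["duplicate_rows"])
```
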
The analysis included several visualizations to understand patterns in customer churn:
- Overall Churn Distribution: Visual representation of how many customers stayed vs. left
- Churn by Geography: Comparison of churn rates across different countries
- Churn by Gender: Analysis of whether gender impacts likelihood to leave
- Age Distribution by Churn: Visualization showing how age relates to churn behavior
- Correlation Heatmap: Identification of relationships between numeric variables
- Financial Analysis: Examination of how balance and salary relate to churn
- Credit Score Analysis: Comparison of credit scores between churned and retained customers
These visualizations highlight patterns that may help predict customer churn.
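
Several of the comparisons above (by geography, by gender) reduce to a grouped churn rate. The sketch below computes that rate with pandas; `churn_rate_by` is a hypothetical helper, the `Exited` and `Geography` column names are assumed from the feature list, and the toy values are invented rather than taken from the dataset.

```python
import pandas as pd


def churn_rate_by(df, column, target="Exited"):
    """Return the share of churned customers per category, highest first."""
    return df.groupby(column)[target].mean().sort_values(ascending=False)


# Invented toy data, not results from the real dataset.
toy = pd.DataFrame(
    {
        "Geography": ["France", "France", "Germany", "Germany", "Spain"],
        "Exited": [0, 1, 1, 1, 0],
    }
)
rates = churn_rate_by(toy, "Geography")
print(rates)
```

The resulting Series can be passed to `rates.plot.bar()` (which requires Matplotlib) to produce a bar chart similar to the ones described.
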
Before building predictive models, the data was prepared through:
- Converting categorical variables (Gender, Geography) to numerical format
- Feature selection to choose relevant predictors
- Splitting data into training (80%) and testing (20%) sets
- Standardizing features to ensure all variables are on a similar scale
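
The preparation steps above might look roughly like the following. This is a sketch under assumptions: the column names, the dropped identifier columns, and the random seed are illustrative, and `prepare_data` is a hypothetical helper. The toy data is randomly generated so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler


def prepare_data(df, target="Exited", drop_cols=("CustomerId", "Surname"), seed=42):
    """Encode categoricals, split 80/20, and standardize the features."""
    data = df.drop(columns=[c for c in drop_cols if c in df.columns]).copy()
    # Convert the categorical columns to integer codes.
    for col in ("Gender", "Geography"):
        data[col] = LabelEncoder().fit_transform(data[col])
    X = data.drop(columns=[target])
    y = data[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # Fit the scaler on the training set only, then apply it to both sets.
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    return X_train_scaled, X_test_scaled, y_train, y_test


# Randomly generated toy data standing in for the real dataset.
rng = np.random.default_rng(0)
n = 100
toy = pd.DataFrame(
    {
        "CustomerId": np.arange(n),
        "Surname": ["x"] * n,
        "Gender": rng.choice(["Female", "Male"], n),
        "Geography": rng.choice(["France", "Germany", "Spain"], n),
        "Age": rng.integers(18, 80, n),
        "Balance": rng.uniform(0, 200000, n),
        "Exited": rng.integers(0, 2, n),
    }
)
X_train, X_test, y_train, y_test = prepare_data(toy)
print(X_train.shape, X_test.shape)
```

Fitting the scaler on the training split alone keeps information from the test set out of the preprocessing step.
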
Several machine learning models were trained and compared:
- Random Forest Classifier (87% accuracy)
  - Strong overall performer
  - Good balance of precision and recall
- Logistic Regression (81% accuracy)
  - Simple interpretable model
  - Performed reasonably well but struggled with identifying churned customers
- Support Vector Machine (SVM) (80% accuracy)
  - Failed to identify any churned customers
  - Good for identifying non-churned customers but not useful as a complete solution
- K-Nearest Neighbors (KNN) (82% accuracy)
  - Moderate performance
  - Better than logistic regression at identifying churned customers
- Gradient Boosting (87% accuracy)
  - On par with Random Forest
  - Strong overall performance
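
A comparison of the five models listed above could be structured as a simple loop. The sketch below uses synthetic data from `make_classification` rather than the bank dataset, and default hyperparameters (apart from `max_iter` and `random_state`), both of which are assumptions; the accuracies it prints will not match the figures reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the prepared churn data.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Hyperparameters here are illustrative defaults, not the notebook's settings.
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

# Print the models from most to least accurate.
for name, acc in sorted(accuracies.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.2f}")
```
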
Each model was evaluated using:
- Confusion Matrix: Shows true positives, false positives, true negatives, and false negatives
- Classification Report: Details precision, recall, and F1-score for each class
- Accuracy Score: Overall percentage of correct predictions
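
To illustrate how these three metrics relate, the sketch below evaluates a small set of invented labels (1 = churned, 0 = stayed). It also derives per-class recall from the confusion matrix, which shows how a model can score well on accuracy while missing most churned customers.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Invented labels for illustration: 7 customers stayed, 3 churned.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
y_pred = [0, 0, 0, 1, 1, 0, 0, 0, 0, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true, y_pred)
recall_churned = tp / (tp + fn)  # share of actual churners that were caught
recall_stayed = tn / (tn + fp)   # share of actual stayers that were recognized

print(cm)
print(f"accuracy={accuracy:.2f}")
print(f"recall (churned)={recall_churned:.2f}, recall (stayed)={recall_stayed:.2f}")
print(classification_report(y_true, y_pred))
```
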
Additional features were created to potentially improve model performance:
- BalanceZero: Flag for customers with zero balance
- AgeGroup: Categorized age into meaningful groups
- BalanceToSalaryRatio: Ratio of balance to salary
- ProductUsage: Interaction between number of products and active membership
- TenureGroup: Categorized tenure into groups
- Gender-Geography Interactions: Combined gender with country information
Despite these additional features, model accuracy remained at about 87%, suggesting that the original features already captured most of the predictive information.
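
One possible way to construct the engineered features listed above is sketched here. The bin edges for `AgeGroup` and `TenureGroup`, the handling of zero salaries, the string form of the gender-geography interaction, and the source column names are all assumptions, since the exact definitions are not given; `add_engineered_features` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df):
    """Return a copy of df with the engineered columns added (assumed definitions)."""
    out = df.copy()
    out["BalanceZero"] = (out["Balance"] == 0).astype(int)
    # Assumed age bins; intervals are closed on the right, e.g. (30, 45].
    out["AgeGroup"] = pd.cut(
        out["Age"], bins=[0, 30, 45, 60, 120], labels=["<=30", "31-45", "46-60", "60+"]
    )
    # Replace zero salaries with NaN to avoid division by zero.
    out["BalanceToSalaryRatio"] = out["Balance"] / out["EstimatedSalary"].replace(0, np.nan)
    out["ProductUsage"] = out["NumOfProducts"] * out["IsActiveMember"]
    # Assumed tenure bins covering 0 to 10 years.
    out["TenureGroup"] = pd.cut(
        out["Tenure"], bins=[-1, 2, 5, 10], labels=["0-2", "3-5", "6-10"]
    )
    out["Gender_Geography"] = out["Gender"] + "_" + out["Geography"]
    return out


# Invented two-row example.
toy = pd.DataFrame(
    {
        "Age": [25, 50],
        "Balance": [0.0, 80000.0],
        "EstimatedSalary": [50000.0, 40000.0],
        "NumOfProducts": [1, 2],
        "IsActiveMember": [1, 0],
        "Tenure": [1, 7],
        "Gender": ["Female", "Male"],
        "Geography": ["France", "Germany"],
    }
)
features = add_engineered_features(toy)
print(features[["BalanceZero", "AgeGroup", "BalanceToSalaryRatio", "ProductUsage"]])
```

Categorical results such as `AgeGroup` and `Gender_Geography` would still need encoding before being passed to a model.
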
The Random Forest model identified the most important features for predicting churn:
- Age
- Balance
- Estimated Salary
- Geography
- Active membership status
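
A ranking like the one above can be read from a fitted model's `feature_importances_` attribute. The sketch below uses synthetic data in which the target depends only on `Age`, purely to show the mechanics; it does not reproduce the ranking reported for the real dataset, and the column names are assumed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features; the target is driven by Age alone for illustration.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame(
    {
        "Age": rng.integers(18, 80, n),
        "Balance": rng.uniform(0, 200000, n),
        "EstimatedSalary": rng.uniform(10000, 150000, n),
    }
)
y = (X["Age"] > 50).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances sum to 1 across all features.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```
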
- The Random Forest and Gradient Boosting models performed best with 87% accuracy
- Age appears to be a significant factor in customer churn
- The models identified customers who stay reliably (96% recall) but were less effective at identifying customers who leave (46-49% recall)
- Feature engineering did not significantly improve model performance
- Develop retention strategies focused on older customers
- Pay special attention to customers with certain balance profiles
- Consider geography-specific retention programs
- Implement programs to increase active membership status
- Focus retention efforts on the specific customer segments identified by the model
- Python 3.x
- Libraries: pandas, numpy, matplotlib, seaborn, scikit-learn
- Ensure all required libraries are installed
- Place the "Churn_Modelling.csv" file in the same directory as the notebook
- Run the Jupyter notebook cells in sequence