A comprehensive machine learning project for predicting customer churn using telecom customer data. This project implements a complete ML pipeline from data preprocessing to model deployment.
Customer churn prediction helps businesses identify customers who are likely to cancel their services before they leave. This project uses machine learning techniques to predict churn from customer attributes and usage patterns.
- Complete ML Pipeline: From raw data to trained model
- Advanced Feature Engineering: Including tenure grouping and categorical encoding
- Model Optimization: Hyperparameter tuning using RandomizedSearchCV
- Comprehensive Evaluation: Multiple metrics and visualizations
- Production Ready: Modular code structure with proper logging
The project uses a telecom customer churn dataset with the following features:
- Customer Demographics: Gender, SeniorCitizen, Partner, Dependents
- Account Information: Tenure, Contract, PaperlessBilling, PaymentMethod
- Services: PhoneService, MultipleLines, InternetService, OnlineSecurity, etc.
- Charges: MonthlyCharges, TotalCharges
- Target: Churn (Yes/No)
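As a quick orientation, here is a minimal sketch for loading the raw file and checking the class balance (the path matches the installation step below; the Churn column name follows the list above):

```python
import pandas as pd

# Load the raw dataset from the location used in the installation steps below.
# Reading a legacy .xls file requires the optional xlrd engine (pip install xlrd).
df = pd.read_excel("data/00_raw/telecom_customer_churn.xls")

# Sanity-check the expected schema and the Yes/No class balance
print(df.shape)
print(df["Churn"].value_counts(normalize=True))
```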
churn-prediction/
├── config/
│   ├── config.py              # Configuration settings
│   └── params.yaml            # Parameter file
├── data/
│   ├── 00_raw/                # Raw data
│   ├── 01_interim/            # Intermediate processed data
│   ├── 02_processed/          # Final processed data
│   └── 03_predictions/        # Model predictions
├── models/                    # Trained models and artifacts
├── notebooks/                 # Jupyter notebooks for exploration
├── reports/
│   └── figures/               # Generated plots and visualizations
├── src/
│   ├── data/
│   │   └── make_dataset.py    # Data loading and cleaning
│   ├── features/
│   │   └── build_features.py  # Feature engineering
│   ├── models/
│   │   ├── train_model.py     # Model training
│   │   └── predict_model.py   # Model prediction
│   └── visualization/
│       └── visualize.py       # Visualization functions
├── tests/                     # Unit tests
├── main.py                    # Main pipeline script
├── requirements.txt           # Python dependencies
└── README.md                  # This file
# Clone the repository
git clone https://github.com/metedogan/churn-prediction.git
cd churn-prediction
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

Place your raw data file in data/00_raw/telecom_customer_churn.xls
# Run the complete pipeline
python main.py
# Or run specific steps
python main.py --step data # Data processing only
python main.py --step train # Training only
python main.py --step predict # Prediction only
python main.py --step visualize # Visualization only

- Load raw customer data
- Handle missing values in TotalCharges
- Remove duplicates
- Clean column names
- Transform target variable to binary
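A condensed sketch of these steps (the actual implementation lives in src/data/make_dataset.py; column names follow the dataset description above, and dropping rows with missing TotalCharges is just one option — the project may impute instead):

```python
import pandas as pd

def clean_raw_data(df: pd.DataFrame) -> pd.DataFrame:
    # Remove exact duplicate rows
    df = df.drop_duplicates()
    # Clean column names (strip stray whitespace)
    df.columns = df.columns.str.strip()
    # Blank TotalCharges entries load as text; coerce to numeric and drop the gaps
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
    df = df.dropna(subset=["TotalCharges"])
    # Map the Yes/No target to binary
    df["Churn"] = (df["Churn"] == "Yes").astype(int)
    return df
```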
- Create tenure groups from numerical tenure
- Split data into train/test sets
- Apply StandardScaler to numerical features
- Apply OneHotEncoder to categorical features
- Save preprocessor for future use
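A minimal sketch of this stage; test_size and random_state mirror MODEL_CONFIG below, while the tenure bin edges, labels, and exact column names are illustrative:

```python
import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Bin tenure (in months) into groups; bin edges are illustrative
df["tenure_group"] = pd.cut(
    df["tenure"], bins=[0, 12, 24, 48, 72],
    labels=["0-1yr", "1-2yr", "2-4yr", "4-6yr"],
)

X, y = df.drop(columns=["Churn"]), df["Churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y,
)

numeric_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
categorical_cols = [c for c in X.columns if c not in numeric_cols]

# Scale numerical features, one-hot encode categorical ones
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
X_train_t = preprocessor.fit_transform(X_train)
X_test_t = preprocessor.transform(X_test)

# Persist the fitted preprocessor for reuse at prediction time
joblib.dump(preprocessor, "data/02_processed/preprocessor.pkl")
```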
- Train baseline Gradient Boosting model
- Perform hyperparameter tuning using RandomizedSearchCV
- Evaluate model performance
- Save best model and evaluation results
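A sketch of the tuning step; n_iter and cv mirror MODEL_CONFIG below, while the search space and scoring metric are illustrative (the project's real parameters may live in config/params.yaml):

```python
import joblib
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions={
        "n_estimators": randint(100, 500),
        "learning_rate": uniform(0.01, 0.2),
        "max_depth": randint(2, 6),
    },
    n_iter=50,       # matches hyperparameter_search_iterations
    cv=5,            # matches cv_folds
    scoring="f1",    # illustrative choice of metric
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train_t, y_train)

# Persist the best estimator under the path used by the predictor below
joblib.dump(search.best_estimator_, "models/final_gradient_boosting_model.pkl")
```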
- Load trained model and preprocessor
- Make predictions on new data
- Generate prediction probabilities
- Export results
- Feature importance plots
- Confusion matrix
- Churn distribution analysis
- Feature vs churn relationships
- Model performance comparisons
The project uses a centralized configuration system in config/config.py:
# Model configuration
MODEL_CONFIG = {
'random_state': 42,
'test_size': 0.2,
'cv_folds': 5,
'hyperparameter_search_iterations': 50,
'target_column': 'Churn'
}

The final Gradient Boosting model achieves:
- Accuracy: ~80%
- F1-Score: ~79%
- Precision: ~67% (for churn class)
- Recall: ~54% (for churn class)
Key insights from feature importance:
- Monthly Charges: Most important predictor
- Total Charges: Strong indicator of customer value
- Tenure: Customer loyalty metric
- Contract Type: Month-to-month contracts show higher churn
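Since shap is among the project dependencies, per-feature contributions can also be inspected with a SHAP summary plot. A sketch, assuming the fitted model and transformed test matrix from the sketches above:

```python
import shap

# TreeExplainer supports scikit-learn gradient boosting models.
# SHAP expects a dense matrix, so densify the one-hot-encoded
# features if the preprocessor produced sparse output.
X_dense = X_test_t.toarray() if hasattr(X_test_t, "toarray") else X_test_t
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_dense)
shap.summary_plot(shap_values, X_dense,
                  feature_names=preprocessor.get_feature_names_out())
```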
from src.models.predict_model import ChurnPredictor
# Initialize predictor
predictor = ChurnPredictor(
model_path='models/final_gradient_boosting_model.pkl',
preprocessor_path='data/02_processed/preprocessor.pkl'
)
# Make predictions
predictions = predictor.predict(new_customer_data)
probabilities = predictor.predict_proba(new_customer_data)
# Get feature importance
importance_df = predictor.get_feature_importance()

from src.visualization.visualize import plot_feature_importance, plot_confusion_matrix
# Plot feature importance
plot_feature_importance(model, feature_names, top_n=15)
# Plot confusion matrix
plot_confusion_matrix(y_true, y_pred, labels=['No Churn', 'Churn'])

The project includes comprehensive tests covering all modules:
tests/
├── conftest.py                  # Shared fixtures and configuration
├── test_basic_functionality.py # End-to-end and integration tests
├── test_data.py                 # Data processing tests
├── test_features.py             # Feature engineering tests
├── test_models.py               # Model training and prediction tests
└── test_visualization.py        # Visualization tests
# Run all tests
pytest
# Run with verbose output
pytest -v
# Run specific test file
pytest tests/test_data.py
# Run tests with coverage
pytest --cov=src
# Skip slow tests (hyperparameter tuning, etc.)
pytest -m "not slow"
# Skip visualization tests (useful in headless environments)
pytest -m "not visualization"
# Run only integration tests
pytest -m integration

- Unit Tests: Test individual functions and methods
- Integration Tests: Test module interactions and end-to-end workflows
- Data Tests: Validate data processing and cleaning
- Model Tests: Verify model training, prediction, and evaluation
- Visualization Tests: Check plot generation (may be skipped in headless environments)
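For the custom markers to work without warnings, they need to be registered; a hypothetical conftest.py sketch (the project's actual conftest.py may register them differently):

```python
# conftest.py (sketch): register the custom markers used above so that
# pytest -m "not slow" and friends run cleanly.
def pytest_configure(config):
    config.addinivalue_line("markers", "slow: long-running tests (e.g. hyperparameter tuning)")
    config.addinivalue_line("markers", "visualization: tests that render plots")
    config.addinivalue_line("markers", "integration: end-to-end workflow tests")
```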
The project follows PEP 8 standards. Format code using:
black src/
flake8 src/

- Add feature engineering logic to src/features/build_features.py
- Update configuration in config/config.py
- Add tests in tests/
- Update documentation
- Try different algorithms (XGBoost, LightGBM, Neural Networks)
- Implement ensemble methods
- Add more sophisticated feature engineering
- Use advanced hyperparameter optimization (Optuna, Hyperopt)
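For example, xgboost is already in the dependency list and is a near drop-in replacement for GradientBoostingClassifier; a sketch with illustrative hyperparameters:

```python
from xgboost import XGBClassifier

# Swap in XGBoost for the baseline model (hyperparameters illustrative)
model = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=4,
    random_state=42,
    eval_metric="logloss",
)
model.fit(X_train_t, y_train)
```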
- Python 3.13+
- Core dependencies defined in pyproject.toml:
- pandas >= 2.3.2
- scikit-learn >= 1.7.2
- matplotlib >= 3.10.6
- seaborn >= 0.13.2
- xgboost >= 3.0.5
- lightgbm >= 4.6.0
- shap >= 0.48.0
Install with: pip install -e . or uv sync
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Dataset source: Telecom Customer Churn Dataset
- Inspired by industry best practices in MLOps
- Built using scikit-learn and pandas ecosystems
For questions or suggestions, please open an issue or contact the maintainers.
Note: This project is designed for educational and demonstration purposes. For production use, consider additional factors like data privacy, model monitoring, and deployment infrastructure.