A comprehensive machine learning project for predicting customer churn using telecom customer data. This project implements a complete ML pipeline from data preprocessing to model deployment.
Customer churn prediction helps businesses identify customers who are likely to cancel their services before they leave. This project uses machine learning techniques to predict churn from customer attributes and usage patterns.
- Complete ML Pipeline: From raw data to trained model
- Advanced Feature Engineering: Including tenure grouping and categorical encoding
- Model Optimization: Hyperparameter tuning using RandomizedSearchCV
- Comprehensive Evaluation: Multiple metrics and visualizations
- Production Ready: Modular code structure with proper logging
The project uses a telecom customer churn dataset with the following features:
- Customer Demographics: Gender, SeniorCitizen, Partner, Dependents
- Account Information: Tenure, Contract, PaperlessBilling, PaymentMethod
- Services: PhoneService, MultipleLines, InternetService, OnlineSecurity, etc.
- Charges: MonthlyCharges, TotalCharges
- Target: Churn (Yes/No)
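As a quick orientation, here is a minimal sketch for loading the raw file and checking the class balance (the path matches the installation step below; the Churn column name follows the list above):

```python
import pandas as pd

# Load the raw dataset from the location used in the installation steps below.
# Reading a legacy .xls file requires the optional xlrd engine (pip install xlrd).
df = pd.read_excel("data/00_raw/telecom_customer_churn.xls")

# Sanity-check the expected schema and the Yes/No class balance
print(df.shape)
print(df["Churn"].value_counts(normalize=True))
```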
churn-prediction/
├── config/
│   ├── config.py              # Configuration settings
│   └── params.yaml            # Parameter file
├── data/
│   ├── 00_raw/                # Raw data
│   ├── 01_interim/            # Intermediate processed data
│   ├── 02_processed/          # Final processed data
│   └── 03_predictions/        # Model predictions
├── models/                    # Trained models and artifacts
├── notebooks/                 # Jupyter notebooks for exploration
├── reports/
│   └── figures/               # Generated plots and visualizations
├── src/
│   ├── data/
│   │   └── make_dataset.py    # Data loading and cleaning
│   ├── features/
│   │   └── build_features.py  # Feature engineering
│   ├── models/
│   │   ├── train_model.py     # Model training
│   │   └── predict_model.py   # Model prediction
│   └── visualization/
│       └── visualize.py       # Visualization functions
├── tests/                     # Unit tests
├── main.py                    # Main pipeline script
├── requirements.txt           # Python dependencies
└── README.md                  # This file
# Clone the repository
git clone https://github.com/metedogan/churn-prediction.git
cd churn-prediction
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

Place your raw data file in data/00_raw/telecom_customer_churn.xls
# Run the complete pipeline
python main.py
# Or run specific steps
python main.py --step data # Data processing only
python main.py --step train # Training only
python main.py --step predict # Prediction only
python main.py --step visualize # Visualization only

- Load raw customer data
- Handle missing values in TotalCharges
- Remove duplicates
- Clean column names
- Transform target variable to binary
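A condensed sketch of these steps (the actual implementation lives in src/data/make_dataset.py; column names follow the dataset description above, and dropping rows with missing TotalCharges is just one option — the project may impute instead):

```python
import pandas as pd

def clean_raw_data(df: pd.DataFrame) -> pd.DataFrame:
    # Remove exact duplicate rows
    df = df.drop_duplicates()
    # Clean column names (strip stray whitespace)
    df.columns = df.columns.str.strip()
    # Blank TotalCharges entries load as text; coerce to numeric and drop the gaps
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
    df = df.dropna(subset=["TotalCharges"])
    # Map the Yes/No target to binary
    df["Churn"] = (df["Churn"] == "Yes").astype(int)
    return df
```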
- Create tenure groups from numerical tenure
- Split data into train/test sets
- Apply StandardScaler to numerical features
- Apply OneHotEncoder to categorical features
- Save preprocessor for future use
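A minimal sketch of this stage; test_size and random_state mirror MODEL_CONFIG below, while the tenure bin edges, labels, and exact column names are illustrative:

```python
import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Bin tenure (in months) into groups; bin edges are illustrative
df["tenure_group"] = pd.cut(
    df["tenure"], bins=[0, 12, 24, 48, 72],
    labels=["0-1yr", "1-2yr", "2-4yr", "4-6yr"],
)

X, y = df.drop(columns=["Churn"]), df["Churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y,
)

numeric_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
categorical_cols = [c for c in X.columns if c not in numeric_cols]

# Scale numerical features, one-hot encode categorical ones
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
X_train_t = preprocessor.fit_transform(X_train)
X_test_t = preprocessor.transform(X_test)

# Persist the fitted preprocessor for reuse at prediction time
joblib.dump(preprocessor, "data/02_processed/preprocessor.pkl")
```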
- Train baseline Gradient Boosting model
- Perform hyperparameter tuning using RandomizedSearchCV
- Evaluate model performance
- Save best model and evaluation results
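A sketch of the tuning step; n_iter and cv mirror MODEL_CONFIG below, while the search space and scoring metric are illustrative (the project's real parameters may live in config/params.yaml):

```python
import joblib
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions={
        "n_estimators": randint(100, 500),
        "learning_rate": uniform(0.01, 0.2),
        "max_depth": randint(2, 6),
    },
    n_iter=50,       # matches hyperparameter_search_iterations
    cv=5,            # matches cv_folds
    scoring="f1",    # illustrative choice of metric
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train_t, y_train)

# Persist the best estimator under the path used by the predictor below
joblib.dump(search.best_estimator_, "models/final_gradient_boosting_model.pkl")
```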
- Load trained model and preprocessor
- Make predictions on new data
- Generate prediction probabilities
- Export results
- Feature importance plots
- Confusion matrix
- Churn distribution analysis
- Feature vs churn relationships
- Model performance comparisons
The project uses a centralized configuration system in config/config.py:
# Model configuration
MODEL_CONFIG = {
'random_state': 42,
'test_size': 0.2,
'cv_folds': 5,
'hyperparameter_search_iterations': 50,
'target_column': 'Churn'
}

The final Gradient Boosting model achieves:
- Accuracy: ~80%
- F1-Score: ~79%
- Precision: ~67% (for churn class)
- Recall: ~54% (for churn class)
Key insights from feature importance:
- Monthly Charges: Most important predictor
- Total Charges: Strong indicator of customer value
- Tenure: Customer loyalty metric
- Contract Type: Month-to-month contracts show higher churn
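Since shap is among the project dependencies, per-feature contributions can also be inspected with a SHAP summary plot. A sketch, assuming the fitted model and transformed test matrix from the sketches above:

```python
import shap

# TreeExplainer supports scikit-learn gradient boosting models.
# SHAP expects a dense matrix, so densify the one-hot-encoded
# features if the preprocessor produced sparse output.
X_dense = X_test_t.toarray() if hasattr(X_test_t, "toarray") else X_test_t
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_dense)
shap.summary_plot(shap_values, X_dense,
                  feature_names=preprocessor.get_feature_names_out())
```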
from src.models.predict_model import ChurnPredictor
# Initialize predictor
predictor = ChurnPredictor(
model_path='models/final_gradient_boosting_model.pkl',
preprocessor_path='data/02_processed/preprocessor.pkl'
)
# Make predictions
predictions = predictor.predict(new_customer_data)
probabilities = predictor.predict_proba(new_customer_data)
# Get feature importance
importance_df = predictor.get_feature_importance()

from src.visualization.visualize import plot_feature_importance, plot_confusion_matrix
# Plot feature importance
plot_feature_importance(model, feature_names, top_n=15)
# Plot confusion matrix
plot_confusion_matrix(y_true, y_pred, labels=['No Churn', 'Churn'])

The project includes comprehensive tests covering all modules:
tests/
├── conftest.py                  # Shared fixtures and configuration
├── test_basic_functionality.py # End-to-end and integration tests
├── test_data.py                 # Data processing tests
├── test_features.py             # Feature engineering tests
├── test_models.py               # Model training and prediction tests
└── test_visualization.py        # Visualization tests
# Run all tests
pytest
# Run with verbose output
pytest -v
# Run specific test file
pytest tests/test_data.py
# Run tests with coverage
pytest --cov=src
# Skip slow tests (hyperparameter tuning, etc.)
pytest -m "not slow"
# Skip visualization tests (useful in headless environments)
pytest -m "not visualization"
# Run only integration tests
pytest -m integration

- Unit Tests: Test individual functions and methods
- Integration Tests: Test module interactions and end-to-end workflows
- Data Tests: Validate data processing and cleaning
- Model Tests: Verify model training, prediction, and evaluation
- Visualization Tests: Check plot generation (may be skipped in headless environments)
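For the custom markers to work without warnings, they need to be registered; a hypothetical conftest.py sketch (the project's actual conftest.py may register them differently):

```python
# conftest.py (sketch): register the custom markers used above so that
# pytest -m "not slow" and friends run cleanly.
def pytest_configure(config):
    config.addinivalue_line("markers", "slow: long-running tests (e.g. hyperparameter tuning)")
    config.addinivalue_line("markers", "visualization: tests that render plots")
    config.addinivalue_line("markers", "integration: end-to-end workflow tests")
```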
The project follows PEP 8 standards. Format code using:
black src/
flake8 src/

- Add feature engineering logic to src/features/build_features.py
- Update configuration in config/config.py
- Add tests in tests/
- Update documentation
- Try different algorithms (XGBoost, LightGBM, Neural Networks)
- Implement ensemble methods
- Add more sophisticated feature engineering
- Use advanced hyperparameter optimization (Optuna, Hyperopt)
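For example, xgboost is already in the dependency list and is a near drop-in replacement for GradientBoostingClassifier; a sketch with illustrative hyperparameters:

```python
from xgboost import XGBClassifier

# Swap in XGBoost for the baseline model (hyperparameters illustrative)
model = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=4,
    random_state=42,
    eval_metric="logloss",
)
model.fit(X_train_t, y_train)
```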
- Python 3.13+
- Core dependencies defined in pyproject.toml:
- pandas >= 2.3.2
- scikit-learn >= 1.7.2
- matplotlib >= 3.10.6
- seaborn >= 0.13.2
- xgboost >= 3.0.5
- lightgbm >= 4.6.0
- shap >= 0.48.0
Install with: pip install -e . or uv sync
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Dataset source: Telecom Customer Churn Dataset
- Inspired by industry best practices in MLOps
- Built using scikit-learn and pandas ecosystems
For questions or suggestions, please open an issue or contact the maintainers.
Note: This project is designed for educational and demonstration purposes. For production use, consider additional factors like data privacy, model monitoring, and deployment infrastructure.