
Customer Churn Prediction

A comprehensive machine learning project for predicting customer churn using telecom customer data. This project implements a complete ML pipeline from data preprocessing to model deployment.

🎯 Project Overview

Churn prediction helps a business identify customers who are likely to cancel their service, so it can intervene before they leave. This project uses machine learning to predict churn from customer attributes and usage patterns in telecom data.

Key Features

  • Complete ML Pipeline: From raw data to trained model
  • Advanced Feature Engineering: Including tenure grouping and categorical encoding
  • Model Optimization: Hyperparameter tuning using RandomizedSearchCV
  • Comprehensive Evaluation: Multiple metrics and visualizations
  • Production Ready: Modular code structure with proper logging

πŸ“Š Dataset

The project uses a telecom customer churn dataset with the following features:

  • Customer Demographics: Gender, SeniorCitizen, Partner, Dependents
  • Account Information: Tenure, Contract, PaperlessBilling, PaymentMethod
  • Services: PhoneService, MultipleLines, InternetService, OnlineSecurity, etc.
  • Charges: MonthlyCharges, TotalCharges
  • Target: Churn (Yes/No)

πŸ—οΈ Project Structure

churn-prediction/
β”œβ”€β”€ config/
β”‚   β”œβ”€β”€ config.py              # Configuration settings
β”‚   └── params.yaml            # Parameter file
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ 00_raw/               # Raw data
β”‚   β”œβ”€β”€ 01_interim/           # Intermediate processed data
β”‚   β”œβ”€β”€ 02_processed/         # Final processed data
β”‚   └── 03_predictions/       # Model predictions
β”œβ”€β”€ models/                   # Trained models and artifacts
β”œβ”€β”€ notebooks/               # Jupyter notebooks for exploration
β”œβ”€β”€ reports/
β”‚   └── figures/             # Generated plots and visualizations
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ data/
β”‚   β”‚   └── make_dataset.py   # Data loading and cleaning
β”‚   β”œβ”€β”€ features/
β”‚   β”‚   └── build_features.py # Feature engineering
β”‚   β”œβ”€β”€ models/
β”‚   β”‚   β”œβ”€β”€ train_model.py    # Model training
β”‚   β”‚   └── predict_model.py  # Model prediction
β”‚   └── visualization/
β”‚       └── visualize.py      # Visualization functions
β”œβ”€β”€ tests/                   # Unit tests
β”œβ”€β”€ main.py                  # Main pipeline script
β”œβ”€β”€ requirements.txt         # Python dependencies
└── README.md               # This file

πŸš€ Quick Start

1. Installation

# Clone the repository
git clone https://github.com/metedogan/churn-prediction.git
cd churn-prediction

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. Data Setup

Place your raw data file at data/00_raw/telecom_customer_churn.xls.

3. Run the Pipeline

# Run the complete pipeline
python main.py

# Or run specific steps
python main.py --step data      # Data processing only
python main.py --step train     # Training only
python main.py --step predict   # Prediction only
python main.py --step visualize # Visualization only
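
Under the hood, the --step flag maps each name to the corresponding module. A minimal sketch of such a dispatcher (the real main.py may be organized differently):

import argparse

def main():
    parser = argparse.ArgumentParser(description="Customer churn prediction pipeline")
    parser.add_argument(
        "--step",
        choices=["data", "train", "predict", "visualize"],
        default=None,
        help="Run a single step; omit to run the full pipeline",
    )
    args = parser.parse_args()

    steps = [args.step] if args.step else ["data", "train", "predict", "visualize"]
    for step in steps:
        print(f"Running step: {step}")
        # each branch would call into src/data, src/features, src/models, src/visualization

if __name__ == "__main__":
    main()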

πŸ“ˆ Pipeline Steps

1. Data Processing (src/data/make_dataset.py)

  • Load raw customer data
  • Handle missing values in TotalCharges
  • Remove duplicates
  • Clean column names
  • Transform target variable to binary
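
As a rough illustration, the cleaning boils down to something like the sketch below; the helper name and the median imputation for TotalCharges are assumptions, not necessarily what make_dataset.py does:

import pandas as pd

def clean_raw_data(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Blank strings make pandas read TotalCharges as text; coerce to numeric
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
    # Impute the resulting NaNs (median fill is an assumption here)
    df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())
    df = df.drop_duplicates()
    df.columns = df.columns.str.strip()
    # Transform the target variable to binary
    df["Churn"] = df["Churn"].map({"Yes": 1, "No": 0})
    return df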

2. Feature Engineering (src/features/build_features.py)

  • Create tenure groups from numerical tenure
  • Split data into train/test sets
  • Apply StandardScaler to numerical features
  • Apply OneHotEncoder to categorical features
  • Save preprocessor for future use
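
A condensed sketch of this step, with illustrative column lists (the actual build_features.py derives them from the data):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column lists for illustration only
numeric_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
categorical_cols = ["Contract", "PaymentMethod", "InternetService", "tenure_group"]

def add_tenure_group(df: pd.DataFrame) -> pd.DataFrame:
    # Bin numeric tenure (months) into coarse loyalty bands
    df = df.copy()
    df["tenure_group"] = pd.cut(
        df["tenure"],
        bins=[0, 12, 24, 48, 72],
        labels=["0-12", "13-24", "25-48", "49-72"],
        include_lowest=True,
    )
    return df

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ]
)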

3. Model Training (src/models/train_model.py)

  • Train baseline Gradient Boosting model
  • Perform hyperparameter tuning using RandomizedSearchCV
  • Evaluate model performance
  • Save best model and evaluation results
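
In sketch form, the tuning step looks roughly like this; the search space shown is illustrative, with the real ranges expected to live in config/params.yaml:

from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space, not the repo's actual ranges
param_distributions = {
    "n_estimators": randint(100, 500),
    "learning_rate": uniform(0.01, 0.2),
    "max_depth": randint(2, 6),
    "subsample": uniform(0.6, 0.4),
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=50,      # matches hyperparameter_search_iterations
    cv=5,           # matches cv_folds
    scoring="f1",
    n_jobs=-1,
    random_state=42,
)
# search.fit(X_train, y_train); the best model ends up in search.best_estimator_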

4. Prediction (src/models/predict_model.py)

  • Load trained model and preprocessor
  • Make predictions on new data
  • Generate prediction probabilities
  • Export results

5. Visualization (src/visualization/visualize.py)

  • Feature importance plots
  • Confusion matrix
  • Churn distribution analysis
  • Feature vs churn relationships
  • Model performance comparisons

πŸ”§ Configuration

The project uses a centralized configuration system in config/config.py:

# Model configuration
MODEL_CONFIG = {
    'random_state': 42,
    'test_size': 0.2,
    'cv_folds': 5,
    'hyperparameter_search_iterations': 50,
    'target_column': 'Churn'
}
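
Downstream modules read these values rather than hard-coding them, which keeps the split and the search reproducible. A minimal sketch of how the split might consume the config (stratifying on the target is an assumption):

from sklearn.model_selection import train_test_split
from config.config import MODEL_CONFIG

def split_data(X, y):
    # Stratification keeps the churn ratio consistent across train and test
    return train_test_split(
        X, y,
        test_size=MODEL_CONFIG["test_size"],
        random_state=MODEL_CONFIG["random_state"],
        stratify=y,
    )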

πŸ“Š Model Performance

The final Gradient Boosting model achieves:

  • Accuracy: ~80%
  • F1-Score: ~79%
  • Precision: ~67% (for churn class)
  • Recall: ~54% (for churn class)

Key insights from feature importance:

  1. Monthly Charges: Most important predictor
  2. Total Charges: Strong indicator of customer value
  3. Tenure: Customer loyalty metric
  4. Contract Type: Month-to-month contracts show higher churn

🎯 Usage Examples

Making Predictions on New Data

from src.models.predict_model import ChurnPredictor

# Initialize predictor
predictor = ChurnPredictor(
    model_path='models/final_gradient_boosting_model.pkl',
    preprocessor_path='data/02_processed/preprocessor.pkl'
)

# Make predictions on a DataFrame with the same columns as the training data
predictions = predictor.predict(new_customer_data)
probabilities = predictor.predict_proba(new_customer_data)

# Get feature importance
importance_df = predictor.get_feature_importance()

Creating Visualizations

from src.visualization.visualize import plot_feature_importance, plot_confusion_matrix

# Plot feature importance (model and feature_names come from the training step)
plot_feature_importance(model, feature_names, top_n=15)

# Plot confusion matrix from held-out labels and predictions
plot_confusion_matrix(y_true, y_pred, labels=['No Churn', 'Churn'])

πŸ§ͺ Testing

The project includes comprehensive tests covering all modules:

Test Structure

tests/
β”œβ”€β”€ conftest.py              # Shared fixtures and configuration
β”œβ”€β”€ test_basic_functionality.py  # End-to-end and integration tests
β”œβ”€β”€ test_data.py             # Data processing tests
β”œβ”€β”€ test_features.py         # Feature engineering tests
β”œβ”€β”€ test_models.py           # Model training and prediction tests
└── test_visualization.py    # Visualization tests

Running Tests

# Run all tests
pytest

# Run with verbose output
pytest -v

# Run specific test file
pytest tests/test_data.py

# Run tests with coverage
pytest --cov=src

# Skip slow tests (hyperparameter tuning, etc.)
pytest -m "not slow"

# Skip visualization tests (useful in headless environments)
pytest -m "not visualization"

# Run only integration tests
pytest -m integration

Test Categories

  • Unit Tests: Test individual functions and methods
  • Integration Tests: Test module interactions and end-to-end workflows
  • Data Tests: Validate data processing and cleaning
  • Model Tests: Verify model training, prediction, and evaluation
  • Visualization Tests: Check plot generation (may be skipped in headless environments)
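
As a sketch, the marker-based filtering above relies on tests being decorated like the placeholders below; the marker names match the -m filters and would need to be registered in the pytest configuration:

import pytest

# Illustrative placeholders only
@pytest.mark.slow
def test_randomized_search_improves_f1():
    pytest.skip("placeholder - real tests live in tests/test_models.py")

@pytest.mark.visualization
def test_feature_importance_plot_renders():
    pytest.skip("placeholder - real tests live in tests/test_visualization.py")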

πŸ“ Development

Code Style

The project follows PEP 8. Format code with black and lint with flake8:

black src/
flake8 src/

Adding New Features

  1. Add feature engineering logic to src/features/build_features.py
  2. Update configuration in config/config.py
  3. Add tests in tests/
  4. Update documentation
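
For example, a hypothetical new feature for step 1 might look like this (the column names come from the dataset description; the helper itself is illustrative):

import numpy as np
import pandas as pd

def add_charge_ratio(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical feature: total-to-monthly charge ratio as a rough
    # "months billed" signal; guard against division by zero
    df = df.copy()
    df["charge_ratio"] = df["TotalCharges"] / df["MonthlyCharges"].replace(0, np.nan)
    return df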

Model Improvements

  • Try different algorithms (XGBoost, LightGBM, Neural Networks)
  • Implement ensemble methods
  • Add more sophisticated feature engineering
  • Use advanced hyperparameter optimization (Optuna, Hyperopt)
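
Since xgboost is already a declared dependency, swapping the estimator is mostly a drop-in change. An illustrative sketch with placeholder hyperparameters:

from xgboost import XGBClassifier

# Illustrative starting point, not tuned values
model = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=4,
    eval_metric="logloss",
    random_state=42,
)
# model.fit(X_train, y_train) plugs into the same evaluation code as before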

πŸ“‹ Requirements

  • Python 3.13+
  • Core dependencies defined in pyproject.toml:
    • pandas >= 2.3.2
    • scikit-learn >= 1.7.2
    • matplotlib >= 3.10.6
    • seaborn >= 0.13.2
    • xgboost >= 3.0.5
    • lightgbm >= 4.6.0
    • shap >= 0.48.0

Install with: pip install -e . or uv sync

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • Dataset source: Telecom Customer Churn Dataset
  • Inspired by industry best practices in MLOps
  • Built using scikit-learn and pandas ecosystems

πŸ“ž Contact

For questions or suggestions, please open an issue or contact the maintainers.


Note: This project is designed for educational and demonstration purposes. For production use, consider additional factors like data privacy, model monitoring, and deployment infrastructure.
