# Binary classification for loan default risk
This project implements a machine learning pipeline that predicts whether loan applicants will have payment difficulties. The solution supports credit risk assessment using application and loan data.
## Table of Contents

- Problem Statement
- Dataset Description
- Modeling Approach
- Evaluation Metrics
- Results
- How to Run
- Architecture Diagram
- Project Structure
- Technologies Used
## Problem Statement

The task is a binary classification problem: predict whether a person applying for home credit will be able to repay their debt.
- Target 1: Client will have payment difficulties (late payment of more than X days on at least one of the first Y installments of the loan in the sample).
- Target 0: Client will repay without such difficulties.
For each applicant, the model outputs the probability that the loan will not be repaid. The goal is to support lending decisions and default risk management.
## Dataset Description

The dataset is based on the Home Credit Default Risk Kaggle competition. This project uses the primary application files only:
- `application_train_aai.csv`: Training data with labels.
- `application_test_aai.csv`: Test data (no labels); used for final predictions.
- Task: Binary classification.
- Features: Applicant demographics, financial information, employment details, and housing/family information.
- Class distribution: Imbalanced (typical for credit risk).
- Data quality: Missing values, outliers (e.g. `DAYS_EMPLOYED`), and categorical variables that require encoding.
Data is downloaded automatically when running the notebook’s data-loading section.
## Modeling Approach

Preprocessing applies the following steps:

1. **Outlier correction**: Anomalous values (e.g. `DAYS_EMPLOYED` = 365243) are replaced with NaN.
2. **Categorical encoding**:
   - Binary categoricals: `OrdinalEncoder`.
   - Multi-category: `OneHotEncoder(sparse_output=False, handle_unknown='ignore')`.
3. **Missing value imputation**: `SimpleImputer` with median strategy for all numerical features.
4. **Feature scaling**: `MinMaxScaler` to scale features to [0, 1].
The final model is a LightGBM-based pipeline:
- `VarianceThreshold(threshold=0.01)`: drop low-variance features.
- `SelectKBest(f_classif, k=75)`: keep the top 75 features.
- `LGBMClassifier`: `n_estimators=200`, `max_depth=12`, `learning_rate=0.05`, `num_leaves=31`, `subsample=0.8`, `colsample_bytree=0.8`, `min_child_samples=20`, `class_weight='balanced'`.
Three approaches were compared:
| Approach | Validation ROC AUC | Train–val gap |
|---|---|---|
| Random Forest (tuned) | 0.7302 | 0.0768 |
| LightGBM (basic) | 0.7531 | 0.0425 |
| LightGBM pipeline | 0.7493 | 0.0417 |
The LightGBM pipeline was chosen for its balance of performance, generalization, and efficiency. Hyperparameters were tuned with RandomizedSearchCV (3-fold CV, scoring='roc_auc', 15 iterations, random_state=42).
## Evaluation Metrics

**Primary metric:** ROC AUC (Area Under the ROC Curve).
ROC AUC is used because it is threshold-independent, suitable for imbalanced data, and reflects probability ranking quality. It is a standard choice for credit risk models.
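As a toy illustration of how the metric scores predicted probabilities (the labels and probabilities below are made up, not project results):

```python
from sklearn.metrics import roc_auc_score

# True labels (1 = payment difficulties) and predicted default probabilities.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.45, 0.60, 0.80, 0.25, 0.40]

# ROC AUC is the probability that a randomly chosen positive is ranked
# above a randomly chosen negative: 0.5 is random, 1.0 is perfect.
# Here 8 of the 9 positive/negative pairs are ordered correctly.
print(round(roc_auc_score(y_true, y_prob), 3))  # 0.889
```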
## Results

| Model | Train ROC AUC | Validation ROC AUC | Train–val gap |
|---|---|---|---|
| Random Forest | 0.8070 | 0.7302 | 0.0768 |
| LightGBM (basic) | 0.7956 | 0.7531 | 0.0425 |
| LightGBM pipeline | 0.7910 | 0.7493 | 0.0417 |
- Validation ROC AUC: 0.7493 for the selected LightGBM pipeline.
- Generalization: Train–validation gap of 0.0417 indicates limited overfitting.
- The pipeline (feature selection + LightGBM) is used as the final model for predictions.
## How to Run

Prerequisites:

- Python 3.8 or higher
- pip
- Git (for cloning)
Clone the repository:

```bash
git clone <your-repo-url>
cd credit-default-modeling
```

(Use your actual repo name in place of `credit-default-modeling` if you chose a different one.)
Create and activate a virtual environment.

Windows (PowerShell):

```powershell
python -m venv venv
.\venv\Scripts\Activate.ps1
# If you get an execution policy error:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
```

Linux/macOS:

```bash
python3 -m venv venv
source venv/bin/activate
```

Install dependencies:

```bash
python -m pip install --upgrade pip
pip install -r requirements.txt
```

Start Jupyter:

```bash
jupyter notebook
```

Open `credit_default_modeling.ipynb` and run all cells in order. Data is downloaded in the first section; the notebook covers loading, preprocessing, training, evaluation, and predictions.
Run the tests:

```bash
pytest tests/
```

Format the code:

```bash
isort --profile=black . && black --line-length 88 .
```

## Architecture Diagram

```
┌─────────────────────────────────────────────────────────────┐
│                       Raw Data Input                        │
│           (application_train_aai.csv, test files)           │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│                 Data Preprocessing Pipeline                 │
│   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐    │
│   │   Outlier    │   │ Categorical  │   │   Missing    │    │
│   │  Correction  │ → │   Encoding   │ → │    Value     │    │
│   │              │   │              │   │  Imputation  │    │
│   └──────────────┘   └──────────────┘   └──────────────┘    │
│                      ┌──────────────┐                       │
│                      │   Feature    │                       │
│                      │   Scaling    │                       │
│                      │   (MinMax)   │                       │
│                      └──────────────┘                       │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│                 Feature Selection Pipeline                  │
│             ┌──────────────┐   ┌──────────────┐             │
│             │   Variance   │   │ SelectKBest  │             │
│             │  Threshold   │ → │    (k=75)    │             │
│             │    (0.01)    │   │              │             │
│             └──────────────┘   └──────────────┘             │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│                       Model Training                        │
│      LightGBM Classifier (RandomizedSearchCV, roc_auc)      │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│               Model Evaluation & Predictions                │
│      ROC AUC on train/validation; probability outputs       │
└─────────────────────────────────────────────────────────────┘
```
## Project Structure

```
credit-default-modeling/
├── credit_default_modeling.ipynb   # Main notebook
├── GITHUB_UPLOAD.md                # Steps to publish repo with neutral name
├── README.md
├── README_og.md
├── requirements.txt
├── src/
│   ├── config.py
│   ├── data_utils.py
│   └── preprocessing.py
└── tests/
    ├── conftest.py
    ├── test_preprocessing.py
    └── test_data_utils.py
```
## Technologies Used

- Python 3.8+: Core language
- Pandas: Data manipulation
- NumPy: Numerical computation
- Scikit-learn: Preprocessing, feature selection, evaluation
- LightGBM: Gradient boosting classifier
- Matplotlib / Seaborn: Visualization
- Jupyter: Notebooks
- Pytest: Testing
- Black / isort: Code formatting
This project is for educational purposes.
Author: Marissa Singh