# Binary classification for loan default risk
This project implements a machine learning pipeline that predicts whether loan applicants will have payment difficulties. The solution supports credit risk assessment using application and loan data.
## Table of Contents

- Problem Statement
- Dataset Description
- Modeling Approach
- Evaluation Metrics
- Results
- How to Run
- Architecture Diagram
- Project Structure
- Technologies Used
## Problem Statement

The task is a binary classification problem: predict whether a person applying for home credit will be able to repay their debt.
- Target 1: Client will have payment difficulties (late payment of more than X days on at least one of the first Y installments of the loan in the sample).
- Target 0: Client will repay without such difficulties.
For each applicant, the model outputs the probability that the loan will not be repaid. The goal is to support lending decisions and default risk management.
## Dataset Description

The dataset is based on the Home Credit Default Risk Kaggle competition. This project uses the primary application files only:
- `application_train_aai.csv`: Training data with labels.
- `application_test_aai.csv`: Test data (no labels); used for final predictions.
- Task: Binary classification.
- Features: Applicant demographics, financial information, employment details, and housing/family information.
- Class distribution: Imbalanced (typical for credit risk).
- Data quality: Missing values, outliers (e.g. `DAYS_EMPLOYED`), and categorical variables that require encoding.
Data is downloaded automatically when running the notebook’s data-loading section.
## Modeling Approach

Preprocessing applies the following steps:

1. **Outlier correction**: Anomalous values (e.g. `DAYS_EMPLOYED` = 365243) are replaced with NaN.
2. **Categorical encoding**:
   - Binary categoricals: `OrdinalEncoder`.
   - Multi-category: `OneHotEncoder(sparse_output=False, handle_unknown='ignore')`.
3. **Missing value imputation**: `SimpleImputer` with median strategy for all numerical features.
4. **Feature scaling**: `MinMaxScaler` to scale features to [0, 1].
The final model is a LightGBM-based pipeline:
- `VarianceThreshold(threshold=0.01)`: drop low-variance features.
- `SelectKBest(f_classif, k=75)`: keep the top 75 features.
- `LGBMClassifier`: `n_estimators=200`, `max_depth=12`, `learning_rate=0.05`, `num_leaves=31`, `subsample=0.8`, `colsample_bytree=0.8`, `min_child_samples=20`, `class_weight='balanced'`.
Three approaches were compared:
| Approach | Validation ROC AUC | Train–val gap |
|---|---|---|
| Random Forest (tuned) | 0.7302 | 0.0768 |
| LightGBM (basic) | 0.7531 | 0.0425 |
| LightGBM pipeline | 0.7493 | 0.0417 |
The LightGBM pipeline was chosen for its balance of performance, generalization, and efficiency. Hyperparameters were tuned with RandomizedSearchCV (3-fold CV, scoring='roc_auc', 15 iterations, random_state=42).
## Evaluation Metrics

**Primary metric:** ROC AUC (Area Under the ROC Curve).
ROC AUC is used because it is threshold-independent, suitable for imbalanced data, and reflects probability ranking quality. It is a standard choice for credit risk models.
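As a toy illustration of how the metric scores predicted probabilities (the labels and probabilities below are made up, not project results):

```python
from sklearn.metrics import roc_auc_score

# True labels (1 = payment difficulties) and predicted default probabilities.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.45, 0.60, 0.80, 0.25, 0.40]

# ROC AUC is the probability that a randomly chosen positive is ranked
# above a randomly chosen negative: 0.5 is random, 1.0 is perfect.
# Here 8 of the 9 positive/negative pairs are ordered correctly.
print(round(roc_auc_score(y_true, y_prob), 3))  # 0.889
```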
## Results

| Model | Train ROC AUC | Validation ROC AUC | Train–val gap |
|---|---|---|---|
| Random Forest | 0.8070 | 0.7302 | 0.0768 |
| LightGBM (basic) | 0.7956 | 0.7531 | 0.0425 |
| LightGBM pipeline | 0.7910 | 0.7493 | 0.0417 |
- Validation ROC AUC: 0.7493 for the selected LightGBM pipeline.
- Generalization: Train–validation gap of 0.0417 indicates limited overfitting.
- The pipeline (feature selection + LightGBM) is used as the final model for predictions.
## How to Run

Prerequisites:

- Python 3.8 or higher
- pip
- Git (for cloning)
Clone the repository:

```bash
git clone <your-repo-url>
cd credit-default-modeling
```

(Use your actual repo name in place of `credit-default-modeling` if you chose a different one.)
Create and activate a virtual environment.

Windows (PowerShell):

```powershell
python -m venv venv
.\venv\Scripts\Activate.ps1
# If you get an execution policy error:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
```

Linux/macOS:

```bash
python3 -m venv venv
source venv/bin/activate
```

Install dependencies:

```bash
python -m pip install --upgrade pip
pip install -r requirements.txt
```

Start Jupyter:

```bash
jupyter notebook
```

Open `credit_default_modeling.ipynb` and run all cells in order. Data is downloaded in the first section; the notebook covers loading, preprocessing, training, evaluation, and predictions.
Run the tests:

```bash
pytest tests/
```

Format the code:

```bash
isort --profile=black . && black --line-length 88 .
```

## Architecture Diagram

```
┌─────────────────────────────────────────────────────────────┐
│                       Raw Data Input                        │
│           (application_train_aai.csv, test files)           │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│                 Data Preprocessing Pipeline                 │
│   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐    │
│   │   Outlier    │   │ Categorical  │   │   Missing    │    │
│   │  Correction  │ → │   Encoding   │ → │    Value     │    │
│   │              │   │              │   │  Imputation  │    │
│   └──────────────┘   └──────────────┘   └──────────────┘    │
│                      ┌──────────────┐                       │
│                      │   Feature    │                       │
│                      │   Scaling    │                       │
│                      │   (MinMax)   │                       │
│                      └──────────────┘                       │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│                 Feature Selection Pipeline                  │
│             ┌──────────────┐   ┌──────────────┐             │
│             │   Variance   │   │ SelectKBest  │             │
│             │  Threshold   │ → │    (k=75)    │             │
│             │    (0.01)    │   │              │             │
│             └──────────────┘   └──────────────┘             │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│                       Model Training                        │
│      LightGBM Classifier (RandomizedSearchCV, roc_auc)      │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│               Model Evaluation & Predictions                │
│      ROC AUC on train/validation; probability outputs       │
└─────────────────────────────────────────────────────────────┘
```
## Project Structure

```
credit-default-modeling/
├── credit_default_modeling.ipynb   # Main notebook
├── GITHUB_UPLOAD.md                # Steps to publish repo with neutral name
├── README.md
├── README_og.md
├── requirements.txt
├── src/
│   ├── config.py
│   ├── data_utils.py
│   └── preprocessing.py
└── tests/
    ├── conftest.py
    ├── test_preprocessing.py
    └── test_data_utils.py
```
## Technologies Used

- Python 3.8+: Core language
- Pandas: Data manipulation
- NumPy: Numerical computation
- Scikit-learn: Preprocessing, feature selection, evaluation
- LightGBM: Gradient boosting classifier
- Matplotlib / Seaborn: Visualization
- Jupyter: Notebooks
- Pytest: Testing
- Black / isort: Code formatting
This project is for educational purposes.
Author: Marissa Singh