University competition: Machine Learning Course (Central University)
Platform: Kaggle
Organizer: Central University & T-Bank
Timeline: November 2025
Build a machine learning model to predict early loan repayment (attrition) using T-Bank credit product data. When customers repay loans early, the bank earns less interest revenue, so predicting early repayment at the application stage matters for profitability.
- Target variable: `a6_flg` (early repayment flag)
- Products: 4 credit products (product_1 to product_4)
- Time period: data split by months (`month_dt`)
- Features: ~100+ features (feature_0 to feature_N)
Temporal distribution shift: the test set differs significantly from the training set in its temporal distribution. This requires monitoring model stability across months and controlling overfitting.
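One way to monitor stability across months is to compute ROC-AUC separately for each value of `month_dt`. The sketch below is illustrative only: the helper name `auc_by_month`, the `score` column, and the synthetic data are assumptions, not the notebook's actual code.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def auc_by_month(df: pd.DataFrame, score_col: str = "score",
                 target_col: str = "a6_flg",
                 month_col: str = "month_dt") -> pd.Series:
    """Return ROC-AUC computed separately for each month."""
    result = {}
    for month, group in df.groupby(month_col):
        # ROC-AUC is undefined when a month contains only one class.
        if group[target_col].nunique() < 2:
            result[month] = np.nan
        else:
            result[month] = roc_auc_score(group[target_col], group[score_col])
    return pd.Series(result, name="roc_auc").sort_index()


# Synthetic illustration only: three months, scores loosely tied to the target.
rng = np.random.default_rng(42)
n = 3000
demo = pd.DataFrame({
    "month_dt": rng.choice(["2025-01", "2025-02", "2025-03"], size=n),
    "a6_flg": rng.integers(0, 2, size=n),
})
demo["score"] = 0.3 * demo["a6_flg"] + rng.random(n)

monthly_auc = auc_by_month(demo)
print(monthly_auc)
print("spread:", monthly_auc.max() - monthly_auc.min())
```

A large gap between the best and worst month would suggest the model depends on patterns that do not carry over in time.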
Missing values:
- Identified many features with high missing rates (50%+, 70%+)
- Analyzed feature importance for high-missing features using RandomForest
- Removed features with >70% missing values
- Applied median imputation (`SimpleImputer`) for remaining features
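The missing-value steps above can be sketched roughly as follows. The function names, the temporary median fill before fitting the forest, and the synthetic data are assumptions made for illustration; only the 50% and 70% thresholds, `RandomForestClassifier`, and `SimpleImputer` come from the description.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer


def importance_of_sparse_features(X: pd.DataFrame, y, min_missing: float = 0.50) -> pd.Series:
    """Rank features whose missing rate is at least `min_missing` by RF importance."""
    missing_rate = X.isna().mean()
    # Temporary median fill so the forest can be fitted (an assumption of this sketch).
    X_filled = X.fillna(X.median())
    forest = RandomForestClassifier(n_estimators=100, random_state=42)
    forest.fit(X_filled, y)
    importance = pd.Series(forest.feature_importances_, index=X.columns)
    return importance[missing_rate >= min_missing].sort_values(ascending=False)


def clean_features(X: pd.DataFrame, drop_threshold: float = 0.70) -> pd.DataFrame:
    """Drop columns with a missing rate above the threshold, median-impute the rest."""
    missing_rate = X.isna().mean()
    kept = missing_rate[missing_rate <= drop_threshold].index
    imputer = SimpleImputer(strategy="median")
    values = imputer.fit_transform(X[kept])
    return pd.DataFrame(values, columns=kept, index=X.index)


# Synthetic illustration only: four features with different missing rates.
rng = np.random.default_rng(0)
n = 500
X_demo = pd.DataFrame(rng.normal(size=(n, 4)),
                      columns=[f"feature_{i}" for i in range(4)])
y_demo = (X_demo["feature_0"] + rng.normal(scale=0.5, size=n) > 0).astype(int)
for col, rate in {"feature_1": 0.20, "feature_2": 0.60, "feature_3": 0.85}.items():
    X_demo.loc[rng.random(n) < rate, col] = np.nan

sparse_importance = importance_of_sparse_features(X_demo, y_demo)
X_clean = clean_features(X_demo)
print(sparse_importance)
print(X_clean.columns.tolist())
```

In a real pipeline the imputer would be fitted on the training split only and then applied to validation and test data, so that test-set statistics do not leak into training.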
Class imbalance:
- Target variable shows imbalance (attrition is a rarer event)
- Used stratified validation to preserve class proportions
Model choice: CatBoostClassifier
Rationale:
- Native handling of categorical features
- Resistant to overfitting (Ordered Boosting)
- High performance on tabular data
- Built-in missing value handling
Hyperparameters:
    CatBoostClassifier(
        iterations=700,
        learning_rate=0.03,
        depth=6,
        eval_metric='AUC',
        random_state=42
    )

Validation strategy:
- Train-test split with `stratify=y` (80/20)
- ROC-AUC monitoring on validation set
- Early stopping on the validation set, keeping the best iteration with `use_best_model=True`
- Baseline: 0.73707 ROC-AUC
- Kaggle leaderboard: 45th place, score: 0.75046
- Removing highly sparse features (>70% missing) improved model stability
- CatBoost demonstrated robustness to temporal shift through overfitting control
- Median imputation proved effective for numerical features with moderate missing rates
- Data processing: `pandas`, `numpy`
- Visualization: `matplotlib`
- Modeling: `CatBoost`, `RandomForestClassifier` (for feature importance)
- Metrics: `roc_auc_score`
- Platform: Kaggle Notebooks (NVIDIA Tesla T4 GPU)
- Language: Python 3.11
Competition link: Kaggle — CU 2025 Scoring