GitHub - meetpotdar777/Credit-Risk-Prediction-using-Machine-Learning: 🏦 Machine Learning system for credit default prediction using a RandomForestClassifier. Features an end-to-end pipeline including synthetic financial data generation, robust preprocessing (ColumnTransformer), and comprehensive evaluation with ROC-AUC and Confusion Matrices.

Credit Risk Prediction using Machine Learning 💰🏦

Project Overview

Credit risk prediction is a fundamental aspect of financial risk management. Lenders, from banks to individual investors, need to accurately assess the likelihood of a borrower defaulting on a loan. This project demonstrates how Machine Learning (ML) can be applied to predict credit risk, offering a more robust and data-driven approach compared to traditional methods.

The system showcases the end-to-end process: from generating synthetic financial data that mimics real-world patterns, through data preprocessing, to training and evaluating a classification model, and finally, making predictions for new applicants.

✨ Features

1. Synthetic Data Generation: Creates a simulated dataset with various financial and demographic features (income, age, loan amount, credit score, employment duration, etc.) and a 'default' target variable, designed to reflect realistic credit risk correlations and class imbalance.

2. Data Preprocessing Pipeline: Utilizes ColumnTransformer with StandardScaler for numerical features and OneHotEncoder for categorical features, ensuring consistent and robust data transformation for both training and new predictions.

3. Machine Learning Model: Employs a RandomForestClassifier, a powerful ensemble model, to learn complex patterns within the data and predict credit risk.

4. Model Evaluation: Provides comprehensive evaluation metrics including:

   - Confusion Matrix: To visualize True Positives, True Negatives, False Positives, and False Negatives.

   - Classification Report: Detailing Precision, Recall, and F1-score for both 'No Default' and 'Default' classes, crucial for imbalanced datasets.

   - Accuracy Score: Overall model correctness.

   - ROC AUC Score & Curve: To assess the model's ability to discriminate between defaulting and non-defaulting borrowers across various thresholds.

5. New Borrower Prediction: A dedicated function to simulate predicting risk for unseen loan applicants, outputting both a risk label ("High Risk of Default" / "Low Risk of Default") and a precise probability.

🧠 How It Works (Technical Flow)

generate_synthetic_data():

Creates a pandas.DataFrame with a specified number of samples.

Populates columns with realistic distributions for income, age, loan amount, credit score, etc.

Calculates a default_probability for each row based on these features, intentionally correlating higher risk factors (e.g., low income, low credit score, high loan amount) with higher default probabilities.

Crucially, it ensures a realistic class imbalance by assigning 'Default' (class 1) to a controlled minority percentage of the highest-risk individuals, and 'No Default' (class 0) to the majority. The data is then shuffled.
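The labelling strategy described above can be sketched as follows. This is a minimal illustration, not the project's actual generate_synthetic_data(): the function name, the column distributions, the risk-score coefficients, and the 15% default rate (chosen to match the 750/5000 split in the sample output) are all assumptions.

```python
import numpy as np
import pandas as pd


def generate_synthetic_data_sketch(n_samples=5000, default_rate=0.15, seed=42):
    """Illustrative stand-in for generate_synthetic_data(); all numbers are assumed."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "income": rng.lognormal(mean=10.8, sigma=0.5, size=n_samples),
        "age": rng.integers(21, 70, size=n_samples),
        "loan_amount": rng.uniform(1_000, 50_000, size=n_samples),
        "credit_score": rng.integers(300, 851, size=n_samples),
    })

    # Hypothetical risk score: rises with loan amount, falls with income and credit score.
    risk = (
        -0.00002 * df["income"]
        - 0.01 * df["credit_score"]
        + 0.00005 * df["loan_amount"]
        + rng.normal(0, 0.5, size=n_samples)  # noise so labels are not perfectly separable
    )
    # Standardize and squash to (0, 1) so the score reads as a probability.
    df["default_probability"] = 1 / (1 + np.exp(-(risk - risk.mean()) / risk.std()))

    # Assign class 1 to the riskiest `default_rate` fraction of rows, class 0 to the rest.
    n_defaults = int(round(n_samples * default_rate))
    df["default"] = 0
    top_idx = df["default_probability"].nlargest(n_defaults).index
    df.loc[top_idx, "default"] = 1

    # Shuffle so the rows carry no ordering information.
    return df.sample(frac=1, random_state=seed).reset_index(drop=True)


data = generate_synthetic_data_sketch()
print(data["default"].value_counts())
```

Labelling the top fraction by risk, rather than sampling each label from its probability, fixes the class ratio exactly, which is why the minority share is "controlled".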

preprocess_data():

Identifies numerical and categorical columns in the raw data.

Initializes a ColumnTransformer (a powerful tool for applying different transformations to different columns).

StandardScaler: Applied to numerical features to standardize them to zero mean and unit variance.

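OneHotEncoder(handle_unknown='ignore', drop='first'): Applied to categorical features to convert them into binary indicator columns. handle_unknown='ignore' encodes categories not seen during fitting as all zeros instead of raising an error, and drop='first' removes one redundant column per feature to avoid perfect multicollinearity. Note that with both options set, an unseen category is encoded the same way as the dropped first category.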

Fits this ColumnTransformer on the training data. This step learns the scaling parameters and the categories for one-hot encoding.

Transforms the data, resulting in a preprocessed X DataFrame ready for the ML model.

train_model():

Splits the preprocessed data into training and testing sets (X_train, X_test, y_train, y_test).

Initializes a RandomForestClassifier with:

n_estimators=100: Builds 100 decision trees.

random_state=42: Ensures reproducibility.

class_weight='balanced': Important for imbalanced datasets, it automatically adjusts weights inversely proportional to class frequencies to combat bias towards the majority class.

min_samples_leaf=1: The scikit-learn default. It allows the trees to learn very specific patterns, which can be useful for capturing the minority class (defaults), at the cost of a higher risk of overfitting.

Trains the model on the X_train and y_train data.
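The training step can be sketched as below. The data comes from scikit-learn's make_classification as a stand-in for the preprocessed features, and the 80/20 split with stratify=y is an assumption (the sample output shows a 4000/1000 split that preserves the class ratio, but the script's exact arguments are not given here). The last lines show what class_weight='balanced' computes: n_samples / (n_classes * class_count), which for an 85/15 split is roughly 0.59 for the majority class and 3.33 for the minority class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Stand-in for the preprocessed data: 13 features, roughly 15% positives.
X, y = make_classification(
    n_samples=2000, n_features=13, n_informative=6,
    weights=[0.85, 0.15], random_state=42,
)

# Assumed split: 80/20, stratified so both sets keep the same default rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hyperparameters as listed above.
model = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    class_weight="balanced",
    min_samples_leaf=1,
)
model.fit(X_train, y_train)

# The weights that 'balanced' derives from the training labels.
weights = compute_class_weight("balanced", classes=np.unique(y_train), y=y_train)
print(dict(zip(np.unique(y_train), weights.round(2))))
```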

evaluate_model():

Uses the trained model to make predictions on the unseen X_test data.

Prints a detailed Confusion Matrix and Classification Report (showing precision, recall, f1-score per class).

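Calculates and plots the Receiver Operating Characteristic (ROC) Curve and its Area Under the Curve (AUC) score, which summarizes the model's discriminative power across all classification thresholds. On imbalanced data it is best read alongside the per-class precision and recall, since a high AUC can coexist with low recall on the minority class at the default threshold.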
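As a worked check of how these metrics relate, the sketch below rebuilds label arrays from the four counts in the sample confusion matrix shown in the Usage section (844, 6, 60, 90) and recomputes the reported figures with scikit-learn. The arrays are reconstructed for illustration only; they are not the script's actual predictions. ROC AUC cannot be recovered from hard labels because it needs predicted probabilities, so it is shown separately on a four-point toy example.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Counts taken from the sample output: rows are true classes, columns are predictions.
tn, fp, fn, tp = 844, 6, 60, 90

# Reconstruct label arrays that produce exactly this confusion matrix.
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

cm = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)    # (tn + tp) / total
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)

print(cm)
print(classification_report(y_true, y_pred))
print(f"Accuracy: {accuracy:.4f}")

# ROC AUC is computed from scores, not labels: the fraction of
# (positive, negative) pairs in which the positive has the higher score.
toy_auc = roc_auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(f"Toy ROC AUC: {toy_auc:.2f}")
```

The recall of 0.60 on the 'Default' class means 60 of 150 actual defaulters were missed at the default threshold, even though overall accuracy is above 93%.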

predict_new_borrower_risk():

Takes raw input features for a new borrower.

Crucially, it uses the same fitted ColumnTransformer (preprocessor global variable) to transform this new raw data. This guarantees that the new data is preprocessed exactly how the training data was, preventing feature mismatch errors.

The transformed data is then fed to the trained RandomForestClassifier to get a risk prediction label and a probability of default.

🚀 Setup and Installation

To get this project up and running on your local machine, follow these steps:

Save the Code:

Save the entire Python code block into a file named credit_risk_prediction.py on your computer.

Install Python:

Ensure you have Python 3.x installed (e.g., Python 3.8+). You can download it from python.org.

Install Required Libraries:

Open your terminal or command prompt, navigate to the directory where you saved credit_risk_prediction.py, and run the following command:

pip install pandas numpy scikit-learn matplotlib seaborn

🎮 Usage

Navigate to the directory containing credit_risk_prediction.py in your terminal or command prompt and run the script:

python credit_risk_prediction.py

The script will:

Generate synthetic data.

Preprocess the data.

Train the machine learning model.

Evaluate the model's performance on a test set, displaying metrics and an ROC curve plot.

Make predictions for three example borrowers with different risk profiles, printing their predicted risk and probability of default.

You will see output similar to this, demonstrating the model's ability to differentiate between low and high-risk borrowers:

--- Credit Risk Prediction System ---

Generating 5000 synthetic data samples...
Synthetic data generation complete.

Sample of generated data:

... (truncated DataFrame output) ...

Default distribution:
default
0    4250
1     750
Name: count, dtype: int64
Starting data preprocessing...
Data preprocessing complete.

Training data shape: (4000, 13)
Testing data shape: (1000, 13)
Training the RandomForestClassifier model...
Model training complete.

--- Model Evaluation ---

Confusion Matrix:

[[844   6]
 [ 60  90]]

Classification Report:

              precision    recall  f1-score   support

           0       0.93      0.99      0.96       850
           1       0.94      0.60      0.73       150

    accuracy                           0.93      1000
   macro avg       0.94      0.80      0.85      1000
weighted avg       0.93      0.93      0.93      1000


Accuracy: 0.9340
ROC AUC Score: 0.9845
Evaluation complete. ROC Curve displayed.

--- Testing Prediction for a New Borrower ---

Predicting credit risk for new borrower...
Prediction: Low Risk of Default
Probability of Default: 0.00%

--- Testing Prediction for another New Borrower ---

Predicting credit risk for new borrower...
Prediction: High Risk of Default
Probability of Default: 92.00%

--- Testing Prediction for a Very High-Risk Borrower (as per your specific scenario) ---

Predicting credit risk for new borrower...
Prediction: High Risk of Default
Probability of Default: 90.00%

--- All operations completed successfully! ---

⚠️ Limitations

This project is a strong conceptual demonstration but has inherent limitations for production use:

Synthetic Data: The dataset is artificially generated. Real-world financial data is far more complex, noisy, and often requires extensive feature engineering, domain expertise, and handling of missing values.

Simplified Model: While RandomForestClassifier is robust, advanced credit scoring often employs more specialized models, deep learning, or other ensemble methods (e.g., gradient boosting) tailored to highly imbalanced data and specific business objectives (e.g., minimizing false negatives).

No Time-Series Aspect: Credit risk often has a time-series component (e.g., payment history, economic trends) which is not captured here.

Explainability: While Random Forests are somewhat interpretable, for high-stakes financial decisions, more transparent models (e.g., Logistic Regression with interpretable features) or techniques like SHAP/LIME for explainability are crucial.

Regulatory Compliance: Real-world credit risk models must adhere to strict regulatory guidelines (e.g., fair lending laws, model validation standards) which are not covered in this example.

🔮 Future Enhancements

Real-World Data Integration: Load and work with actual anonymized credit datasets (e.g., from Kaggle, UCI Machine Learning Repository).

Advanced Feature Engineering: Create features like Debt-to-Income ratio, credit utilization, payment history patterns, etc.

Hyperparameter Tuning: Use techniques like GridSearchCV or RandomizedSearchCV to find optimal hyperparameters for the RandomForest model.

Advanced Imbalance Handling: Explore techniques like SMOTE (Synthetic Minority Over-sampling Technique), ADASYN, or different sampling strategies.

Different Models: Experiment with other classification algorithms (e.g., Gradient Boosting Machines like XGBoost, LightGBM; Neural Networks).

Model Interpretability: Implement SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) to understand feature importance for individual predictions.

Web Application: Build a simple web interface (e.g., using Flask or Streamlit) to allow interactive input and prediction.

Deployment: Learn about deploying ML models into a production environment.
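As a starting point for the hyperparameter tuning item above, a grid search could look like the sketch below. The parameter grid, the stand-in data from make_classification, the reduced n_estimators, and the choice of ROC AUC as the selection metric are all assumptions for illustration, not settings taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in imbalanced data (roughly 15% positives).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=42
)

# Small illustrative grid; a real search would cover more values.
param_grid = {
    "max_depth": [5, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(
        n_estimators=50, random_state=42, class_weight="balanced"
    ),
    param_grid=param_grid,
    scoring="roc_auc",  # rank candidates by cross-validated ROC AUC
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(f"Best cross-validated ROC AUC: {search.best_score_:.3f}")
```

RandomizedSearchCV takes the same estimator and scoring arguments but samples a fixed number of candidates, which scales better once the grid grows.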


