The goal of this project is to study a dataset of loan applications with truth labels indicating approval or rejection, and to build a model that predicts whether a new loan application will be approved or rejected. Because truth labels are present, this is a binary classification problem that can be solved with a supervised learning model. However, special care must be taken to mitigate the impact of class imbalance (i.e. the uneven distribution of approval vs. rejection cases) on classification performance. Particular attention is given to improving the model's ability to predict the minority class ("Yes") to support more accurate decision-making and benefit the financial firm.
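One common mitigation for class imbalance is to up-weight the minority class during training. As a minimal sketch (using scikit-learn and an illustrative label array, not the actual loan data), balanced class weights can be computed as follows:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 1 = approved ("Yes", minority), 0 = rejected (majority)
y = np.array([0] * 90 + [1] * 10)

# "balanced" weights are inversely proportional to class frequency,
# so the minority class receives the larger weight
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))
```

These weights can then be passed to most classifiers via a `class_weight` or `sample_weight` argument so that minority-class errors are penalized more heavily.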
Since loan approval prediction involves labeled data, supervised learning is a natural fit. Several supervised models were explored: a generalized linear model (GLM), the Least Absolute Shrinkage and Selection Operator (LASSO), random forest, a gradient boosting machine (GBM), and extreme gradient boosting (XGBoost). XGBoost was found to be the best-performing model.
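The training pipeline for a boosted model on imbalanced data can be sketched as below. This is illustrative only: it uses synthetic data in place of the loan dataset and scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost (whose scikit-learn-compatible API offers `scale_pos_weight` for the same purpose).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data: ~10% positives (approvals)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Up-weight minority-class samples, analogous to XGBoost's scale_pos_weight
ratio = (y_tr == 0).sum() / (y_tr == 1).sum()
w = np.where(y_tr == 1, ratio, 1.0)

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_tr, y_tr, sample_weight=w)

# F1 on the minority class is the metric of interest under imbalance
f1 = f1_score(y_te, clf.predict(X_te))
print(round(f1, 3))
```

Evaluating with minority-class F1 (rather than accuracy) is important here, since a model that always predicts "rejected" would still score about 90% accuracy on data this skewed.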
Unsupervised learning is useful when there are no labels and the goal is to find hidden patterns or anomalies.
An unsupervised approach was also attempted. In particular, density-based spatial clustering of applications with noise (DBSCAN) was used to check whether it would recover two clusters. DBSCAN was chosen over k-means or hierarchical clustering because it does not require the number of clusters to be specified in advance. Initially the model appeared to find two clusters and several anomalies; however, evaluation metrics such as precision, recall, and F1 score all had perfect values of 1, which raised alarms. Further debugging revealed that the truth labels had accidentally been included as a feature in the DBSCAN clustering, which produced two perfect clusters plus some noise points/anomalies. These results are shown in the appendix of the report.
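A correct DBSCAN run fits on the feature matrix only, never the labels. The sketch below (scikit-learn, with synthetic blobs standing in for the loan features) shows the intended usage; including the label column in `X` is exactly the bug described above, since the label perfectly separates the two groups.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Two well-separated synthetic feature clusters (labels discarded with _)
X, _ = make_blobs(n_samples=300, centers=2, cluster_std=0.5, random_state=0)

# Fit on features ONLY; scaling matters because eps is a distance threshold
X_scaled = StandardScaler().fit_transform(X)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)

# DBSCAN marks noise points with label -1; exclude them from the cluster count
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)
```

Note that DBSCAN still requires tuning `eps` and `min_samples`; it only removes the need to fix the number of clusters up front.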
See FinalReport_Team4.pdf for a detailed report covering the executive summary, exploratory data analysis (EDA), feature engineering, missing value imputation, feature transformations, models used, results, and comments on future improvements.