This project addresses data challenges in the lending industry by creating an end-to-end data pipeline for processing and analyzing Lending Club data. It enables financial institutions to:
- ✅ Process and clean large volumes of lending data for analysis
- ✅ Identify patterns in loan defaults and repayments
- ✅ Create risk profiles for borrowers
- ✅ Generate loan scores to assess the quality of loans
- ✅ Provide actionable insights through analytics dashboards
The platform helps lending institutions make better-informed decisions, reduce default risks, and optimize their lending strategies through data-driven insights.
This project is built entirely on Google Cloud Platform (GCP), following Infrastructure as Code (IaC) principles, using:
| Component | Purpose |
|---|---|
| Google Cloud Storage (GCS) | Data lake for raw and processed data |
| Google Dataproc | Managed Spark service for data processing |
| BigQuery | Data warehouse |
| Metabase | Data visualization and dashboards |
| Terraform | Infrastructure as Code for reproducible environments |
| Kestra | Workflow Orchestration |
| Docker | Containerized services (Kestra, Metabase) |
The project implements a comprehensive batch data processing pipeline with:
- Data ingestion from source to GCS data lake
- Multiple transformation steps with dependencies
- Orchestrated workflow using Kestra
The data pipeline is structured as a sequence of PySpark jobs. Each step is a PySpark job submitted to Dataproc, with dependencies between steps managed by the Kestra orchestrator.
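For illustration, here is a minimal sketch of how a single pipeline step can be submitted to Dataproc with the `google-cloud-dataproc` Python client (in the project itself, Kestra drives these submissions; the project ID, region, cluster name, and script path below are placeholders, not the project's actual values):

```python
from google.cloud import dataproc_v1

# Placeholder values; substitute your own project, region, cluster, and script.
PROJECT_ID = "my-project"
REGION = "us-central1"
CLUSTER = "lending-cluster"

# The regional endpoint must match the cluster's region.
job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": CLUSTER},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/code/clean_loans.py"},
}

# Submit the PySpark job and block until it finishes.
operation = job_client.submit_job_as_operation(
    request={"project_id": PROJECT_ID, "region": REGION, "job": job}
)
result = operation.result()
print(f"Job {result.reference.job_id} finished with state {result.status.state.name}")
```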
- Tables are partitioned by date fields (e.g., loan issue date)
- Clustered by frequently queried fields (e.g., member_id, loan_status)
- Star schema design with fact and dimension tables for efficient querying
This structure optimizes query performance for the analytics use cases, reducing query costs and improving response time.
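As a sketch of this layout, a partitioned and clustered fact table can be created with the `google-cloud-bigquery` client. The project, dataset, table, and column names here are illustrative assumptions, not the project's exact schema:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Illustrative fact-table schema; the real pipeline defines its own columns.
schema = [
    bigquery.SchemaField("loan_id", "STRING"),
    bigquery.SchemaField("member_id", "STRING"),
    bigquery.SchemaField("loan_status", "STRING"),
    bigquery.SchemaField("loan_amount", "NUMERIC"),
    bigquery.SchemaField("issue_date", "DATE"),
]

table = bigquery.Table("my-project.lending.fact_loans", schema=schema)

# Partition by the loan issue date...
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.MONTH, field="issue_date"
)
# ...and cluster by the most frequently filtered fields.
table.clustering_fields = ["member_id", "loan_status"]

client.create_table(table, exists_ok=True)
```

Pruning partitions by issue date and colocating rows by `member_id` and `loan_status` means typical dashboard queries scan only a fraction of the table.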
The project uses Spark for comprehensive data transformations:
- Data cleaning and normalization
- Feature engineering for analytics
- Complex aggregations and calculations
- Creation of unified views for analytics
PySpark scripts handle all transformations with a focus on scalability and performance.
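A minimal sketch of the transformation style involved (the bucket path is a placeholder, and the column names come from the public Lending Club CSV rather than the project's exact code):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lending-transform").getOrCreate()

# Read the raw CSV from the data lake (bucket name is a placeholder).
raw = spark.read.csv(
    "gs://my-bucket/data/accepted_2007_to_2018Q4.csv",
    header=True, inferSchema=True,
)

cleaned = (
    raw.dropDuplicates(["id"])
       .filter(F.col("loan_amnt").isNotNull())
       # Normalization: strip the "%" suffix so interest rates become numeric.
       .withColumn("int_rate", F.regexp_replace("int_rate", "%", "").cast("double"))
       # Feature engineering: a binary default flag for downstream analytics.
       .withColumn(
           "is_default",
           F.when(F.col("loan_status") == "Charged Off", F.lit(1)).otherwise(F.lit(0)),
       )
)

# Write the processed data back to the lake for the next pipeline step.
cleaned.write.mode("overwrite").parquet("gs://my-bucket/processed/loans/")
```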
The analytics dashboard in Metabase provides:
- Distribution of loans by status, grade, and purpose
- Temporal trends in loan issuance and repayment
- Default rate analysis by demographic segments
- Loan score distribution and risk categorization
- Google Cloud Platform account with billing enabled
- `gcloud` CLI installed and configured
- Terraform installed (for infrastructure setup)
1. Clone the repository

   ```bash
   git clone https://github.com/abhayra12/lending_data_analytics.git
   cd lending-data-analytics
   ```

2. Set up GCP credentials

   ```bash
   # Set up a service account with appropriate permissions and download the JSON key.
   # Save it as gcp-creds.json in the project root.
   export GOOGLE_APPLICATION_CREDENTIALS="$(pwd)/gcp-creds.json"
   ```

3. Deploy infrastructure with Terraform

   ```bash
   cd terraform
   terraform init
   terraform apply
   ```

4. Upload initial data to GCS

   I have created a consolidated script that downloads the Kaggle data and uploads it, along with the project scripts, to GCS:

   ```bash
   # Install required dependencies
   pip install -r requirements.txt

   # Download Kaggle data and upload to GCS
   python scripts/gcs_upload.py lending_ara --kaggle

   # Upload Python scripts to GCS
   python scripts/gcs_upload.py lending_ara --scripts

   # Or do both at once
   python scripts/gcs_upload.py lending_ara --all
   ```
   This script:
   - Downloads the Lending Club dataset from Kaggle directly to your local `data/` folder
   - Automatically finds the CSV files in the downloaded dataset
   - Copies the target CSV file (`accepted_2007_to_2018Q4.csv`) to your data directory
   - Uploads the CSV file to the GCS bucket
   - Optionally uploads all Python scripts to the `code/` folder in the GCS bucket
   Note: You need to have Kaggle credentials configured (typically `~/.kaggle/kaggle.json`) before running this script.
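   For reference, the core of the upload step boils down to something like this sketch with the `google-cloud-storage` client (the bucket name and object paths below are placeholders, not the script's exact logic):

   ```python
   from google.cloud import storage

   # Placeholder bucket and paths; gcs_upload.py derives these from its arguments.
   client = storage.Client()
   bucket = client.bucket("lending_ara")

   # Upload the dataset CSV into the bucket's data/ prefix.
   blob = bucket.blob("data/accepted_2007_to_2018Q4.csv")
   blob.upload_from_filename("data/accepted_2007_to_2018Q4.csv")
   print(f"Uploaded to gs://{bucket.name}/{blob.name}")
   ```
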
5. Run the data pipeline

   I have created a Kestra workflow for the data pipeline. Start the Kestra service:

   ```bash
   cd docker/kestra
   docker compose up -d
   ```

   Open the Kestra UI:

   ```bash
   open http://localhost:8080
   ```

   Copy the flows from `docker/kestra/flows` into the Kestra UI and execute them there.

6. Access the dashboard

   ```bash
   cd docker/metabase
   docker compose up -d

   # Open in browser
   open http://localhost:3000
   ```

   Create a new Metabase account and log in. Select the `Lending Club` database and the `loan_data` table.
| Issue | Resolution |
|---|---|
| Dataproc job submission fails | Check the Dataproc cluster logs |
| BigQuery access issues | Verify service account permissions |
| Data-related issues | Inspect the intermediate outputs in GCS |
- Add real-time data processing for immediate insights



