Heart Attack Prediction in Indonesia - Data Engineering Zoomcamp
This project builds an end-to-end data pipeline for analyzing heart attack prediction in Indonesia. Using Google Cloud Platform (GCP) and various data engineering tools, the pipeline automates data ingestion, transformation, and visualization to provide insights via Power BI dashboards.
Cardiovascular diseases, including heart attacks, are a leading cause of death worldwide. This project automates the processing, transformation, and visualization of a dataset related to heart attack prediction in Indonesia using cloud-based infrastructure. The goal is to help identify risk factors and trends through data analytics.
- Cloud Infrastructure: Terraform (GCP setup)
- Data Orchestration: Apache Airflow (Google Cloud Composer)
- Storage & Processing: Google Cloud Storage (GCS) & BigQuery
- Data Transformation: dbt (data modeling)
- Dashboarding: Power BI
- CI/CD: GitHub Actions
- Scripting & Automation: Python & Bash
The pipeline consists of automated scripts located in the scripts/ folder, which handle every stage of the workflow:
1️⃣ Set up the Virtual Environment & Install Dependencies
2️⃣ Download & Extract the Dataset
3️⃣ Deploy Cloud Infrastructure (Terraform)
4️⃣ Configure Airflow & Authenticate with Google Cloud
5️⃣ Run Data Transformations with dbt
6️⃣ Analyze and Visualize Data using Power BI
heart-attack-prediction/
├── terraform/            # Infrastructure as Code (Terraform)
├── airflow/              # Airflow DAGs for orchestration
├── data/                 # Raw and processed dataset storage
├── dbt/                  # dbt models for transformation
├── powerbi/              # Power BI dashboards
├── scripts/              # Helper scripts (data ingestion, automation)
├── service-account-key/  # Store your GCP service account JSON key
├── .github/              # GitHub Actions for CI/CD
├── requirements.txt      # Python dependencies
└── README.md             # Project documentation
git clone https://github.com/JoshPola96/heart-attack-data-pipeline.git
cd heart-attack-data-pipeline
Before running any scripts, set up a virtual environment:
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
Some scripts may require additional dependencies; install them as needed while working through the steps below.
Before running Terraform and Airflow, you must set up authentication:
- Navigate to Google Cloud Console
- Go to IAM & Admin > Service Accounts
- Create a new service account with the necessary permissions (BigQuery Admin, Storage Admin, Composer Admin).
- Generate a JSON key file and download it.
mv ~/Downloads/<your-key-file>.json ~/heart-attack-prediction/service-account-key/
export GOOGLE_APPLICATION_CREDENTIALS=~/heart-attack-prediction/service-account-key/<your-key-file>.json
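Before moving on, it can save time to confirm that the key file is in place and is actually a service-account key. A minimal sketch in Python (check_gcp_key is a hypothetical helper, not part of the repo's scripts):

```python
import json
import os


def check_gcp_key(path: str) -> bool:
    """Return True if *path* exists and parses as a GCP service-account key.

    Hypothetical sanity check to run before Terraform or Airflow commands.
    """
    if not os.path.isfile(path):
        return False
    try:
        with open(path) as fh:
            key = json.load(fh)
    except (OSError, json.JSONDecodeError):
        return False
    # A usable key file declares its type and the service account's email.
    return key.get("type") == "service_account" and "client_email" in key


# Usage: check_gcp_key(os.environ["GOOGLE_APPLICATION_CREDENTIALS"])
```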
Execute the following scripts in the scripts/ folder:
bash scripts/download-dataset.sh
This script will download the dataset into the data/ folder.
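For orientation, the download step can be sketched in Python as below. DATASET_URL is a placeholder (the real source is defined in scripts/download-dataset.sh), and the zip packaging is an assumption:

```python
import io
import zipfile
from pathlib import Path
from urllib.request import urlopen

# Placeholder URL -- the actual dataset location is set in the bash script.
DATASET_URL = "https://example.com/heart-attack-indonesia.zip"


def extract_zip(payload: bytes, dest: str = "data") -> list[str]:
    """Unpack an in-memory zip archive into *dest* and return its file names."""
    Path(dest).mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(dest)
        return archive.namelist()


def download_dataset(url: str = DATASET_URL, dest: str = "data") -> list[str]:
    """Fetch the archive over HTTP and extract it into the data/ folder."""
    with urlopen(url) as response:  # network call; placeholder URL above
        return extract_zip(response.read(), dest)
```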
bash scripts/execute-terraform.sh
This will provision:
- A Google Cloud Storage (GCS) bucket for data storage.
- A BigQuery dataset for structured data processing.
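These resources roughly correspond to Terraform definitions like the sketch below; the resource names, bucket name, and region are assumptions, and the real configuration lives in terraform/:

```hcl
# Sketch only -- see terraform/ for the actual definitions.
resource "google_storage_bucket" "heart_attack_data" {
  name     = "heart-attack-prediction-data" # bucket names must be globally unique
  location = "ASIA-SOUTHEAST2"              # Jakarta region, assumed
}

resource "google_bigquery_dataset" "heart_attack" {
  dataset_id = "heart_attack_prediction"
  location   = "asia-southeast2"
}
```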
Before running Airflow, update the execute-airflow.sh script with the correct service account key file path.
Make sure the service account key JSON file is in:
~/heart-attack-prediction/service-account-key/
If needed, rename it to match the expected format:
mv ~/Downloads/<your-key-file>.json ~/heart-attack-prediction/service-account-key/heart-attack-dataset.json
Modify the scripts/execute-airflow.sh script to ensure it correctly sets up Google Cloud authentication:
#!/bin/bash
set -e
# Activate virtual environment
source ~/heart-attack-prediction/venv/bin/activate
# Set up GCP authentication
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/heart-attack-prediction/service-account-key/<your-key-file>.json"
export AIRFLOW_HOME=~/airflow
Run the script:
bash scripts/execute-airflow.sh
Ensure that your profiles.yml file is properly configured:
Navigate to ~/.dbt/profiles.yml and update the service account key path:

bigquery:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: <YOUR_GCP_PROJECT_ID>
      dataset: <YOUR_BIGQUERY_DATASET>
      threads: 4
      keyfile: ~/heart-attack-prediction/service-account-key/<your-key-file>.json
Run the transformation:
bash scripts/execute-dbt.sh
This will clean and transform the dataset in BigQuery.
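For illustration, a staging model in dbt/ might look like the sketch below; the model name, source name, and column names are assumptions about the dataset schema, not the repo's actual models:

```sql
-- models/staging/stg_heart_attack.sql (illustrative name)
-- Column names below are assumed; adjust to the real source schema.
select
    cast(age as int64)                 as age,
    lower(gender)                      as gender,
    cast(cholesterol_level as float64) as cholesterol_level,
    cast(heart_attack as bool)         as had_heart_attack
from {{ source('heart_attack', 'raw_heart_attack') }}
```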
🚨 Important Note: Power BI cannot be fully automated in the free version. However, you can manually open and interact with the report.
The Power BI report (.pbix) file is available in the powerbi/ folder. A PDF export of the report, with minimal formatting, is also available in the same folder.
1️⃣ Download and Install Power BI Desktop (if not already installed)
2️⃣ Open the .pbix File
- Navigate to the powerbi/ folder in your local project directory.
- Double-click the .pbix file to open it in Power BI Desktop.
3️⃣ Ensure Data Refresh is Enabled
- If prompted, sign in with a Microsoft Account to access cloud-based data.
- Click Transform Data → Data Source Settings and update the BigQuery credentials if necessary.
- Click Refresh in the toolbar to load the latest data from BigQuery.
4️⃣ Explore the Visualizations
- Use the built-in charts, tables, and filters to analyze heart attack risk factors and trends in Indonesia.
GitHub Actions is configured to automate:
✅ Terraform Infrastructure Deployment
✅ Airflow DAG Upload & Scheduling
✅ dbt Transformation Execution
📌 Power BI Report Updates (Future Scope)
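The Terraform part of such a workflow might be sketched as below; the file name, job layout, and the GCP_SA_KEY secret name are assumptions, not the repo's actual workflow:

```yaml
# Sketch of a workflow such as .github/workflows/terraform.yml (assumed name).
name: deploy-infrastructure
on:
  push:
    branches: [main]
jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Terraform apply
        run: |
          terraform -chdir=terraform init
          terraform -chdir=terraform apply -auto-approve
        env:
          # Service account key stored as a repository secret (assumed name).
          GOOGLE_CREDENTIALS: ${{ secrets.GCP_SA_KEY }}
```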
✔️ Implement Machine Learning models for predictive analytics.
✔️ Automate Power BI dashboard deployment.
✔️ Expand the dataset to include more demographic insights.