# 📊 TPC-H Analytics Pipeline: Snowflake + dbt + Airflow

## 🚀 Project Overview

This project implements a robust, containerized ELT pipeline using the Modern Data Stack. It transforms raw TPC-H data into business-ready dimensions and facts, with a focus on production-grade features: layered data modelling, automated testing, and orchestrated scheduling via Airflow.

```mermaid
graph LR
    A[Raw Data] --> B[Snowflake Staging]
    B --> C[dbt Intermediate]
    C --> D[Business Marts]
    D --> E[BI Tool]
```

## 🛠 Tech Stack

- **Data Warehouse:** Snowflake (storage and compute)
- **Transformation:** dbt Core (v1.11.4)
- **Orchestration:** Apache Airflow (via Astronomer Cosmos)
- **Infrastructure:** Docker (containerized environment)
- **Languages:** SQL & Python

## 📐 Data Architecture & Lineage

The project follows a modular, three-layer architecture to ensure data quality and scalability:

1. **Staging Layer:** Raw data ingestion, casting, and renaming (a minimal model sketch follows this list).
2. **Intermediate Layer:** Complex business logic and shipping performance calculations.
3. **Mart Layer:** Final fact and dimension tables optimized for BI tools like Lightdash.
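To make the staging layer concrete, here is a minimal sketch of what a staging model for the TPC-H `orders` table could look like. The model and source names (`stg_tpch_orders`, `source('tpch', 'orders')`) are illustrative and not necessarily the ones used in this repository.

```sql
-- models/staging/stg_tpch_orders.sql (illustrative name)
with source as (

    select * from {{ source('tpch', 'orders') }}

),

renamed as (

    select
        o_orderkey::number          as order_key,
        o_custkey::number           as customer_key,
        o_orderstatus               as order_status,
        o_totalprice::number(12, 2) as total_price,
        o_orderdate::date           as order_date
    from source

)

select * from renamed
```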


## ⚙️ Key Engineering Features

### 1. Advanced Orchestration with Cosmos

Instead of treating dbt as a "black box," I used Astronomer Cosmos to render dbt models as native Airflow tasks. This provides:

- Task-level visibility into failures.
- The ability to run tests immediately after specific model builds (a minimal DAG sketch follows).
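The sketch below shows how such a DAG can be declared with Cosmos. The DAG id, file paths, connection id, and schema names are assumptions for illustration, not the exact values used in this repo.

```python
# dags/tpch_dbt_dag.py -- illustrative sketch; paths, conn_id and names are assumed
from datetime import datetime

from cosmos import DbtDag, ProjectConfig, ProfileConfig, ExecutionConfig
from cosmos.profiles import SnowflakeUserPasswordProfileMapping

profile_config = ProfileConfig(
    profile_name="tpch",                                   # assumed profile name
    target_name="prod",
    profile_mapping=SnowflakeUserPasswordProfileMapping(
        conn_id="snowflake_default",                       # Airflow connection (assumed)
        profile_args={"database": "TPCH_PROD", "schema": "STAGING"},
    ),
)

tpch_dag = DbtDag(
    dag_id="tpch_dbt_dag",
    project_config=ProjectConfig("/usr/local/airflow/dags/dbt/tpch"),  # assumed path
    profile_config=profile_config,
    execution_config=ExecutionConfig(
        dbt_executable_path="/usr/local/airflow/dbt_venv/bin/dbt",     # assumed path
    ),
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
```

Cosmos expands each dbt model into its own run task and attaches the model's tests right behind it, which is what gives the task-level visibility described above.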

### 2. Production-Grade Reliability

**Retries:** Tasks are configured with a 5-minute retry delay to handle transient Snowflake connection issues.

**Error Handling:** Implemented a custom `dag_failure_callback` that triggers automated email alerts upon task failure (sketched below).
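A hedged sketch of what this configuration can look like: the recipient address, retry count, and callback body are assumptions; only the 5-minute retry delay comes from the description above.

```python
# dags/callbacks.py -- illustrative sketch; recipient and retry count are assumed
from datetime import timedelta

from airflow.utils.email import send_email


def dag_failure_callback(context):
    """Send an email alert when a task instance fails."""
    ti = context["task_instance"]
    send_email(
        to="data-alerts@example.com",   # assumed recipient
        subject=f"[Airflow] {ti.dag_id}.{ti.task_id} failed",
        html_content=f"Task {ti.task_id} failed. Logs: {ti.log_url}",
    )


default_args = {
    "retries": 2,                         # retry count assumed
    "retry_delay": timedelta(minutes=5),  # the 5-minute delay mentioned above
    "on_failure_callback": dag_failure_callback,
}
```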

**Schema Governance:** Used a custom `generate_schema_name` macro to manage dynamic schema environments (Staging, Intermediate, Mart) in Snowflake. For this example, the models are written to dedicated schemas rather than all landing in the default target schema (a common form of the override is sketched after the notes below):

- Staging and intermediate models persist in `TPCH_PROD`.
- Snapshots land in the Snapshot schema, and fact and dimension tables in the Mart schema.
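Overriding dbt's built-in `generate_schema_name` macro is the standard way to take control of schema routing. The version below is the common pattern that writes each model to its configured custom schema verbatim; the macro in this repo may differ in detail.

```sql
-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}

    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}

        {{ default_schema }}

    {%- else -%}

        {# Use the custom schema (e.g. STAGING, INTERMEDIATE, MART) verbatim,
           instead of dbt's default <target_schema>_<custom_schema> prefixing #}
        {{ custom_schema_name | trim }}

    {%- endif -%}

{%- endmacro %}
```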

### 3. Automated Data Quality

Over 15 data tests are embedded in the pipeline, including:

- **Generic tests:** `unique`, `not_null`, and `relationships`.
- **Custom tests:** `dbt_utils.accepted_range` to validate shipping durations (see the example below).
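As an illustration, a typical dbt YAML definition combining these tests might look like the following; the model, column, and threshold values are assumptions, not taken from the repo.

```yaml
# models/marts/_marts.yml (illustrative names and thresholds)
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_key
        tests:
          - unique
          - not_null
      - name: customer_key
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_key
      - name: shipping_duration_days
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 180
```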

### 4. Self-Documenting Metadata

Descriptions and Primary Key definitions are maintained in YAML and persisted directly into Snowflake as object comments, ensuring the data dictionary is always accessible to analysts.
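In dbt, the `persist_docs` config is what pushes YAML descriptions into Snowflake table and column comments. A minimal sketch (the project name is assumed):

```yaml
# dbt_project.yml (excerpt; project name is assumed)
models:
  tpch_project:
    +persist_docs:
      relation: true   # write model descriptions as table/view comments
      columns: true    # write column descriptions as column comments
```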

## 🚦 Getting Started

**Clone the repository:**

```bash
git clone https://github.com/angelinauesato/tpch_snowflake_project.git
```

**Snowflake Configuration:**

Before running the pipeline, you must set up the necessary infrastructure in Snowflake:

1. Log in to your Snowflake console.
2. Open a new SQL Worksheet.
3. Open `scripts/snowflake_setup.sql` and replace `<YOUR_USER>` with your Snowflake user. You can find it by running `SELECT CURRENT_USER();`.
4. Run the contents of `scripts/snowflake_setup.sql`.
5. **Configure environment variables:** Create a `.env` file with your Snowflake credentials (account, user, password, role, warehouse); see the sketch below.
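The exact variable names depend on what `docker-compose.yml` and the dbt profile expect, so treat the following `.env` sketch as the general shape rather than the definitive list:

```
# .env (illustrative variable names and placeholder values)
SNOWFLAKE_ACCOUNT=xy12345.us-east-1
SNOWFLAKE_USER=<YOUR_USER>
SNOWFLAKE_PASSWORD=<YOUR_PASSWORD>
SNOWFLAKE_ROLE=TRANSFORM_ROLE
SNOWFLAKE_WAREHOUSE=TPCH_WH
```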

**Spin up the stack:**

```bash
docker compose up -d
```

**Access the UIs:**

- Airflow: http://localhost:8080
- dbt Docs: http://localhost:8081
