This project implements a robust, containerized ELT pipeline using the Modern Data Stack. It transforms raw TPC-H data into business-ready dimensions and facts. The focus of this project was on implementing production-grade features: layered data modelling, automated testing, and orchestrated scheduling via Airflow.
```mermaid
graph LR
    A[Raw Data] --> B[Snowflake Staging]
    B --> C[dbt Intermediate]
    C --> D[Business Marts]
    D --> E[BI Tool]
```
- Data Warehouse: Snowflake (Storage and Compute)
- Transformation: dbt Core (v1.11.4)
- Orchestration: Apache Airflow (via Astronomer Cosmos)
- Infrastructure: Docker (Containerized Environment)
- Language: SQL & Python
The project follows a modular, three-layer architecture to ensure data quality and scalability:
1. Staging Layer: Raw data ingestion, casting, and renaming.
2. Intermediate Layer: Complex business logic and shipping performance calculations.
3. Mart Layer: Final Fact and Dimension tables optimized for BI tools like Lightdash.
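In dbt, this layering maps naturally to folder-level configuration in dbt_project.yml. A minimal sketch, assuming illustrative project and folder names (the actual names may differ):

```yaml
models:
  tpch_snowflake_project:   # assumed dbt project name
    staging:
      +schema: staging      # resolved by the custom generate_schema_name macro described below
      +materialized: view
    intermediate:
      +schema: intermediate
      +materialized: view
    marts:
      +schema: mart
      +materialized: table  # marts are materialized as tables for BI query performance
```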
Instead of treating dbt as a "black box," I used Astronomer Cosmos to render dbt models as native Airflow tasks. This provides:

- Task-level visibility into failures.
- The ability to run tests immediately after specific model builds.
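A minimal sketch of what this looks like with Cosmos; the paths, connection ID, and profile names here are illustrative, not the project's actual values:

```python
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig
from cosmos.profiles import SnowflakeUserPasswordProfileMapping

profile_config = ProfileConfig(
    profile_name="tpch",              # assumed dbt profile name
    target_name="prod",               # assumed target
    profile_mapping=SnowflakeUserPasswordProfileMapping(
        conn_id="snowflake_default",  # assumed Airflow connection ID
        profile_args={"database": "TPCH_PROD", "schema": "STAGING"},
    ),
)

# Cosmos expands the dbt project into one Airflow task per model, with each
# model's tests scheduled immediately after the model builds.
tpch_dag = DbtDag(
    dag_id="tpch_elt",
    project_config=ProjectConfig("/usr/local/airflow/dags/dbt/tpch"),  # assumed path
    profile_config=profile_config,
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
```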

- Retries: Tasks are configured with a 5-minute retry delay to handle transient Snowflake connection issues.
- Error Handling: A custom dag_failure_callback triggers automated email alerts upon task failure (see the first sketch after this list).
- Schema Governance: A custom generate_schema_name macro manages the dynamic schema environments (Staging, Intermediate, Mart) in Snowflake (see the second sketch).
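A minimal sketch of the retry and alerting configuration, using Airflow's built-in send_email helper (which requires SMTP to be configured); notify_failure stands in for the project's custom dag_failure_callback, and the recipient address and retry count are placeholders:

```python
from datetime import timedelta

from airflow.utils.email import send_email


def notify_failure(context):
    """Hypothetical failure callback: email an alert with a link to the logs."""
    ti = context["task_instance"]
    send_email(
        to=["data-team@example.com"],  # placeholder recipient
        subject=f"Airflow failure: {ti.dag_id}.{ti.task_id}",
        html_content=f"Task failed at {context['ts']}. Logs: {ti.log_url}",
    )


# Passed to the DAG (or DbtDag) so every task inherits the behaviour.
default_args = {
    "retries": 2,  # assumed count; the 5-minute delay matches the text above
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": notify_failure,
}
```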
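The generate_schema_name override follows dbt's documented customization pattern. A minimal sketch (the project's actual macro may differ): with this version, a model configured with schema='mart' builds into the MART schema instead of dbt's default <target_schema>_mart.

```sql
-- macros/generate_schema_name.sql (sketch)
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if custom_schema_name is none -%}
        {{ target.schema }}              {# no override: fall back to the target schema #}
    {%- else -%}
        {{ custom_schema_name | trim }}  {# use the configured schema name as-is #}
    {%- endif -%}
{%- endmacro %}
```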
For this example, I created dedicated schemas for the different model layers.

- Staging and Intermediate models persist in TPCH_PROD:

- Snapshots persist in the Snapshot schema, and Fact and Dim tables in the Mart schema:
Over 15 data tests are embedded in the pipeline (see the example after this list), including:

- Generic tests: unique, not_null, and relationships.
- Custom tests: accepted_range from dbt_utils to validate shipping durations.
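An illustrative schema.yml excerpt showing how such tests are declared; the model and column names here are assumptions, not copied from the project:

```yaml
version: 2

models:
  - name: fct_orders          # hypothetical mart model
    columns:
      - name: order_key
        tests:
          - unique
          - not_null
      - name: customer_key
        tests:
          - relationships:
              to: ref('dim_customers')   # hypothetical dimension
              field: customer_key
      - name: shipping_duration_days     # hypothetical column
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 365
```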
Descriptions and primary key definitions are maintained in YAML and persisted directly into Snowflake as object comments, ensuring the data dictionary is always accessible to analysts.
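Persisting those comments is a small dbt configuration; a sketch of the relevant dbt_project.yml block (the project name is an assumption):

```yaml
models:
  tpch_snowflake_project:
    +persist_docs:
      relation: true   # table/view descriptions become Snowflake object comments
      columns: true    # column descriptions become column comments
```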
Clone the repository:

```bash
git clone https://github.com/angelinauesato/tpch_snowflake_project.git
```
Before running the pipeline, you must set up the necessary infrastructure in Snowflake:

- Log in to your Snowflake console and open a new SQL Worksheet.
- Open scripts/snowflake_setup.sql and replace <YOUR_USER> with your Snowflake username (run SELECT CURRENT_USER(); if you are unsure of it).
- Run the contents of scripts/snowflake_setup.sql. A sketch of the kind of statements involved follows.
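Illustrative only; the authoritative statements live in scripts/snowflake_setup.sql, and the warehouse and role names below are assumptions:

```sql
USE ROLE ACCOUNTADMIN;

CREATE WAREHOUSE IF NOT EXISTS TPCH_WH WITH WAREHOUSE_SIZE = 'XSMALL';
CREATE DATABASE IF NOT EXISTS TPCH_PROD;
CREATE ROLE IF NOT EXISTS TPCH_ROLE;

GRANT USAGE ON WAREHOUSE TPCH_WH TO ROLE TPCH_ROLE;
GRANT ALL ON DATABASE TPCH_PROD TO ROLE TPCH_ROLE;
GRANT ROLE TPCH_ROLE TO USER "<YOUR_USER>";  -- value from SELECT CURRENT_USER();
```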
- Configure Environment Variables: Create a .env file with your Snowflake credentials (Account, User, Password, Role, Warehouse). An illustrative example follows.
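A sample .env; the variable names here are assumptions and should match whatever your docker-compose.yml and Airflow connection actually expect:

```env
SNOWFLAKE_ACCOUNT=xy12345.us-east-1
SNOWFLAKE_USER=YOUR_USER
SNOWFLAKE_PASSWORD=your_password
SNOWFLAKE_ROLE=TPCH_ROLE
SNOWFLAKE_WAREHOUSE=TPCH_WH
```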
Spin up the stack:
```bash
docker compose up -d
```
Access the UIs:

- Airflow: http://localhost:8080
- dbt Docs: http://localhost:8081