This project implements a Sports Data Warehouse for football data using a full Raw → Staging → Warehouse architecture. The goal is to ingest data from an external football API, model it using a star schema, and enable incremental loads and BI-style analytics queries.
The warehouse answers questions such as:
- Who are the top scorers?
- Which players are the most consistent?
- How do teams and nationalities perform over time?

## Project Structure

- `raw/`: JSON snapshots from API Football
- `staging/`: cleaned and incremental staging tables (DuckDB)
- `warehouse/`: star schema (dimensions + fact)
- `etl/`: SQL + Python ETL logic
- `analytics/`: BI queries and sanity checks

## Architecture

- Raw: API responses stored as JSON snapshots
- Staging:
  - Base staging tables (flattened JSON)
  - Incremental staging using `snapshot_date`
- Warehouse:
  - Dimension tables (`dim_player`, `dim_team`, `dim_league`)
  - Fact table (`fact_player_stats`)
- Incremental control via `etl_control`

## Data Model

- `dim_player`: player attributes (name, age, nationality)
- `dim_team`: team metadata
- `dim_league`: league metadata
- `fact_player_stats`: the fact table

Grain: one row per player, team, league, and season.

This grain is validated by a sanity check that looks for duplicate rows in the fact table.

Incremental logic is driven by the `etl_control` table, which records `(source_name, last_snapshot)`.

Each ETL run:

- Reads only snapshots newer than `last_snapshot`
- Merges data into dimensions and fact tables
- Updates `etl_control`

This avoids full reloads and supports historical corrections.
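
The run steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual ETL code: it uses Python's built-in `sqlite3` as a stand-in for DuckDB so it runs without extra dependencies, and the `stg_player_stats` table, its columns, the `goals` measure, and the `make_demo_db` / `run_incremental_load` helpers are hypothetical names chosen for the example. Dimension merges are left out to keep it short.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with an assumed, simplified schema."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE etl_control (
            source_name TEXT PRIMARY KEY,
            last_snapshot TEXT
        );
        CREATE TABLE stg_player_stats (
            player_key INTEGER, team_key INTEGER, league_key INTEGER,
            season INTEGER, goals INTEGER, snapshot_date TEXT
        );
        CREATE TABLE fact_player_stats (
            player_key INTEGER, team_key INTEGER, league_key INTEGER,
            season INTEGER, goals INTEGER,
            PRIMARY KEY (player_key, team_key, league_key, season)
        );
        """
    )
    return con


def run_incremental_load(con, source_name):
    """Load snapshots newer than the stored watermark; return rows processed."""
    # Read the current watermark for this source (None on the first run).
    row = con.execute(
        "SELECT last_snapshot FROM etl_control WHERE source_name = ?",
        (source_name,),
    ).fetchone()
    last_snapshot = row[0] if row else None

    # Pick up only staging rows from snapshots newer than the watermark.
    # Dates are ISO strings here, so string comparison matches date order.
    new_rows = con.execute(
        """
        SELECT player_key, team_key, league_key, season, goals, snapshot_date
        FROM stg_player_stats
        WHERE ? IS NULL OR snapshot_date > ?
        ORDER BY snapshot_date
        """,
        (last_snapshot, last_snapshot),
    ).fetchall()
    if not new_rows:
        return 0

    # Merge into the fact table: a newer snapshot overwrites the row
    # that already exists at the same grain.
    con.executemany(
        """
        INSERT INTO fact_player_stats
            (player_key, team_key, league_key, season, goals)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT (player_key, team_key, league_key, season)
        DO UPDATE SET goals = excluded.goals
        """,
        [r[:5] for r in new_rows],
    )

    # Advance the watermark to the newest snapshot just processed.
    newest = new_rows[-1][5]
    con.execute(
        "INSERT OR REPLACE INTO etl_control (source_name, last_snapshot) "
        "VALUES (?, ?)",
        (source_name, newest),
    )
    con.commit()
    return len(new_rows)
```

Because the merge is an upsert on the grain columns, re-running the load is harmless, and a later snapshot that restates an earlier season simply replaces the stored values. That is one plausible reading of "supports historical corrections".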

A grain validation query checks that the fact table has no duplicates. Conceptually:

    SELECT
        COUNT(*) AS total_rows,
        COUNT(DISTINCT player_key, team_key, league_key, season) AS distinct_events
    FROM fact_player_stats;

The actual check is implemented via a subquery due to DuckDB limitations. The grain holds when `total_rows` equals `distinct_events`.

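
A subquery form of this check could look like the sketch below. It is an assumption about how the check might be written, not the project's actual query. The `check_grain` helper and the `grain_keys` alias are hypothetical names. The function only relies on `con.execute(...).fetchone()`, so it is shown here against Python's built-in `sqlite3`; a DuckDB connection should accept the same call pattern, though that is not verified here.

```python
import sqlite3

# Count all rows, then count distinct grain keys through a subquery,
# since multi-column COUNT(DISTINCT ...) is not available.
GRAIN_CHECK_SQL = """
SELECT
    (SELECT COUNT(*) FROM fact_player_stats) AS total_rows,
    (SELECT COUNT(*) FROM (
        SELECT DISTINCT player_key, team_key, league_key, season
        FROM fact_player_stats
    ) AS grain_keys) AS distinct_events
"""


def check_grain(con):
    """Return (total_rows, distinct_events, ok); ok is True when no key repeats."""
    total_rows, distinct_events = con.execute(GRAIN_CHECK_SQL).fetchone()
    return total_rows, distinct_events, total_rows == distinct_events
```

An ETL run could call `check_grain` after the merge step and stop with an error when the third value is `False`.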
The analytics/ folder contains example BI queries, including:
- Top scorers
- Most appearances
- Goals by team
- Goals by nationality
- All-round players (goals + assists)
These queries demonstrate how the warehouse can be used for football performance analysis.
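
As an illustration, a "top scorers" query might be shaped like the sketch below. The real queries live in the `analytics/` folder and may differ. The `goals` column on the fact table, the `name` column on `dim_player`, the `player_key` join, and the `top_scorers` helper are assumptions made for the example, and `sqlite3` again stands in for DuckDB.

```python
import sqlite3

# Sum goals per player across all seasons, teams, and leagues,
# then keep the highest totals. Ties are broken by name for stable output.
TOP_SCORERS_SQL = """
SELECT p.name, SUM(f.goals) AS total_goals
FROM fact_player_stats AS f
JOIN dim_player AS p ON p.player_key = f.player_key
GROUP BY p.player_key, p.name
ORDER BY total_goals DESC, p.name
LIMIT ?
"""


def top_scorers(con, limit=10):
    """Return a list of (name, total_goals) tuples, highest total first."""
    return con.execute(TOP_SCORERS_SQL, (limit,)).fetchall()
```

The other listed queries would follow the same pattern: join the fact table to one dimension, aggregate a measure, and sort.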

## Tech Stack

- DuckDB – analytical database
- Python – orchestration
- SQL – transformations & modeling
- API Football – data source

## Future Improvements

- Dockerize the project for reproducibility
- Add automated ETL checks
- Optional BI dashboard (Power BI / Superset)

`docker compose up`

## 👤 Author
Built as a learning-focused end-to-end data engineering project.