F1-Pulse is a production-grade data engineering pipeline that ingests, transforms, and analyzes Formula 1 racing data using the Medallion Architecture (Bronze → Silver → Gold).
The pipeline pulls raw JSON responses from the OpenF1 REST API, processes them through three Delta Lake layers with automated data quality gates at each layer boundary, and delivers actionable driver and constructor performance insights. It is currently configured to process the 2025 Abu Dhabi Grand Prix, showcasing retry-safe API ingestion, schema drift detection, window function analytics, multi-table Gold outputs, and a fully orchestrated Databricks Workflow.
| Layer | Technology |
|---|---|
| Platform | Databricks (Serverless Compute) |
| Language | Python (PySpark + Pandas) |
| Storage | Delta Lake (ACID transactions, Time Travel, Schema Enforcement) |
| Governance | Unity Catalog (3-layer namespace: f1_project.bronze/silver/gold) |
| Data Quality | Soda Core (soda-core-spark-df) with SodaCL checks at Silver and Gold |
| Orchestration | Databricks Workflows (5-task linear pipeline with DQ gates) |
| Testing | pytest (unit + integration tests across all modules) |
| Version Control | GitHub (integrated with Databricks Repos) |
| Data Source | OpenF1 REST API (free, real-time F1 telemetry) |
```
           OpenF1 REST API
                  │
                  ▼
┌───────────────────────────────────────────┐
│ BRONZE LAYER (raw_sessions, raw_laps,     │
│ raw_drivers, raw_telemetry)               │
│ • Retry-safe ingestion (3 attempts)       │
│ • Pandas middle-man for type safety       │
│ • ingested_at + source_url audit cols     │
└─────────────────┬─────────────────────────┘
                  │
                  ▼
┌───────────────────────────────────────────┐
│ SILVER TRANSFORMATION                     │
│ (cleaned_sessions, enriched_laps)         │
│ • Schema drift detection                  │
│ • Type casting                            │
│ • Driver deduplication before join        │
│ • is_valid_lap quality flag               │
└─────────────────┬─────────────────────────┘
                  │
                  ▼
┌───────────────────────────────────────────┐
│ SILVER QUALITY GATE (Soda Core)           │
│ • Row count & freshness checks            │
│ • Primary key integrity                   │
│ • Valid lap duration range                │
│ • Null checks on all key columns          │
└─────────────────┬─────────────────────────┘
                  │
                  ▼
┌───────────────────────────────────────────┐
│ GOLD ANALYTICS                            │
│ (driver_performance,                      │
│  constructor_standings, lap_progression)  │
│ • Window functions & rankings             │
│ • Lap consistency (std deviation)         │
│ • Rolling 5-lap average (time-series)     │
└─────────────────┬─────────────────────────┘
                  │
                  ▼
┌───────────────────────────────────────────┐
│ GOLD QUALITY GATE (Soda Core)             │
│ • Rank integrity (starts at 1, unique)    │
│ • Lap time sanity bounds                  │
│ • Null checks on all metric columns       │
│ • Audit column population                 │
└───────────────────────────────────────────┘
```
The pipeline runs as a 5-task Databricks job with linear dependencies. A failed DQ gate raises an exception and halts all downstream tasks.
Ingest Bronze → Transform Silver → Validate Silver → Build Gold → Validate Gold
| Task | Notebook | Description |
|---|---|---|
| Ingest Bronze | `01_Bronze_Ingestion` | Retry-safe OpenF1 API ingestion |
| Transform Silver | `02_Silver_Transformation` | Schema validation, enrichment, quality flagging |
| Validate Silver | `03_Silver_Quality` | Soda DQ gate; halts pipeline on failure |
| Build Gold | `04_Gold_Analytics` | Leaderboard, constructor standings, lap progression |
| Validate Gold | `05_Gold_Quality` | Soda DQ gate; halts pipeline on failure |
The Databricks Workflow definition is exported as JSON in `workflow/`; it can be imported directly into any Databricks workspace via Workflows → Create Job → Import.
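For orientation, a trimmed sketch of such a job definition in the Databricks Jobs JSON format is shown below. The task keys and notebook paths are illustrative, not copied from the exported `workflow/` file:

```json
{
  "name": "F1-Pulse | Medallion Pipeline",
  "tasks": [
    {"task_key": "ingest_bronze",
     "notebook_task": {"notebook_path": "notebooks/01_Bronze_Ingestion"}},
    {"task_key": "transform_silver",
     "depends_on": [{"task_key": "ingest_bronze"}],
     "notebook_task": {"notebook_path": "notebooks/02_Silver_Transformation"}},
    {"task_key": "validate_silver",
     "depends_on": [{"task_key": "transform_silver"}],
     "notebook_task": {"notebook_path": "notebooks/03_Silver_Quality"}},
    {"task_key": "build_gold",
     "depends_on": [{"task_key": "validate_silver"}],
     "notebook_task": {"notebook_path": "notebooks/04_Gold_Analytics"}},
    {"task_key": "validate_gold",
     "depends_on": [{"task_key": "build_gold"}],
     "notebook_task": {"notebook_path": "notebooks/05_Gold_Quality"}}
  ]
}
```

The linear `depends_on` chain is what makes a failed DQ gate halt everything downstream: Databricks skips any task whose upstream dependency did not succeed.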
- Fetches Sessions, Laps, Drivers, and Telemetry from the OpenF1 API
- Retry logic: 3 attempts with a 5s delay and a 30s timeout per request, resilient to flaky free-tier APIs
- Smart type handling: only object/mixed columns are stringified; numeric types (lap times, speeds) are preserved natively
- Safe session resolution: filters by `session_type = Race` before selecting the latest session key, with no fragile `[-1]` index assumptions
- Audit columns: `ingested_at` timestamp + `source_url` on every table
- Idempotent: `overwrite` + `overwriteSchema` mode handles re-runs and schema evolution
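The retry policy above can be sketched as a generic wrapper. This is a minimal illustration, not the project's actual `api_client.py`; the function name and parameters are hypothetical:

```python
import time

def fetch_with_retry(fetch_fn, attempts=3, delay=5):
    """Call fetch_fn until it succeeds, retrying on any exception.

    Mirrors the Bronze ingestion policy: 3 attempts with a 5s pause
    between them (the real client also sets a 30s request timeout).
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch_fn()
        except Exception as exc:  # flaky free-tier APIs fail transiently
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)
    # All attempts exhausted: fail loudly rather than silently
    raise RuntimeError(f"All {attempts} attempts failed") from last_error
```

A caller would pass the request itself, e.g. `fetch_with_retry(lambda: requests.get(url, timeout=30).json())`, keeping the retry policy in one place.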
- Schema drift guard: `assert_columns()` validates that expected fields exist before any transformation, catching API changes immediately
- Proper type casting: uses Spark's `IntegerType`, `FloatType`, and `BooleanType` explicitly, with no inference guesswork
- Driver deduplication: `dropDuplicates(["driver_number"])` before the join prevents row multiplication from multi-segment API responses
- Quality flagging: introduces an `is_valid_lap` boolean column that flags pit-out laps and anomalous durations without dropping data, letting downstream layers decide
- Carries `country_code` and `headshot_url` forward for potential dashboard use
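A schema drift guard like `assert_columns()` can be sketched as a simple set comparison. This is a pure-Python illustration under assumed signatures, not the project's actual helper:

```python
def assert_columns(actual_columns, expected_columns, table_name="table"):
    """Fail fast if an upstream schema change removed expected fields.

    Run before any Silver transformation so an OpenF1 API change
    surfaces as an immediate, named error instead of a silent break.
    """
    missing = sorted(set(expected_columns) - set(actual_columns))
    if missing:
        raise ValueError(
            f"Schema drift in {table_name}: missing columns {missing}"
        )

# With a Spark DataFrame you would pass df.columns, e.g.:
# assert_columns(df.columns, ["driver_number", "lap_number"], "raw_laps")
```

Extra columns are deliberately tolerated here; only missing expected fields abort the run, which matches the "detect drift without blocking additive changes" intent.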
- Runs 11 SodaCL checks against `cleaned_sessions` and `enriched_laps`
- Validates row counts, session type constraints, primary key uniqueness, null integrity, lap duration bounds, and audit column population
- Raises an exception on any failure; downstream Gold tasks are skipped automatically by Databricks Workflows
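A few of the Silver checks might look like the following SodaCL fragment. Column names and thresholds here are illustrative, not copied from `soda/checks_silver.yml`:

```yaml
# Illustrative SodaCL fragment; the real checks live in soda/checks_silver.yml
checks for enriched_laps:
  - row_count > 0
  - missing_count(driver_number) = 0              # null integrity on a key column
  - duplicate_count(driver_number, lap_number) = 0  # primary key uniqueness
  - min(lap_duration) > 60                        # lap duration bounds (seconds,
  - max(lap_duration) < 200                       #  assumed thresholds)
```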
Produces three purpose-built Delta tables from a single Silver read:
| Table | Description |
|---|---|
| `driver_performance_metrics` | Per-driver leaderboard: fastest lap, avg pace, median pace, lap consistency (std dev), position rank |
| `constructor_standings` | Team-level summary: best lap, avg team pace, total laps, constructor rank |
| `lap_progression` | Lap-by-lap time series with rolling 5-lap average, ready for line charts |
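The rolling 5-lap average behind `lap_progression` can be sketched in pure Python; in the pipeline itself it would be a Spark window such as `avg("lap_duration").over(Window.partitionBy("driver_number").orderBy("lap_number").rowsBetween(-4, 0))`:

```python
def rolling_average(values, window=5):
    """Trailing rolling mean, as used for the 5-lap pace trend.

    Output i averages values[max(0, i-window+1) .. i], so the first
    laps use a partial window (matching rowsBetween(-4, 0) in Spark).
    """
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```

A partial window at the start is a deliberate choice: it yields a value for every lap instead of nulls for laps 1-4, which keeps line charts unbroken.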
- Runs 30+ SodaCL checks across all three Gold tables
- Validates rank integrity, metric nulls, lap time sanity bounds, rolling average consistency, and audit column population
- Raises an exception on any failure, preventing corrupted Gold data from reaching dashboards
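The rank-integrity idea can be expressed in a short SodaCL fragment; the column names below are illustrative, not copied from `soda/checks_gold.yml`:

```yaml
# Illustrative fragment; the real checks live in soda/checks_gold.yml
checks for driver_performance_metrics:
  - min(position_rank) = 1            # ranking starts at 1
  - duplicate_count(position_rank) = 0  # and is unique per driver
  - missing_count(fastest_lap_s) = 0    # no null metrics (column name assumed)
```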
All pipeline logic is encapsulated in reusable modules under `modules/`, keeping notebooks thin and the logic testable in isolation.
| Module | Description |
|---|---|
| `api_client.py` | OpenF1 REST API client with retry logic |
| `f1_helpers.py` | Shared F1 domain helpers used across layers |
| `bronze_helpers.py` | Bronze ingestion helpers |
| `silver_helpers.py` | Silver layer helpers |
| `silver_transforms.py` | Silver transformation logic |
| `gold_helpers.py` | Gold layer helpers |
| `gold_transforms.py` | Gold analytics transformation logic |
Tests are executed directly from Databricks notebooks rather than the command line, as the test suite requires a live Spark session and access to the workspace filesystem.
| Notebook | Description |
|---|---|
| `notebooks/utilities/Run_Unit_Tests` | Runs all unit tests across all modules |
| `notebooks/utilities/Run_Integration_Tests` | Runs integration tests against live Delta tables |
Run `00_Setup_Catalog` and at least one full pipeline pass before executing integration tests, as they depend on the Silver and Gold tables existing in the metastore.
| Rank | Driver | Team | Fastest Lap (s) | Avg Pace (s) | Consistency (σ) |
|---|---|---|---|---|---|
| 1 | Charles LECLERC | Ferrari | 86.720 | 88.790 | 1.23 |
| 2 | Oscar PIASTRI | McLaren | 86.760 | 88.980 | 1.31 |
| 3 | Max VERSTAPPEN | Red Bull Racing | 87.620 | 88.750 | 1.18 |
| 4 | Kimi ANTONELLI | Mercedes | 88.020 | 90.200 | 1.67 |
| Rank | Team | Best Lap (s) | Avg Team Pace (s) |
|---|---|---|---|
| 1 | Ferrari | 86.720 | 88.790 |
| 2 | McLaren | 86.760 | 88.980 |
| 3 | Red Bull Racing | 87.620 | 88.750 |
| 4 | Mercedes | 88.020 | 90.200 |
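The team-level aggregation behind `constructor_standings` can be sketched in pure Python. The real job uses a Spark `groupBy`; the function name, output column names, and the assumption that teams are ranked by best lap are illustrative:

```python
from collections import defaultdict
from statistics import mean

def constructor_standings(laps):
    """Aggregate (team, lap_time_s) pairs into ranked team summaries.

    Mirrors the Gold groupBy: best lap = min, avg team pace = mean,
    total laps = count; teams are then ranked (assumed: by best lap).
    """
    by_team = defaultdict(list)
    for team, lap_time in laps:
        by_team[team].append(lap_time)
    rows = [
        {"team": t, "best_lap_s": min(v),
         "avg_pace_s": round(mean(v), 3), "total_laps": len(v)}
        for t, v in by_team.items()
    ]
    rows.sort(key=lambda r: r["best_lap_s"])  # lower best lap = better
    for rank, row in enumerate(rows, start=1):
        row["rank"] = rank
    return rows
```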
- Retry-safe ingestion: configurable retries with delay and timeout, so there are no silent failures on flaky APIs
- Schema drift detection: assertion guards at every layer boundary catch upstream API changes before they corrupt downstream tables
- Data quality gates: Soda Core checks at the Silver and Gold boundaries; the pipeline halts automatically on failure
- Quality flagging over hard-dropping: `is_valid_lap` preserves all data in Silver; Gold filters only when computing metrics
- Window function analytics: `dense_rank()` for leaderboards, `stddev()` for consistency scoring, rolling averages for time-series
- Idempotent pipelines: `overwrite` + `overwriteSchema` on all writes makes them re-runnable with no side effects
- Structured logging: timestamped log output with row counts and quality summaries at every step
- Single source of truth config: all environment constants (`CATALOG`, `SCHEMA`, `YEAR`, thresholds) centralised in `config/config.py` and referenced by notebooks, modules, and Soda checks alike
- Separation of concerns: code quality validated by pytest in CI/CD; data quality validated by Soda in the pipeline
- Full orchestration: end-to-end pipeline scheduled via Databricks Workflows with task-level dependency and failure handling
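The `dense_rank()` semantics used for the leaderboards can be illustrated in pure Python; the pipeline itself uses Spark's `dense_rank()` over an ordered window:

```python
def dense_rank(values):
    """Dense ranking over numeric values (lower = better).

    Ties share a rank and no rank numbers are skipped, matching
    Spark's dense_rank() over an ascending orderBy.
    """
    ranks = {v: r for r, v in enumerate(sorted(set(values)), start=1)}
    return [ranks[v] for v in values]
```

With `rank()` instead, two drivers tied on fastest lap would consume two rank slots and the next driver would be ranked 3; `dense_rank()` keeps the sequence gap-free, which reads better on a leaderboard.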
```
F1-Pulse/
├── notebooks/
│   ├── utilities/
│   │   ├── Run_Unit_Tests.py
│   │   └── Run_Integration_Tests.py
│   ├── 00_Setup_Catalog.py
│   ├── 01_Bronze_Ingestion.py
│   ├── 02_Silver_Transformation.py
│   ├── 03_Silver_Quality.py
│   ├── 04_Gold_Analytics.py
│   └── 05_Gold_Quality.py
├── modules/
│   ├── api_client.py
│   ├── bronze_helpers.py
│   ├── f1_helpers.py
│   ├── silver_helpers.py
│   ├── silver_transforms.py
│   ├── gold_helpers.py
│   └── gold_transforms.py
├── tests/
│   ├── unit_tests/
│   │   ├── conftest.py
│   │   ├── path_setup.py
│   │   ├── test_api_client.py
│   │   ├── test_bronze_helpers.py
│   │   ├── test_f1_helpers.py
│   │   ├── test_silver_helpers.py
│   │   ├── test_silver_transforms.py
│   │   ├── test_gold_helpers.py
│   │   └── test_gold_transforms.py
│   └── integration_tests/
│       ├── conftest.py
│       ├── test_integration_bronze_silver.py
│       └── test_integration_silver_gold.py
├── soda/
│   ├── checks_silver.yml
│   └── checks_gold.yml
├── config/
│   └── config.py
├── workflow/
│   └── F1-Pulse_Medallion_Pipeline.json
└── README.md
```
- Clone the repo and sync it to your Databricks workspace via Repos
- Run `00_Setup_Catalog.py` once to initialise the `f1_project` catalog and schemas
- Trigger the `F1-Pulse | Medallion Pipeline` Databricks job for all subsequent runs
- The job runs tasks in sequence; any DQ failure halts downstream tasks automatically
No API key required β OpenF1 is a free, open REST API.
Built with ❤️ for Formula 1 & Data Engineering