10 hands-on experiments that reveal how databases actually work.
Each experiment poses a real question, runs a live workload against Postgres, DuckDB, and PySpark on a 50M-row dataset, and surfaces a smoking gun — the empirical proof behind the theory.
Every result maps to a lecture from CMU 15-721 Advanced Database Systems by Andy Pavlo (free online).
| # | Question | Smoking Gun | Lecture |
|---|---|---|---|
| 1 | Why does the same query run 6× faster the second time? | 18.3s cold → 3.1s hot (5.9× speedup) | L05: Buffer Pool |
| 2 | What happens when Postgres runs out of memory during a join? | 19,204 temp blocks written to disk | L06: External Merge Sort |
| 3 | Is parsing embedded JSON really that expensive? | 1 GB disk spill (STRING) vs 0 bytes (VARIANT) | L03: PAX Storage |
| 4 | Can physical row ordering alone make a query 3× faster? | 6.7× speedup from CLUSTER, 75% row groups skipped | L04: Storage Models |
| 5 | Can you lose 500K rows without getting any error? | 500,000 rows silently gone in Parquet; Delta raises ConcurrentAppendException | L13–15: OCC / MVCC |
| 6 | Why is DuckDB 25× faster than Python on the same hardware? | 265ms (DuckDB SIMD) vs 6,785ms (Python for-loop) | L10–12: Vectorized Execution |
| 7 | How much does columnar storage actually shrink your data? | 6.66× smaller, 7.24× faster scan (Parquet/Zstd vs CSV) | L03: Compression |
| 8 | Why are DuckDB window functions so much faster than pandas? | 7 window ops fused into one sorted pass | L11: Window Functions |
| 9 | Does adding a WHERE clause always make a query faster? | 1.57× speedup from double predicate pushdown | L07–08: Query Optimization |
| 10 | What happens when 90% of your data lands in one partition? | 4.5× partition imbalance, straggler thread does all the work | L09: Skew Handling |
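Each of these mechanisms can be reproduced in miniature. For instance, UC7's compression gap exists because columnar layout puts similar values next to each other, which generic compressors exploit. A stdlib-only sketch of that effect (not the actual benchmark code — the real UC7 compares Parquet/Zstd against CSV):

```python
import random
import zlib

random.seed(42)
REGIONS = ["NA", "EU", "APAC", "LATAM", "MEA"]

# Tiny stand-in for the orders table: (order_id, region, revenue) rows.
rows = [(i, random.choice(REGIONS), round(random.uniform(5, 500), 2))
        for i in range(50_000)]

# Row-oriented layout: values of different types interleaved (CSV-like).
row_bytes = "\n".join(f"{i},{r},{v}" for i, r, v in rows).encode()

# Column-oriented layout: each column stored contiguously (Parquet-like),
# so the 5 repeating region strings sit side by side.
col_bytes = (",".join(str(i) for i, _, _ in rows) + "|" +
             ",".join(r for _, r, _ in rows) + "|" +
             ",".join(str(v) for _, _, v in rows)).encode()

row_z = zlib.compress(row_bytes, 9)
col_z = zlib.compress(col_bytes, 9)
ratio = len(row_z) / len(col_z)
print(f"row-wise: {len(row_z):,} B  columnar: {len(col_z):,} B  ({ratio:.2f}x)")
```

Even with a general-purpose codec and no type-specific encodings, the columnar layout compresses smaller; Parquet's dictionary and run-length encodings widen the gap to the 6.66× measured in UC7.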
```
olap-benchmark/                 # Python benchmark layer
  data_generator.py             # generates 50M-row e-commerce dataset
  benchmarks/                   # 10 benchmark scripts (UC1–UC10)
  utils/concept_validator.py    # maps metrics → CMU 15-721 annotations
  results/                      # pre-run JSON results (read by the API)
artifacts/api-server/           # Express API
  GET  /api/benchmarks/results/:useCase   # fetch result JSON
  POST /api/benchmarks/run/:useCase       # spawn Python benchmark live
  GET  /api/benchmarks/logs/:useCase      # SSE stream of live stdout
artifacts/dashboard/            # React + Vite frontend
```

Each of the 10 experiment cards in the dashboard has:
- a "Before You Run" theory accordion
- a live run button + log stream
- a Smoking Gun panel
- a CMU 15-721 lecture link + Pavlo quote
Stack: pnpm monorepo · TypeScript · React · Express · Python 3 · DuckDB · Postgres · PySpark · Delta Lake
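The concept-validator layer is, at its core, a lookup from measured metrics to lecture annotations. A hedged sketch of that idea — the names, thresholds, and structure below are illustrative, not the actual API of `utils/concept_validator.py`:

```python
# Illustrative metric → CMU 15-721 annotation mapping.
# The real utils/concept_validator.py may differ in names and structure.
ANNOTATIONS = {
    "temp_blocks_written": ("L06: External Merge Sort",
                            "Spilling to disk means the sort/hash exceeded memory."),
    "cache_hit_ratio":     ("L05: Buffer Pool",
                            "Hot runs serve pages from the buffer pool, not disk."),
    "rowgroups_skipped":   ("L04: Storage Models",
                            "Clustering lets zone maps prune row groups."),
}

def annotate(metrics: dict) -> list[str]:
    """Return a lecture annotation for every non-zero metric we recognize."""
    notes = []
    for name, value in metrics.items():
        if name in ANNOTATIONS and value:
            lecture, why = ANNOTATIONS[name]
            notes.append(f"{lecture} — {why} (observed: {value})")
    return notes

print(annotate({"temp_blocks_written": 19_204, "cache_hit_ratio": 0}))
```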
```sh
# Install dependencies
pnpm install

# Start the API server (port 8080)
pnpm --filter @workspace/api-server run dev

# Start the dashboard (port from $PORT env var)
pnpm --filter @workspace/dashboard run dev
```

Python dependencies (benchmarks run on-demand via the API):

```sh
pip install duckdb psycopg2-binary pyspark delta-spark pandas pyarrow numpy
```

Pre-run results are included in `olap-benchmark/results/`, so the dashboard is fully functional without re-running the benchmarks.
50M-row synthetic e-commerce orders table generated by data_generator.py:
- Columns: `order_id`, `customer_id`, `product_id`, `region`, `category`, `revenue` (float), `order_date` (date), `metadata` (JSON)
- 5 regions · 20 product categories · realistic skew options
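A miniature of the generator's skew option, which is what sets up UC10's straggler partition. Parameter names here are hypothetical; the real `data_generator.py` is the source of truth:

```python
import json
import random
from collections import Counter
from datetime import date, timedelta

def generate_orders(n: int, skew: float = 0.9, seed: int = 7):
    """Yield order rows where roughly `skew` of all rows land in one region."""
    rng = random.Random(seed)
    regions = ["NA", "EU", "APAC", "LATAM", "MEA"]
    start = date(2023, 1, 1)
    for i in range(n):
        # Skewed partition key: the hot region absorbs most of the traffic.
        region = regions[0] if rng.random() < skew else rng.choice(regions[1:])
        yield {
            "order_id": i,
            "customer_id": rng.randrange(1_000),
            "product_id": rng.randrange(5_000),
            "region": region,
            "category": f"cat_{rng.randrange(20)}",
            "revenue": round(rng.uniform(5.0, 500.0), 2),
            "order_date": (start + timedelta(days=rng.randrange(365))).isoformat(),
            "metadata": json.dumps({"channel": rng.choice(["web", "app"])}),
        }

counts = Counter(row["region"] for row in generate_orders(10_000))
print(counts.most_common(1))  # the hot partition dominates
```

With `skew=0.9`, whichever worker is assigned the hot region processes ~9× the rows of its peers, which is exactly the straggler effect UC10 measures.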
| Lecture | Topic | Experiment |
|---|---|---|
| L03 | Storage Models & Compression | UC3 (VARIANT), UC7 (Compression) |
| L04 | Database Storage | UC4 (Clustering) |
| L05 | Buffer Pool Management | UC1 (Cache warmup) |
| L06 | Hash Tables & External Sort | UC2 (Memory pressure) |
| L07–08 | Query Optimization | UC9 (Predicate pushdown) |
| L09 | Join Algorithms | UC2 (Join strategies), UC10 (Skew) |
| L10–12 | Vectorized Execution & SIMD | UC6 (Vectorization) |
| L11 | Advanced Operators | UC8 (Window functions) |
| L13–15 | OCC / MVCC / Concurrency Control | UC5 (ACID integrity) |
Course: CMU 15-721 Advanced Database Systems · Instructor: Andy Pavlo
Theory before proof — every experiment follows the same arc:
- Open question ("Can you lose 500K rows without an error?")
- Theory primer (what the lecture says will happen)
- Run the workload
- Find the smoking gun
- Understand the mechanism
Databases aren't magic. They're mechanisms. This lab makes them visible.