random-droid/Database-Systems-Lab
Database Systems Lab

10 hands-on experiments that reveal how databases actually work.

Each experiment poses a real question, runs a live workload against Postgres, DuckDB, and PySpark on a 50M-row dataset, and surfaces a smoking gun — the empirical proof behind the theory.

Every result maps to a lecture from CMU 15-721 Advanced Database Systems by Andy Pavlo (free online).


The Experiments

| # | Question | Smoking Gun | Lecture |
|---|----------|-------------|---------|
| 1 | Why does the same query run 6× faster the second time? | 18.3s cold → 3.1s hot (5.9× speedup) | L05: Buffer Pool |
| 2 | What happens when Postgres runs out of memory during a join? | 19,204 temp blocks written to disk | L06: External Merge Sort |
| 3 | Is parsing embedded JSON really that expensive? | 1 GB disk spill (STRING) vs 0 bytes (VARIANT) | L03: PAX Storage |
| 4 | Can physical row ordering alone make a query 3× faster? | 6.7× speedup from CLUSTER, 75% row groups skipped | L04: Storage Models |
| 5 | Can you lose 500K rows without getting any error? | 500,000 rows silently gone in Parquet; Delta raises ConcurrentAppendException | L13–15: OCC / MVCC |
| 6 | Why is DuckDB 25× faster than Python on the same hardware? | 265ms (DuckDB SIMD) vs 6,785ms (Python for-loop) | L10–12: Vectorized Execution |
| 7 | How much does columnar storage actually shrink your data? | 6.66× smaller, 7.24× faster scan (Parquet/Zstd vs CSV) | L03: Compression |
| 8 | Why are DuckDB window functions so much faster than pandas? | 7 window ops fused into one sorted pass | L11: Window Functions |
| 9 | Does adding a WHERE clause always make a query faster? | 1.57× speedup from double predicate pushdown | L07–08: Query Optimization |
| 10 | What happens when 90% of your data lands in one partition? | 4.5× partition imbalance, straggler thread does all the work | L09: Skew Handling |
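The row-group skipping behind experiment 4 can be sketched with a toy zone map in plain Python. This is an illustration of the mechanism, not code from the benchmarks — `zone_maps`, `groups_skipped`, and the group size are all made up here:

```python
import random

GROUP_SIZE = 10_000
NUM_REGIONS = 5

# 50K rows with an integer region id, equal counts per region.
rows = [i % NUM_REGIONS for i in range(50_000)]
random.seed(42)
shuffled = rows[:]
random.shuffle(shuffled)           # unordered table
clustered = sorted(rows)           # what CLUSTER / physical ordering gives you

def zone_maps(col):
    """Min/max summary per fixed-size row group (a toy zone map)."""
    groups = [col[i:i + GROUP_SIZE] for i in range(0, len(col), GROUP_SIZE)]
    return [(min(g), max(g)) for g in groups]

def groups_skipped(col, target):
    """Row groups a scan can skip for `region == target` using only min/max."""
    return sum(1 for lo, hi in zone_maps(col) if target < lo or target > hi)

print("skipped (shuffled): ", groups_skipped(shuffled, 2))   # 0: every group spans all regions
print("skipped (clustered):", groups_skipped(clustered, 2))  # 4 of 5: only one group can hold region 2
```

With clustered data each region occupies one contiguous group, so 4 of 5 groups are skipped on min/max alone; on shuffled data every group's range covers all regions and nothing can be skipped.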

Architecture

olap-benchmark/          # Python benchmark layer
  data_generator.py      # generates 50M-row e-commerce dataset
  benchmarks/            # 10 benchmark scripts (UC1–UC10)
  utils/concept_validator.py  # maps metrics → CMU 15-721 annotations
  results/               # pre-run JSON results (read by the API)

artifacts/api-server/    # Express API
  GET  /api/benchmarks/results/:useCase   # fetch result JSON
  POST /api/benchmarks/run/:useCase       # spawn Python benchmark live
  GET  /api/benchmarks/logs/:useCase      # SSE stream of live stdout

artifacts/dashboard/     # React + Vite frontend
  10 experiment cards, each with:
    - "Before You Run" theory accordion
    - Live run button + log stream
    - Smoking Gun panel
    - CMU 15-721 lecture link + Pavlo quote

Stack: pnpm monorepo · TypeScript · React · Express · Python 3 · DuckDB · Postgres · PySpark · Delta Lake


Running Locally

# Install dependencies
pnpm install

# Start the API server (port 8080)
pnpm --filter @workspace/api-server run dev

# Start the dashboard (port from $PORT env var)
pnpm --filter @workspace/dashboard run dev

Python dependencies (benchmarks run on-demand via the API):

pip install duckdb psycopg2-binary pyspark delta-spark pandas pyarrow numpy
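Before triggering live runs through the API, it can help to confirm the modules are actually importable. A small convenience sketch, not part of the repo (note the pip-name vs import-name gaps: `psycopg2-binary` imports as `psycopg2`, `delta-spark` as `delta`):

```python
import importlib.util

DEPS = ["duckdb", "psycopg2", "pyspark", "delta", "pandas", "pyarrow", "numpy"]

def missing(deps):
    """Return the subset of module names that cannot be imported."""
    return [d for d in deps if importlib.util.find_spec(d) is None]

gaps = missing(DEPS)
print("all benchmark deps present" if not gaps else f"missing: {gaps}")
```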

Pre-run results are included in olap-benchmark/results/ so the dashboard is fully functional without re-running the benchmarks.


Dataset

50M-row synthetic e-commerce orders table generated by data_generator.py:

  • order_id, customer_id, product_id, region, category
  • revenue (float), order_date (date), metadata (JSON)
  • 5 regions · 20 product categories · realistic skew options
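The "realistic skew" option can be sketched with weighted sampling, roughly the shape a generator like data_generator.py might take. The region names and the 90/10 weights below are illustrative assumptions, not values from the repo:

```python
import random
from collections import Counter

REGIONS = ["NA", "EU", "APAC", "LATAM", "MEA"]  # illustrative names

def region_column(n, skew=False, seed=7):
    """Sample n region values; with skew, ~90% land in one partition (UC10)."""
    rng = random.Random(seed)
    weights = [0.9, 0.025, 0.025, 0.025, 0.025] if skew else None
    return rng.choices(REGIONS, weights=weights, k=n)

counts = Counter(region_column(100_000, skew=True))
print(counts.most_common(1))  # the hot partition dominates
```

Partitioning on a column generated this way reproduces the straggler effect in experiment 10: one thread gets ~90% of the rows while the rest sit idle.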

CMU 15-721 Coverage

| Lecture | Topic | Experiment |
|---------|-------|------------|
| L03 | Storage Models & Compression | UC3 (VARIANT), UC7 (Compression) |
| L04 | Database Storage | UC4 (Clustering) |
| L05 | Buffer Pool Management | UC1 (Cache warmup) |
| L06 | Hash Tables & External Sort | UC2 (Memory pressure) |
| L07–08 | Query Optimization | UC9 (Predicate pushdown) |
| L09 | Join Algorithms | UC2 (Join strategies), UC10 (Skew) |
| L10–12 | Vectorized Execution & SIMD | UC6 (Vectorization) |
| L11 | Advanced Operators | UC8 (Window functions) |
| L13–15 | OCC / MVCC / Concurrency Control | UC5 (ACID integrity) |

Course: CMU 15-721 Advanced Database Systems · Instructor: Andy Pavlo


Design Philosophy

Theory before proof — every experiment follows the same arc:

  1. Open question ("Can you lose 500K rows without an error?")
  2. Theory primer (what the lecture says will happen)
  3. Run the workload
  4. Find the smoking gun
  5. Understand the mechanism

Databases aren't magic. They're mechanisms. This lab makes them visible.
