10 hands-on experiments that reveal how databases actually work.
Each experiment poses a real question, runs a live workload against Postgres, DuckDB, and PySpark on a 50M-row dataset, and surfaces a smoking gun — the empirical proof behind the theory.
Every result maps to a lecture from CMU 15-721 Advanced Database Systems by Andy Pavlo (free online).
| # | Question | Smoking Gun | Lecture |
|---|---|---|---|
| 1 | Why does the same query run 6× faster the second time? | 18.3s cold → 3.1s hot (5.9× speedup) | L05: Buffer Pool |
| 2 | What happens when Postgres runs out of memory during a join? | 19,204 temp blocks written to disk | L06: External Merge Sort |
| 3 | Is parsing embedded JSON really that expensive? | 1 GB disk spill (STRING) vs 0 bytes (VARIANT) | L03: PAX Storage |
| 4 | Can physical row ordering alone make a query 3× faster? | 6.7× speedup from CLUSTER, 75% row groups skipped | L04: Storage Models |
| 5 | Can you lose 500K rows without getting any error? | 500,000 rows silently gone in Parquet; Delta raises ConcurrentAppendException | L13–15: OCC / MVCC |
| 6 | Why is DuckDB 25× faster than Python on the same hardware? | 265ms (DuckDB SIMD) vs 6,785ms (Python for-loop) | L10–12: Vectorized Execution |
| 7 | How much does columnar storage actually shrink your data? | 6.66× smaller, 7.24× faster scan (Parquet/Zstd vs CSV) | L03: Compression |
| 8 | Why are DuckDB window functions so much faster than pandas? | 7 window ops fused into one sorted pass | L11: Window Functions |
| 9 | Does adding a WHERE clause always make a query faster? | 1.57× speedup from double predicate pushdown | L07–08: Query Optimization |
| 10 | What happens when 90% of your data lands in one partition? | 4.5× partition imbalance, straggler thread does all the work | L09: Skew Handling |
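Each of these mechanisms can be reproduced in miniature. For instance, UC7's compression gap exists because columnar layout puts similar values next to each other, which generic compressors exploit. A stdlib-only sketch of that effect (not the actual benchmark code — the real UC7 compares Parquet/Zstd against CSV):

```python
import random
import zlib

random.seed(42)
REGIONS = ["NA", "EU", "APAC", "LATAM", "MEA"]

# Tiny stand-in for the orders table: (order_id, region, revenue) rows.
rows = [(i, random.choice(REGIONS), round(random.uniform(5, 500), 2))
        for i in range(50_000)]

# Row-oriented layout: values of different types interleaved (CSV-like).
row_bytes = "\n".join(f"{i},{r},{v}" for i, r, v in rows).encode()

# Column-oriented layout: each column stored contiguously (Parquet-like),
# so the 5 repeating region strings sit side by side.
col_bytes = (",".join(str(i) for i, _, _ in rows) + "|" +
             ",".join(r for _, r, _ in rows) + "|" +
             ",".join(str(v) for _, _, v in rows)).encode()

row_z = zlib.compress(row_bytes, 9)
col_z = zlib.compress(col_bytes, 9)
ratio = len(row_z) / len(col_z)
print(f"row-wise: {len(row_z):,} B  columnar: {len(col_z):,} B  ({ratio:.2f}x)")
```

Even with a general-purpose codec and no type-specific encodings, the columnar layout compresses smaller; Parquet's dictionary and run-length encodings widen the gap to the 6.66× measured in UC7.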
```
olap-benchmark/                 # Python benchmark layer
  data_generator.py             # generates 50M-row e-commerce dataset
  benchmarks/                   # 10 benchmark scripts (UC1–UC10)
  utils/concept_validator.py    # maps metrics → CMU 15-721 annotations
  results/                      # pre-run JSON results (read by the API)
artifacts/api-server/           # Express API
  GET  /api/benchmarks/results/:useCase   # fetch result JSON
  POST /api/benchmarks/run/:useCase       # spawn Python benchmark live
  GET  /api/benchmarks/logs/:useCase      # SSE stream of live stdout
artifacts/dashboard/            # React + Vite frontend
```

Each of the 10 experiment cards in the dashboard has:
- a "Before You Run" theory accordion
- a live run button + log stream
- a Smoking Gun panel
- a CMU 15-721 lecture link + Pavlo quote
Stack: pnpm monorepo · TypeScript · React · Express · Python 3 · DuckDB · Postgres · PySpark · Delta Lake
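The concept-validator layer is, at its core, a lookup from measured metrics to lecture annotations. A hedged sketch of that idea — the names, thresholds, and structure below are illustrative, not the actual API of `utils/concept_validator.py`:

```python
# Illustrative metric → CMU 15-721 annotation mapping.
# The real utils/concept_validator.py may differ in names and structure.
ANNOTATIONS = {
    "temp_blocks_written": ("L06: External Merge Sort",
                            "Spilling to disk means the sort/hash exceeded memory."),
    "cache_hit_ratio":     ("L05: Buffer Pool",
                            "Hot runs serve pages from the buffer pool, not disk."),
    "rowgroups_skipped":   ("L04: Storage Models",
                            "Clustering lets zone maps prune row groups."),
}

def annotate(metrics: dict) -> list[str]:
    """Return a lecture annotation for every non-zero metric we recognize."""
    notes = []
    for name, value in metrics.items():
        if name in ANNOTATIONS and value:
            lecture, why = ANNOTATIONS[name]
            notes.append(f"{lecture} — {why} (observed: {value})")
    return notes

print(annotate({"temp_blocks_written": 19_204, "cache_hit_ratio": 0}))
```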
```sh
# Install dependencies
pnpm install

# Start the API server (port 8080)
pnpm --filter @workspace/api-server run dev

# Start the dashboard (port from $PORT env var)
pnpm --filter @workspace/dashboard run dev
```

Python dependencies (benchmarks run on-demand via the API):

```sh
pip install duckdb psycopg2-binary pyspark delta-spark pandas pyarrow numpy
```

Pre-run results are included in `olap-benchmark/results/`, so the dashboard is fully functional without re-running the benchmarks.
50M-row synthetic e-commerce orders table generated by data_generator.py:
- Columns: `order_id`, `customer_id`, `product_id`, `region`, `category`, `revenue` (float), `order_date` (date), `metadata` (JSON)
- 5 regions · 20 product categories · realistic skew options
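A miniature of the generator's skew option, which is what sets up UC10's straggler partition. Parameter names here are hypothetical; the real `data_generator.py` is the source of truth:

```python
import json
import random
from collections import Counter
from datetime import date, timedelta

def generate_orders(n: int, skew: float = 0.9, seed: int = 7):
    """Yield order rows where roughly `skew` of all rows land in one region."""
    rng = random.Random(seed)
    regions = ["NA", "EU", "APAC", "LATAM", "MEA"]
    start = date(2023, 1, 1)
    for i in range(n):
        # Skewed partition key: the hot region absorbs most of the traffic.
        region = regions[0] if rng.random() < skew else rng.choice(regions[1:])
        yield {
            "order_id": i,
            "customer_id": rng.randrange(1_000),
            "product_id": rng.randrange(5_000),
            "region": region,
            "category": f"cat_{rng.randrange(20)}",
            "revenue": round(rng.uniform(5.0, 500.0), 2),
            "order_date": (start + timedelta(days=rng.randrange(365))).isoformat(),
            "metadata": json.dumps({"channel": rng.choice(["web", "app"])}),
        }

counts = Counter(row["region"] for row in generate_orders(10_000))
print(counts.most_common(1))  # the hot partition dominates
```

With `skew=0.9`, whichever worker is assigned the hot region processes ~9× the rows of its peers, which is exactly the straggler effect UC10 measures.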
| Lecture | Topic | Experiment |
|---|---|---|
| L03 | Storage Models & Compression | UC3 (VARIANT), UC7 (Compression) |
| L04 | Database Storage | UC4 (Clustering) |
| L05 | Buffer Pool Management | UC1 (Cache warmup) |
| L06 | Hash Tables & External Sort | UC2 (Memory pressure) |
| L07–08 | Query Optimization | UC9 (Predicate pushdown) |
| L09 | Join Algorithms | UC2 (Join strategies), UC10 (Skew) |
| L10–12 | Vectorized Execution & SIMD | UC6 (Vectorization) |
| L11 | Advanced Operators | UC8 (Window functions) |
| L13–15 | OCC / MVCC / Concurrency Control | UC5 (ACID integrity) |
Course: CMU 15-721 Advanced Database Systems · Instructor: Andy Pavlo
Theory before proof — every experiment follows the same arc:
- Open question ("Can you lose 500K rows without an error?")
- Theory primer (what the lecture says will happen)
- Run the workload
- Find the smoking gun
- Understand the mechanism
Databases aren't magic. They're mechanisms. This lab makes them visible.