A local demonstration of Apache Iceberg capabilities using Docker, structured as a minimal data mesh with domain-oriented data ownership.
The demo maps each data mesh principle to a concrete component:
| Principle | Implementation |
|---|---|
| Domain Ownership | Separate namespaces: customers_domain, orders_domain |
| Data as a Product | YAML manifests defining ownership, SLAs, quality checks |
| Self-Serve Platform | Iceberg + Trino + MinIO as shared infrastructure |
| GenAI Tool Support | MCP Interface for AI-assisted queries |
| Federated Governance | Not Yet Implemented |
This project demonstrates:
- Apache Iceberg table format with partitioning
- Data Mesh principles: domain ownership, data as a product
- Trino for SQL queries
- PyIceberg for data generation
- MinIO for S3-compatible storage (host filesystem mounted)
- Apache Superset for visualization
- Time Travel and Snapshot Management features
                              +-------------------------+
                              |        Superset         |
                              |     (dashboard UI)      |
                              +-------------------------+
                                           |
                                           |
                                           v
+-------------------+         +-------------------------+
|  python scripts   |         |          Trino          |
|     (writes)      |         |     (query engine)      |
+-------------------+         +-------------------------+
           \                              /
            \                            /
             \                          /
              \                        /
               v                      v
      +---------------------------------------+
      |                Iceberg                |--+
      |   (catalog + data format + snapshot   |  |
      |         management + etc...)          |  |
      +---------------------------------------+  |
        +----------------------------------------+
                          |
                          |
                          v
             +-------------------------+
             |          MinIO          |--+
             |   (S3 storage engine)   |  |
             +-------------------------+  |
               +--------------------------+
| Service | Port | Description |
|---|---|---|
| MinIO Console | 9001 | Object storage UI (admin/password) |
| MinIO API | 9000 | S3-compatible API |
| REST Catalog | 8181 | Iceberg catalog |
| Trino | 8080 | Query engine |
| Superset | 8088 | Visualization (admin/admin) |
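As a rough illustration of what a readiness probe for these ports could look like, here is a standard-library sketch. It is not the project's `make verify-infra` or `make wait-healthy` implementation; the helper names are hypothetical and the host is assumed to be localhost. An open port only shows that something is listening, not that the service behind it is fully initialized.

```python
import socket

# Ports taken from the service table above; the host is assumed to be localhost.
SERVICES = {
    "MinIO API": 9000,
    "MinIO Console": 9001,
    "REST Catalog": 8181,
    "Trino": 8080,
    "Superset": 8088,
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "localhost") -> dict[str, bool]:
    """Map each service name to whether its port currently accepts connections."""
    return {name: is_port_open(host, port) for name, port in SERVICES.items()}
```

Calling `check_services()` after `make up` returns a dictionary such as service name to `True` or `False`, which can be polled in a loop until every value is `True`.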
- Docker and Docker Compose
- Python 3.13+
- uv (Python package manager)
- Make
# Install Python dependencies
uv sync
# Start all services
make up
# Wait for services to be ready, then verify
make verify-infra
# Create tables
make create-tables
# Generate small dataset (1K customers, 10K orders)
make generate-small
# Open Trino CLI
make trino
# Access Superset at http://localhost:8088
For a comprehensive step-by-step guide through all features, see DEMO.md.
| Target | Description |
|---|---|
| `make up` | Start all Docker services |
| `make down` | Stop all services |
| `make logs` | Tail service logs |
| `make ps` | Show service status |
| `make clean` | Remove all data and volumes |
| `make wait-healthy` | Wait for all services to be ready |
| Target | Description |
|---|---|
| `make trino` | Open Trino CLI |
| `make create-tables` | Initialize Iceberg tables |
| `make setup-superset` | Configure Superset with Trino |
| `make generate-small` | Generate 1K customers, 10K orders |
| `make generate-full` | Generate 1M customers, 10M orders |
| Target | Description |
|---|---|
| `make demo-time-travel` | Run time travel demonstration |
| `make demo-snapshots` | Run snapshot management demo |
| Target | Description |
|---|---|
| `make load-test-writes` | Run write throughput tests |
| `make load-test-reads` | Run read load tests (Locust Web UI) |
| Target | Description |
|---|---|
| `make verify-infra` | Verify infrastructure is up |
| `make verify-tables` | Verify tables are created |
| `make verify-data` | Verify data generation |
| `make verify-queries` | Verify queries and Superset |
| `make verify-iceberg` | Verify Iceberg features |
| `make verify-loadtest` | Verify load testing tools |
| `make verify-all` | Run all verifications |
erDiagram
    customers {
        int customer_id PK
        string email
        string first_name
        string last_name
        timestamp created_at
        string country "partition key"
    }
    orders {
        int order_id PK
        int customer_id FK
        timestamp order_date "partition key (month)"
        string status "pending|shipped|delivered|cancelled"
        decimal total_amount
        int items_count
    }
    customers ||--o{ orders : "has many"
Each entity is described by a data product manifest, organized by domain:
data_products/
├── customers_domain/
│ └── customers.yaml # Customer data product manifest
└── orders_domain/
└── orders.yaml # Orders data product manifest
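The manifests are described above only as defining ownership, SLAs, and quality checks, so the sketch below is an assumption about what such a manifest might contain and how it could be sanity-checked. Every key name is hypothetical, and a plain dictionary stands in for the parsed YAML so the example needs nothing beyond the standard library.

```python
# Hypothetical manifest content; the real customers.yaml may use different keys.
manifest = {
    "name": "customers",
    "domain": "customers_domain",
    "owner": "customers-team@example.com",
    "sla": {"freshness_hours": 24},
    "quality_checks": [
        {"column": "customer_id", "check": "not_null"},
        {"column": "email", "check": "not_null"},
    ],
}

# Keys this sketch treats as mandatory (an assumption, not the project's rule).
REQUIRED_KEYS = ("name", "domain", "owner", "sla", "quality_checks")


def validate_manifest(doc: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in doc]
    if "quality_checks" in doc and not doc["quality_checks"]:
        problems.append("at least one quality check is expected")
    return problems


print(validate_manifest(manifest))  # []
print(validate_manifest({"name": "orders"}))
```

A check like this could run in CI so that a domain team cannot publish a data product without naming an owner.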
Query historical data using Iceberg's snapshot-based time travel:
-- Query at a specific snapshot
SELECT * FROM iceberg.orders_domain.orders
FOR VERSION AS OF 1234567890123456789
LIMIT 10;
-- Query at a specific timestamp
SELECT * FROM iceberg.orders_domain.orders
FOR TIMESTAMP AS OF TIMESTAMP '2024-01-15 10:00:00'
LIMIT 10;
Run `make demo-time-travel` for an interactive demonstration.
Iceberg maintains snapshots for every table modification:
-- List all snapshots
SELECT * FROM iceberg.orders_domain."orders$snapshots";
-- View table history
SELECT * FROM iceberg.orders_domain."orders$history";
-- View data files
SELECT * FROM iceberg.orders_domain."orders$files";
-- Expire old snapshots (keeps last 7 days)
ALTER TABLE iceberg.orders_domain.orders
EXECUTE expire_snapshots(retention_threshold => '7d');
Run `make demo-snapshots` for a comprehensive tour of snapshot metadata.
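To make the effect of a retention threshold concrete, here is a simplified model of which snapshots a 7-day threshold would make eligible for removal. It is a sketch under stated assumptions: the snapshot IDs and timestamps are made up, and the real procedure also accounts for things this model ignores, such as branches, tags, and table-level retention properties.

```python
from datetime import datetime, timedelta, timezone


def expirable_snapshots(snapshots, now, retention=timedelta(days=7)):
    """Return snapshot ids older than the retention window, never the newest one.

    snapshots: list of (snapshot_id, committed_at) tuples.
    """
    if not snapshots:
        return []
    cutoff = now - retention
    # The most recent snapshot is the current table state and is always kept.
    current_id = max(snapshots, key=lambda s: s[1])[0]
    return [
        sid
        for sid, committed_at in snapshots
        if committed_at < cutoff and sid != current_id
    ]


now = datetime(2024, 1, 15, tzinfo=timezone.utc)
snapshots = [  # illustrative values only
    (101, now - timedelta(days=30)),
    (102, now - timedelta(days=10)),
    (103, now - timedelta(days=2)),
]
print(expirable_snapshots(snapshots, now))  # [101, 102]
```

Once a snapshot is expired, time travel to it is no longer possible, so the threshold is effectively the time travel horizon.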
Test write throughput with configurable parameters:
# Basic test: 10K records
uv run python scripts/load_tests/write_load_test.py
# Custom configuration
uv run python scripts/load_tests/write_load_test.py \
--records 50000 \
--batch-size 5000 \
--workers 4
# Dry run (no actual writes)
uv run python scripts/load_tests/write_load_test.py --dry-run
# Output results to JSON
uv run python scripts/load_tests/write_load_test.py --output results.json
Run concurrent query load tests with Locust:
# Start Locust web UI at http://localhost:8089
make load-test-reads
# Or run headless
uv run locust -f scripts/load_tests/locustfile.py \
--host http://localhost:8080 \
--headless -u 10 -r 2 -t 60s
Query scenarios tested:
- Simple SELECT with LIMIT
- Aggregation queries (COUNT, SUM, GROUP BY)
- JOIN queries (customers + orders)
- Time range filters (partition pruning)
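For orientation, the four scenario types could correspond to queries like the ones below. Table and column names come from the schema in this README, but the exact SQL in `locustfile.py` is not shown here, so treat these statements and the `SCENARIOS` name as illustrative assumptions.

```python
# Table names follow the catalog.schema.table layout used throughout this README.
CUSTOMERS = "iceberg.customers_domain.customers"
ORDERS = "iceberg.orders_domain.orders"

# One representative statement per scenario type (illustrative, not the real locustfile).
SCENARIOS = {
    "simple_select": f"SELECT * FROM {ORDERS} LIMIT 100",
    "aggregation": (
        f"SELECT status, COUNT(*) AS order_count, SUM(total_amount) AS revenue "
        f"FROM {ORDERS} GROUP BY status"
    ),
    "join": (
        f"SELECT o.order_id, o.total_amount, c.email, c.country "
        f"FROM {ORDERS} o JOIN {CUSTOMERS} c ON o.customer_id = c.customer_id "
        f"LIMIT 100"
    ),
    # A bounded range on order_date lets the engine skip unrelated month partitions.
    "time_range": (
        f"SELECT COUNT(*) FROM {ORDERS} "
        f"WHERE order_date >= TIMESTAMP '2024-12-01 00:00:00' "
        f"AND order_date < TIMESTAMP '2025-01-01 00:00:00'"
    ),
}

for name, sql in SCENARIOS.items():
    print(f"-- {name}\n{sql};\n")
```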
This project includes a Model Context Protocol (MCP) server for Trino, enabling AI assistants like Claude Code to query the data warehouse directly.
| Tool | Description |
|---|---|
| `show_catalogs` | List all available catalogs |
| `show_schemas` | List schemas in a catalog |
| `show_tables` | List tables in a schema |
| `describe_table` | Show table columns and types |
| `execute_query` | Run SQL queries and return results |
| `show_catalog_tree` | Hierarchical view of all catalogs/schemas/tables |
| `show_create_table` | Show CREATE TABLE statement |
| `show_stats` | Show table statistics |
| `show_snapshots` | Show Iceberg table snapshots |
| `show_table_history` | Show Iceberg table changelog |
| `optimize` | Optimize Iceberg table data files |
| `expire_snapshots` | Remove old snapshots |
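Most of these tools plausibly wrap a single Trino statement. The mapping below is an assumption about what each tool might run, written as a standalone sketch; the actual server in `scripts/run_mcp_trino.py` may build its SQL differently, and `tool_sql` is a hypothetical name. `execute_query` (which takes raw SQL) and `show_catalog_tree` (which would combine several statements) are left out.

```python
def tool_sql(tool: str, catalog: str = "iceberg", schema: str = "", table: str = "") -> str:
    """Return the Trino statement a given tool might issue (assumed mapping)."""
    fq_schema = f"{catalog}.{schema}"
    fq_table = f"{fq_schema}.{table}"
    templates = {
        "show_catalogs": "SHOW CATALOGS",
        "show_schemas": f"SHOW SCHEMAS FROM {catalog}",
        "show_tables": f"SHOW TABLES FROM {fq_schema}",
        "describe_table": f"DESCRIBE {fq_table}",
        "show_create_table": f"SHOW CREATE TABLE {fq_table}",
        "show_stats": f"SHOW STATS FOR {fq_table}",
        # Iceberg metadata tables need quoting because of the "$" in the name.
        "show_snapshots": f'SELECT * FROM {fq_schema}."{table}$snapshots"',
        "show_table_history": f'SELECT * FROM {fq_schema}."{table}$history"',
        "optimize": f"ALTER TABLE {fq_table} EXECUTE optimize",
        "expire_snapshots": (
            f"ALTER TABLE {fq_table} "
            "EXECUTE expire_snapshots(retention_threshold => '7d')"
        ),
    }
    if tool not in templates:
        raise KeyError(f"no SQL template for tool: {tool}")
    return templates[tool]


print(tool_sql("show_tables", schema="orders_domain"))
print(tool_sql("show_snapshots", schema="orders_domain", table="orders"))
```

Identifiers are interpolated without escaping here, so a real server would need to validate catalog, schema, and table names before building a statement.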
Add to your .mcp.json:
{
"mcpServers": {
"trino": {
"command": "uv",
"args": ["run", "python", "scripts/run_mcp_trino.py"],
"cwd": "/path/to/iceburg"
}
}
}
List all tables:
> Use your trino tool to find all of the available tables under the iceberg catalog
⏺ Here are all the tables under the iceberg catalog:
customers_domain
| Table |
|-----------|
| customers |
orders_domain
| Table |
|--------|
| orders |
information_schema
| Table |
|------------------|
| applicable_roles |
| columns |
| enabled_roles |
| roles |
| schemata |
| table_privileges |
| tables |
| views |
system
| Table |
|----------------|
| iceberg_tables |
---
Business Tables:
- iceberg.customers_domain.customers
- iceberg.orders_domain.orders
Metadata Tables:
- iceberg.information_schema.* - Standard SQL metadata views
- iceberg.system.iceberg_tables - Iceberg-specific table metadata
Join orders with customers (last 20 orders):
> show me the last 20 orders with their associated customer information
∴ Thinking…
I got the results. Let me format this nicely for the user as a table.
⏺ Here are the last 20 orders with their customer information:
| Order ID | Order Date | Status | Total | Items | Customer | Email | Country |
|----------|---------------------|-----------|---------|-------|-----------------|---------------------------------|---------|
| 8208 | 2024-12-31 23:59:31 | delivered | $17.92 | 1 | Madison Watson | madison.watson.944@example.com | FR |
| 1285 | 2024-12-31 21:11:13 | delivered | $380.59 | 7 | Justin Lindsey | justin.lindsey.305@example.com | US |
| 3039 | 2024-12-31 18:56:31 | shipped | $443.08 | 9 | Heidi Armstrong | heidi.armstrong.272@example.com | US |
| 5591 | 2024-12-31 17:07:42 | delivered | $349.08 | 6 | Deborah Lynch | deborah.lynch.19@example.com | CA |
| 8382 | 2024-12-31 16:01:47 | delivered | $168.98 | 8 | Robert Fowler | robert.fowler.539@example.com | FR |
| 7384 | 2024-12-31 15:53:57 | shipped | $55.27 | 3 | Kenneth Mendez | kenneth.mendez.630@example.com | FR |
| 6142 | 2024-12-31 15:49:29 | shipped | $470.22 | 4 | Theodore Reid | theodore.reid.476@example.com | AU |
| 7710 | 2024-12-31 13:24:50 | shipped | $241.43 | 10 | Eric Yu | eric.yu.77@example.com | AU |
| 4932 | 2024-12-31 13:20:56 | delivered | $468.72 | 5 | Peter Gamble | peter.gamble.989@example.com | FR |
| 6437 | 2024-12-31 12:27:43 | delivered | $313.46 | 1 | William Davis | william.davis.30@example.com | DE |
| 8662 | 2024-12-31 11:46:47 | delivered | $225.37 | 10 | Ashlee Sparks | ashlee.sparks.291@example.com | US |
| 4600 | 2024-12-31 11:30:42 | delivered | $250.51 | 8 | David Lee | david.lee.208@example.com | US |
| 7668 | 2024-12-31 06:27:23 | shipped | $119.17 | 4 | Mitchell Ramos | mitchell.ramos.333@example.com | FR |
| 1538 | 2024-12-31 02:27:33 | delivered | $421.98 | 10 | Darren Shepard | darren.shepard.783@example.com | FR |
| 1488 | 2024-12-31 02:07:47 | delivered | $58.38 | 3 | Nicholas Walton | nicholas.walton.172@example.com | DE |
| 8817 | 2024-12-30 23:32:15 | cancelled | $342.65 | 2 | Erika Rush | erika.rush.505@example.com | US |
| 6101 | 2024-12-30 20:59:10 | delivered | $357.81 | 5 | Michael Cox | michael.cox.437@example.com | US |
| 5410 | 2024-12-30 20:40:19 | delivered | $386.63 | 2 | Julie Phillips | julie.phillips.873@example.com | US |
| 2877 | 2024-12-30 16:56:15 | delivered | $110.30 | 3 | Michael Carson | michael.carson.818@example.com | US |
| 7512 | 2024-12-30 15:43:37 | delivered | $223.36 | 6 | Rhonda Garrett | rhonda.garrett.670@example.com | UK |
---
Summary:
- Date range: Dec 30-31, 2024
- Status breakdown: 14 delivered, 5 shipped, 1 cancelled
- Countries represented: US (8), FR (6), DE (2), AU (2), CA (1), UK (1)
This query joins the two domain tables to show order details alongside customer information, demonstrating cross-domain data access while respecting domain boundaries.
Data is stored on your host filesystem at ./data/minio/warehouse/:
data/minio/warehouse/
├── customers_domain/
│ └── customers/ # Parquet files partitioned by country
└── orders_domain/
└── orders/ # Parquet files partitioned by month
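The directory layout follows from the partition values Iceberg derives for each row. As a small sketch, assuming an identity partition on `country` and a month transform on `order_date` (as the schema annotations suggest), the values can be computed as below. The function names and the `order_date_month` field name are hypothetical; Iceberg's month transform is defined as the number of months since 1970-01.

```python
from datetime import datetime


def month_partition(ts: datetime) -> int:
    """Iceberg month transform: whole months elapsed since 1970-01."""
    return (ts.year - 1970) * 12 + (ts.month - 1)


def orders_partition(order: dict) -> dict:
    """Partition value for an orders row (month of order_date)."""
    return {"order_date_month": month_partition(order["order_date"])}


def customers_partition(customer: dict) -> dict:
    """Partition value for a customers row (identity on country)."""
    return {"country": customer["country"]}


# Order 8208 from the sample output earlier in this README.
order = {"order_id": 8208, "order_date": datetime(2024, 12, 31, 23, 59, 31)}
print(orders_partition(order))                 # {'order_date_month': 659}
print(customers_partition({"country": "FR"}))  # {'country': 'FR'}
```

Rows that share a partition value are written under the same directory, which is what allows a query with a time range or country filter to skip unrelated files.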
MIT