Data Mesh (Apache Iceberg Demo)

A local demonstration of Apache Iceberg capabilities using Docker, structured as a minimal data mesh with domain-oriented data ownership.

This demo implements a minimal data mesh with:

| Principle            | Implementation                                            |
|----------------------|-----------------------------------------------------------|
| Domain Ownership     | Separate namespaces: customers_domain, orders_domain      |
| Data as a Product    | YAML manifests defining ownership, SLAs, quality checks   |
| Self-Serve Platform  | Iceberg + Trino + MinIO as shared infrastructure          |
| GenAI Tool Support   | MCP interface for AI-assisted queries                     |
| Federated Governance | Not yet implemented                                       |

Overview

This project demonstrates:

  • Apache Iceberg table format with partitioning
  • Data Mesh principles: domain ownership, data as a product
  • Trino for SQL queries
  • PyIceberg for data generation
  • MinIO for S3-compatible storage (host filesystem mounted)
  • Apache Superset for visualization
  • Time Travel and Snapshot Management features

Demo Service Architecture

                         +-------------------------+
                         |        Superset         |
                         |     (dashboard UI)      |
                         +-------------------------+
                                      |
                                      |
                                      v
+-------------------+    +-------------------------+
|   python scripts  |    |          Trino          |
|      (writes)     |    |      (query engine)     |
+-------------------+    +-------------------------+
          \                       /
           \                     /
            \                   /
             \                 /
              v               v
        +---------------------------------------+
        |                Iceberg                |--+
        |  (catalog + data format + snapshot    |  |
        |         management + etc...)          |  |
        +---------------------------------------+  |
          +----------------------------------------+
                            |
                            |
                            v
                  +-------------------------+
                  |          MinIO          |--+
                  |   (S3 storage engine)   |  |
                  +-------------------------+  |
                    +--------------------------+

| Service       | Port | Description                         |
|---------------|------|-------------------------------------|
| MinIO Console | 9001 | Object storage UI (admin/password)  |
| MinIO API     | 9000 | S3-compatible API                   |
| REST Catalog  | 8181 | Iceberg catalog                     |
| Trino         | 8080 | Query engine                        |
| Superset      | 8088 | Visualization (admin/admin)         |

Prerequisites

  • Docker and Docker Compose
  • Python 3.13+
  • uv (Python package manager)
  • Make

Quick Start

# Install Python dependencies
uv sync

# Start all services
make up

# Wait for services to be ready, then verify
make verify-infra

# Create tables
make create-tables

# Generate small dataset (1K customers, 10K orders)
make generate-small

# Open Trino CLI
make trino

# Access Superset at http://localhost:8088
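
Once the services are up and the tables exist, you can also reach the Iceberg REST catalog directly from Python with PyIceberg. The endpoint and MinIO credentials below are assumptions taken from the service table above; adjust them to match your environment.

from pyiceberg.catalog import load_catalog

# Connect to the Iceberg REST catalog and the MinIO S3 endpoint
# (ports and credentials assumed from the service table above).
catalog = load_catalog(
    "rest",
    **{
        "uri": "http://localhost:8181",
        "s3.endpoint": "http://localhost:9000",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
    },
)

# List the domain namespaces and the tables they own.
print(catalog.list_namespaces())
print(catalog.list_tables("orders_domain"))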

Demo Walkthrough

For a comprehensive step-by-step guide through all features, see DEMO.md.


Make Targets

Infrastructure

| Target            | Description                        |
|-------------------|------------------------------------|
| make up           | Start all Docker services          |
| make down         | Stop all services                  |
| make logs         | Tail service logs                  |
| make ps           | Show service status                |
| make clean        | Remove all data and volumes        |
| make wait-healthy | Wait for all services to be ready  |

Database & Data

| Target              | Description                        |
|---------------------|------------------------------------|
| make trino          | Open Trino CLI                     |
| make create-tables  | Initialize Iceberg tables          |
| make setup-superset | Configure Superset with Trino      |
| make generate-small | Generate 1K customers, 10K orders  |
| make generate-full  | Generate 1M customers, 10M orders  |

Demos

| Target                | Description                     |
|-----------------------|---------------------------------|
| make demo-time-travel | Run time travel demonstration   |
| make demo-snapshots   | Run snapshot management demo    |

Load Testing

| Target                | Description                           |
|-----------------------|---------------------------------------|
| make load-test-writes | Run write throughput tests            |
| make load-test-reads  | Run read load tests (Locust web UI)   |

Verification

| Target               | Description                   |
|----------------------|-------------------------------|
| make verify-infra    | Verify infrastructure is up   |
| make verify-tables   | Verify tables are created     |
| make verify-data     | Verify data generation        |
| make verify-queries  | Verify queries and Superset   |
| make verify-iceberg  | Verify Iceberg features       |
| make verify-loadtest | Verify load testing tools     |
| make verify-all      | Run all verifications         |

Data Model

erDiagram
    customers {
        int customer_id PK
        string email
        string first_name
        string last_name
        timestamp created_at
        string country "partition key"
    }

    orders {
        int order_id PK
        int customer_id FK
        timestamp order_date "partition key (month)"
        string status "pending|shipped|delivered|cancelled"
        decimal total_amount
        int items_count
    }

    customers ||--o{ orders : "has many"

These entities are described by the following data product manifests, organized by owning domain:

data_products/
├── customers_domain/
│   └── customers.yaml     # Customer data product manifest
└── orders_domain/
    └── orders.yaml        # Orders data product manifest

Iceberg Features

Time Travel

Query historical data using Iceberg's snapshot-based time travel:

-- Query at a specific snapshot
SELECT * FROM iceberg.orders_domain.orders
FOR VERSION AS OF 1234567890123456789
LIMIT 10;

-- Query at a specific timestamp
SELECT * FROM iceberg.orders_domain.orders
FOR TIMESTAMP AS OF TIMESTAMP '2024-01-15 10:00:00'
LIMIT 10;

Run make demo-time-travel for an interactive demonstration.
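
The same time travel is available from Python. A minimal sketch with PyIceberg, assuming the catalog settings from the Quick Start snippet and that at least one earlier snapshot exists:

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "rest",
    **{
        "uri": "http://localhost:8181",
        "s3.endpoint": "http://localhost:9000",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
    },
)
table = catalog.load_table("orders_domain.orders")

# Pick the oldest snapshot recorded in the table history and scan it.
oldest_snapshot_id = table.history()[0].snapshot_id
rows = table.scan(snapshot_id=oldest_snapshot_id, limit=10).to_arrow()
print(rows)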

Snapshot Management

Iceberg maintains snapshots for every table modification:

-- List all snapshots
SELECT * FROM iceberg.orders_domain."orders$snapshots";

-- View table history
SELECT * FROM iceberg.orders_domain."orders$history";

-- View data files
SELECT * FROM iceberg.orders_domain."orders$files";

-- Expire old snapshots (keeps last 7 days)
ALTER TABLE iceberg.orders_domain.orders
  EXECUTE expire_snapshots(retention_threshold => '7d');

Run make demo-snapshots for a comprehensive tour of snapshot metadata.
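
Snapshot metadata can also be inspected from PyIceberg. A small sketch, reusing the assumed catalog settings from above:

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "rest",
    **{
        "uri": "http://localhost:8181",
        "s3.endpoint": "http://localhost:9000",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
    },
)
table = catalog.load_table("orders_domain.orders")

# Every committed write adds a snapshot; list them with their summaries.
for snap in table.snapshots():
    print(snap.snapshot_id, snap.timestamp_ms, snap.summary)

# The snapshot Trino reads by default.
print(table.current_snapshot())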


Load Testing

Write Performance

Test write throughput with configurable parameters:

# Basic test: 10K records
uv run python scripts/load_tests/write_load_test.py

# Custom configuration
uv run python scripts/load_tests/write_load_test.py \
  --records 50000 \
  --batch-size 5000 \
  --workers 4

# Dry run (no actual writes)
uv run python scripts/load_tests/write_load_test.py --dry-run

# Output results to JSON
uv run python scripts/load_tests/write_load_test.py --output results.json
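
The load test script handles batching and workers; as a rough illustration of what a single PyIceberg append looks like, here is a one-row sketch. Column types, decimal precision, and catalog settings are assumptions based on the data model and service table above, and the Arrow schema must match the actual table schema.

from datetime import datetime
from decimal import Decimal

import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "rest",
    **{
        "uri": "http://localhost:8181",
        "s3.endpoint": "http://localhost:9000",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
    },
)
table = catalog.load_table("orders_domain.orders")

# Build a tiny Arrow batch shaped like the orders data model.
batch = pa.table({
    "order_id": pa.array([900001], pa.int32()),
    "customer_id": pa.array([42], pa.int32()),
    "order_date": pa.array([datetime(2024, 12, 31, 12, 0)], pa.timestamp("us")),
    "status": pa.array(["pending"], pa.string()),
    "total_amount": pa.array([Decimal("19.99")], pa.decimal128(10, 2)),
    "items_count": pa.array([1], pa.int32()),
})

# Each append commits a new Iceberg snapshot.
table.append(batch)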

Read Performance (Locust)

Run concurrent query load tests with Locust:

# Start Locust web UI at http://localhost:8089
make load-test-reads

# Or run headless
uv run locust -f scripts/load_tests/locustfile.py \
  --host http://localhost:8080 \
  --headless -u 10 -r 2 -t 60s

Query scenarios tested:

  • Simple SELECT with LIMIT
  • Aggregation queries (COUNT, SUM, GROUP BY)
  • JOIN queries (customers + orders)
  • Time range filters (partition pruning)
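
The repository's locustfile covers the scenarios above; as a minimal sketch of how such a user class can issue queries through Trino's REST API (the user name and query below are illustrative):

from locust import HttpUser, task, between

class TrinoQueryUser(HttpUser):
    wait_time = between(1, 3)

    @task
    def simple_select(self):
        # Submit a statement; Trino returns result pages via nextUri links.
        resp = self.client.post(
            "/v1/statement",
            data="SELECT * FROM iceberg.orders_domain.orders LIMIT 10",
            headers={"X-Trino-User": "loadtest"},
        ).json()
        while "nextUri" in resp:
            resp = self.client.get(resp["nextUri"]).json()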

MCP Interface (Claude Code Integration)

This project includes a Model Context Protocol (MCP) server for Trino, enabling AI assistants like Claude Code to query the data warehouse directly.

Available MCP Tools

| Tool               | Description                                       |
|--------------------|---------------------------------------------------|
| show_catalogs      | List all available catalogs                       |
| show_schemas       | List schemas in a catalog                         |
| show_tables        | List tables in a schema                           |
| describe_table     | Show table columns and types                      |
| execute_query      | Run SQL queries and return results                |
| show_catalog_tree  | Hierarchical view of all catalogs/schemas/tables  |
| show_create_table  | Show CREATE TABLE statement                       |
| show_stats         | Show table statistics                             |
| show_snapshots     | Show Iceberg table snapshots                      |
| show_table_history | Show Iceberg table changelog                      |
| optimize           | Optimize Iceberg table data files                 |
| expire_snapshots   | Remove old snapshots                              |

Configuration

Add to your .mcp.json:

{
  "mcpServers": {
    "trino": {
      "command": "uv",
      "args": ["run", "python", "scripts/run_mcp_trino.py"],
      "cwd": "/path/to/iceburg"
    }
  }
}
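
For reference, a hypothetical sketch of how such a server could expose a Trino query tool using the official Python MCP SDK and the trino client; this is not necessarily how scripts/run_mcp_trino.py is implemented:

import trino
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("trino")

def _connect():
    # Connect to the local Trino coordinator started by make up
    # (host, port, and user are assumptions).
    return trino.dbapi.connect(host="localhost", port=8080, user="mcp")

@mcp.tool()
def execute_query(sql: str) -> list:
    """Run a SQL query against Trino and return the result rows."""
    cur = _connect().cursor()
    cur.execute(sql)
    return cur.fetchall()

if __name__ == "__main__":
    mcp.run()  # serves over stdio for clients such as Claude Code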

Example Queries

List all tables:


> Use your trino tool to find all of the available tables under the iceberg catalog 

⏺ Here are all the tables under the iceberg catalog:

  customers_domain

  | Table     |
  |-----------|
  | customers |

  orders_domain

  | Table  |
  |--------|
  | orders |

  information_schema

  | Table            |
  |------------------|
  | applicable_roles |
  | columns          |
  | enabled_roles    |
  | roles            |
  | schemata         |
  | table_privileges |
  | tables           |
  | views            |

  system

  | Table          |
  |----------------|
  | iceberg_tables |

  ---
  Business Tables:
  - iceberg.customers_domain.customers
  - iceberg.orders_domain.orders

  Metadata Tables:
  - iceberg.information_schema.* - Standard SQL metadata views
  - iceberg.system.iceberg_tables - Iceberg-specific table metadata

Join orders with customers (last 20 orders):

> show me the last 20 orders with their associated customer information 

∴ Thinking…

  I got the results. Let me format this nicely for the user as a table.

⏺ Here are the last 20 orders with their customer information:

  | Order ID | Order Date          | Status    | Total   | Items | Customer        | Email                           | Country |
  |----------|---------------------|-----------|---------|-------|-----------------|---------------------------------|---------|
  | 8208     | 2024-12-31 23:59:31 | delivered | $17.92  | 1     | Madison Watson  | madison.watson.944@example.com  | FR      |
  | 1285     | 2024-12-31 21:11:13 | delivered | $380.59 | 7     | Justin Lindsey  | justin.lindsey.305@example.com  | US      |
  | 3039     | 2024-12-31 18:56:31 | shipped   | $443.08 | 9     | Heidi Armstrong | heidi.armstrong.272@example.com | US      |
  | 5591     | 2024-12-31 17:07:42 | delivered | $349.08 | 6     | Deborah Lynch   | deborah.lynch.19@example.com    | CA      |
  | 8382     | 2024-12-31 16:01:47 | delivered | $168.98 | 8     | Robert Fowler   | robert.fowler.539@example.com   | FR      |
  | 7384     | 2024-12-31 15:53:57 | shipped   | $55.27  | 3     | Kenneth Mendez  | kenneth.mendez.630@example.com  | FR      |
  | 6142     | 2024-12-31 15:49:29 | shipped   | $470.22 | 4     | Theodore Reid   | theodore.reid.476@example.com   | AU      |
  | 7710     | 2024-12-31 13:24:50 | shipped   | $241.43 | 10    | Eric Yu         | eric.yu.77@example.com          | AU      |
  | 4932     | 2024-12-31 13:20:56 | delivered | $468.72 | 5     | Peter Gamble    | peter.gamble.989@example.com    | FR      |
  | 6437     | 2024-12-31 12:27:43 | delivered | $313.46 | 1     | William Davis   | william.davis.30@example.com    | DE      |
  | 8662     | 2024-12-31 11:46:47 | delivered | $225.37 | 10    | Ashlee Sparks   | ashlee.sparks.291@example.com   | US      |
  | 4600     | 2024-12-31 11:30:42 | delivered | $250.51 | 8     | David Lee       | david.lee.208@example.com       | US      |
  | 7668     | 2024-12-31 06:27:23 | shipped   | $119.17 | 4     | Mitchell Ramos  | mitchell.ramos.333@example.com  | FR      |
  | 1538     | 2024-12-31 02:27:33 | delivered | $421.98 | 10    | Darren Shepard  | darren.shepard.783@example.com  | FR      |
  | 1488     | 2024-12-31 02:07:47 | delivered | $58.38  | 3     | Nicholas Walton | nicholas.walton.172@example.com | DE      |
  | 8817     | 2024-12-30 23:32:15 | cancelled | $342.65 | 2     | Erika Rush      | erika.rush.505@example.com      | US      |
  | 6101     | 2024-12-30 20:59:10 | delivered | $357.81 | 5     | Michael Cox     | michael.cox.437@example.com     | US      |
  | 5410     | 2024-12-30 20:40:19 | delivered | $386.63 | 2     | Julie Phillips  | julie.phillips.873@example.com  | US      |
  | 2877     | 2024-12-30 16:56:15 | delivered | $110.30 | 3     | Michael Carson  | michael.carson.818@example.com  | US      |
  | 7512     | 2024-12-30 15:43:37 | delivered | $223.36 | 6     | Rhonda Garrett  | rhonda.garrett.670@example.com  | UK      |

  ---
  Summary:
  - Date range: Dec 30-31, 2024
  - Status breakdown: 15 delivered, 4 shipped, 1 cancelled
  - Countries represented: US (8), FR (6), DE (2), AU (2), CA (1), UK (1)

This query joins the two domain tables to show order details alongside customer information, demonstrating cross-domain data access while respecting domain boundaries.


Local Data Storage

Data is stored on your host filesystem at ./data/minio/warehouse/:

data/minio/warehouse/
├── customers_domain/
│   └── customers/        # Parquet files partitioned by country
└── orders_domain/
    └── orders/           # Parquet files partitioned by month

License

MIT
