New lab for multi-topic fanout to Iceberg tables #279
Conversation
Demonstrates routing batched events to multiple domain-specific Iceberg topics using Wasm transforms. Includes a Go producer with franz-go, a transform with header-based routing, JSON Schema validation, and Spark/Jupyter for querying Iceberg tables.

Known issue: Iceberg tables are not being created. Redpanda logs show `type_resolver::errc::registry_error` when writing to Iceberg topics configured with `value_schema_latest` mode. Messages route correctly to all topics and JSON schemas are properly registered, but the Iceberg integration fails during the schema resolution step.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
✅ Deploy Preview for redpanda-labs-preview ready!
:env-docker: true
:page-layout: lab
:page-categories: Data Transforms, Iceberg, Schema Registry
:description: Route batched events to multiple Iceberg-enabled topics using Wasm transforms with Avro encoding and Schema Registry wire format, creating a streaming data lakehouse pipeline.
- is the word "batched" necessary here?
- "multiple Iceberg-enabled topics": that's an implementation detail, so maybe drop it from the description and replace it with "multiple Iceberg tables" (as the title says), and instead introduce later that "Redpanda supports at most one Iceberg table per topic, so to route to multiple tables we are going to use transforms to fan out". Otherwise, it is not clear why fanout is important, why transforms are necessary, or how this lab differs from the other Iceberg lab we already have.
== Overview
This lab demonstrates how to build a streaming data lakehouse using Redpanda's Wasm data transforms and Iceberg integration. You'll deploy a transform that performs true 1:N fanout, parsing batch messages and routing individual updates from a single input topic to multiple domain-specific output topics, each configured as an Iceberg table for analytics.
Similar comment as above (https://github.com/redpanda-data/redpanda-labs/pull/279/changes#r2873829510). I believe the overview should state the problem and why this lab is worth looking into.
A Go producer sends complete JSON batches to an input topic. The transform dynamically registers Avro schemas in Schema Registry during initialization, then parses each batch, converts the data to Avro format with Schema Registry wire format encoding, and fans out messages to the appropriate Iceberg-enabled topics. Redpanda automatically validates messages against the Avro schemas and writes to Iceberg tables.
Not clear why we need Avro. Worth adding a sentence on why we use both Avro and JSON, otherwise it is confusing. Fine to just say "to demo capabilities" if that's the intention.
Since the title is about fan-out, we could focus only on fan-out and not mix technologies (JSON vs. Avro). But again, fine if you want to show more things. Just make it explicit so that the reader can follow the train of thought.
This approach offers several advantages:
* No external ETL: routing and encoding happen inside Redpanda brokers using Wasm transforms.
* Messages include schema IDs for automatic validation and deserialization.
nit: what does "automatic validation" mean? We don't validate broker-side. Do you mean we do that in transforms? I'd be explicit.
* Operational simplicity: a single Redpanda cluster handles routing, encoding, validation, and storage.
* Iceberg tables are immediately queryable by Spark.
Consider this approach for use cases like:
Consider lifting this higher, well before you start describing the solution.
spark.sql("DESCRIBE lab.redpanda.customers").show()
----
NOTE: It may take a few seconds for data to appear in Iceberg tables after producing. Redpanda writes to Iceberg based on the `iceberg_target_lag_ms` setting (5 seconds in this lab).
and `iceberg_catalog_commit_interval_ms`
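A sketch of how these two knobs might be tuned with rpk, assuming both `iceberg_target_lag_ms` and `iceberg_catalog_commit_interval_ms` are cluster configuration properties in the Redpanda version the lab targets (the names come from the note and comment above; verify them and the accepted values against your version's docs):

```shell
# Assumed cluster properties; confirm with:
#   rpk cluster config get iceberg_target_lag_ms
#   rpk cluster config get iceberg_catalog_commit_interval_ms

# How far Iceberg data files may lag behind the topic (ms).
rpk cluster config set iceberg_target_lag_ms 5000

# How often the catalog commit runs (ms); data is only visible
# to query engines after a catalog commit, so both settings
# affect end-to-end freshness.
rpk cluster config set iceberg_catalog_commit_interval_ms 10000
```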
│
▼
┌──────────────────────────────┐
│ Spark / Jupyter │
Needs an extra box for the Iceberg REST catalog itself. I'd put it on the same line as object storage, as both the Iceberg-enabled topics (above) and Spark/Jupyter (below) communicate with both: object storage and the Iceberg catalog.
@@ -0,0 +1,20 @@
{
what are these used for?
@@ -0,0 +1,7 @@
jupyter==1.0.0
spylon-kernel==0.4.1
pyiceberg[pyarrow,duckdb,pandas]==0.7.1
what are pyiceberg and duckdb used for?
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mAnalysisException\u001b[0m Traceback (most recent call last)",
This shouldn't contain exceptions, imho, when committed.
Either do a clean execution so that there is something to render on GitHub (which supports rendering .ipynb files), or strip outputs with https://github.com/kynan/nbstripout before committing.
rockwotj left a comment:
Can look after Nicolae's feedback is addressed.
Demonstrates 1:N fanout using Wasm transforms to route batched events to multiple domain-specific Iceberg topics. The transform parses batch messages and fans out individual updates, showcasing Redpanda's in-broker stream processing capabilities.
Key Features:

Data Flow:
10 batch messages → Transform parses → 30 individual messages (10 each to orders, inventory, customers topics)
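The 10-in / 30-out flow above can be sketched as a pure function, assuming a batch shape with `orders`, `inventory`, and `customers` arrays (the lab's actual message schema may differ); inside the broker, the transform would write each payload to its topic rather than return a map:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Batch is an assumed shape for the input messages: one JSON document
// carrying updates for three domains. The real lab's schema may differ.
type Batch struct {
	Orders    []json.RawMessage `json:"orders"`
	Inventory []json.RawMessage `json:"inventory"`
	Customers []json.RawMessage `json:"customers"`
}

// fanout splits one batch into per-topic payloads, mirroring the
// 1:N routing the transform performs inside the broker.
func fanout(batch []byte) (map[string][][]byte, error) {
	var b Batch
	if err := json.Unmarshal(batch, &b); err != nil {
		return nil, err
	}
	out := map[string][][]byte{}
	for topic, items := range map[string][]json.RawMessage{
		"orders":    b.Orders,
		"inventory": b.Inventory,
		"customers": b.Customers,
	} {
		for _, item := range items {
			out[topic] = append(out[topic], []byte(item))
		}
	}
	return out, nil
}

func main() {
	batch := []byte(`{"orders":[{"id":1}],"inventory":[{"sku":"a"}],"customers":[{"id":9}]}`)
	routed, err := fanout(batch)
	if err != nil {
		panic(err)
	}
	// One update per domain in this sample batch.
	fmt.Println(len(routed["orders"]), len(routed["inventory"]), len(routed["customers"]))
}
```

With 10 batches of 10 updates per domain, this mapping yields the 30-messages-out figure quoted above.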
To test:

Clone the repo (`git clone https://github.com/redpanda-data/redpanda-labs.git`) and check out the feature branch.

Preview: https://deploy-preview-279--redpanda-labs-preview.netlify.app/redpanda-labs/data-transforms/iceberg-fanout-go/
Closes DOC-1965