New lab for multi-topic fanout to Iceberg tables#279

Open
kbatuigas wants to merge 10 commits into main from iceberg-multi-topic-fanout-using-transforms

Conversation

kbatuigas (Contributor) commented Feb 20, 2026

Demonstrates 1:N fanout using Wasm transforms to route batched events to multiple domain-specific Iceberg topics. The transform parses batch messages and fans out individual updates, showcasing Redpanda's in-broker stream processing capabilities.

Key Features:

  • Producer: Sends complete JSON batches to input topic (events)
  • Transform: Parses batches and performs 1:N fanout (1 batch → 3 individual messages)
    • Fanout Logic: Transform extracts individual updates and routes to target topics
    • Schema Validation: Avro validation via Schema Registry
  • Iceberg Integration: Automatic writes to queryable Iceberg tables
  • Analytics: Spark/Jupyter notebooks for querying routed data

Data Flow:
10 batch messages → Transform parses → 30 individual messages (10 each to orders, inventory, customers topics)
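The 1:N parsing step above can be sketched in plain Go, outside the Wasm SDK. This is an illustrative sketch only: the `BatchEvent` shape and its field names (`orders`, `inventory`, `customers`) are assumptions based on the topic names in this PR, not the lab's actual schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BatchEvent is a hypothetical shape for the batch messages on the
// input topic; the lab's actual field names may differ.
type BatchEvent struct {
	Orders    []json.RawMessage `json:"orders"`
	Inventory []json.RawMessage `json:"inventory"`
	Customers []json.RawMessage `json:"customers"`
}

// FanOut parses one batch and returns per-topic payload slices,
// one payload per individual update -- the 1:N step the transform performs.
func FanOut(batch []byte) (map[string][][]byte, error) {
	var b BatchEvent
	if err := json.Unmarshal(batch, &b); err != nil {
		return nil, err
	}
	out := map[string][][]byte{}
	for topic, updates := range map[string][]json.RawMessage{
		"orders": b.Orders, "inventory": b.Inventory, "customers": b.Customers,
	} {
		for _, u := range updates {
			out[topic] = append(out[topic], []byte(u))
		}
	}
	return out, nil
}

func main() {
	batch := []byte(`{"orders":[{"id":1}],"inventory":[{"sku":"a"}],"customers":[{"id":9}]}`)
	routed, err := FanOut(batch)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(routed["orders"]), len(routed["inventory"]), len(routed["customers"])) // prints "1 1 1"
}
```

In the actual transform, each returned payload would be written to its target topic via the Redpanda transform SDK's record writer rather than collected in a map.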

To test:

  1. After cloning the repo (git clone https://github.com/redpanda-data/redpanda-labs.git), check out the feature branch:
cd redpanda-labs && git checkout iceberg-multi-topic-fanout-using-transforms
  2. Change into the lab directory:
cd data-transforms/go/iceberg-fanout
  3. Then continue with the rest of the lab ("Set environment variables for Redpanda and Console versions..." etc.)

Preview: https://deploy-preview-279--redpanda-labs-preview.netlify.app/redpanda-labs/data-transforms/iceberg-fanout-go/

Closes DOC-1965

Demonstrates routing batched events to multiple domain-specific Iceberg topics using Wasm transforms. Includes Go producer with franz-go, transform with header-based routing, JSON Schema validation, and Spark/Jupyter for querying Iceberg tables.

Known issue: Iceberg tables not being created. Redpanda logs show "type_resolver::errc::registry_error" when trying to write to Iceberg topics configured with value_schema_latest mode. Messages route correctly to all topics and JSON schemas are properly registered, but Iceberg integration fails during schema resolution step.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
netlify (bot) commented Feb 20, 2026

Deploy Preview for redpanda-labs-preview ready!

  • 🔨 Latest commit: 4c174d4
  • 🔍 Latest deploy log: https://app.netlify.com/projects/redpanda-labs-preview/deploys/699f876fb3532700088d7a1b
  • 😎 Deploy Preview: https://deploy-preview-279--redpanda-labs-preview.netlify.app

@kbatuigas kbatuigas changed the title Add multi-topic fanout to Iceberg tables lab New lab for multi-topic fanout to Iceberg tables Feb 20, 2026
@kbatuigas kbatuigas requested a review from rockwotj February 26, 2026 20:36
:env-docker: true
:page-layout: lab
:page-categories: Data Transforms, Iceberg, Schema Registry
:description: Route batched events to multiple Iceberg-enabled topics using Wasm transforms with Avro encoding and Schema Registry wire format, creating a streaming data lakehouse pipeline.
Contributor:

  • Is the word "batched" necessary here?
  • "multiple Iceberg-enabled topics": that's an implementation detail, so maybe drop it from the description, replace it with "multiple Iceberg tables" (as the title says), and introduce later that "Redpanda supports at most one Iceberg table per topic, so to route to multiple tables we use transforms to fan out". Otherwise it is not clear why fanout is important, why transforms are necessary, or how this lab differs from the other Iceberg lab we already have.


== Overview

This lab demonstrates how to build a streaming data lakehouse using Redpanda's Wasm data transforms and Iceberg integration. You'll deploy a transform that performs true 1:N fanout, parsing batch messages and routing individual updates from a single input topic to multiple domain-specific output topics, each configured as an Iceberg table for analytics.
Contributor:

Similar comment as above (https://github.com/redpanda-data/redpanda-labs/pull/279/changes#r2873829510). I believe the overview should state the problem and why this lab is worth looking into.


This lab demonstrates how to build a streaming data lakehouse using Redpanda's Wasm data transforms and Iceberg integration. You'll deploy a transform that performs true 1:N fanout, parsing batch messages and routing individual updates from a single input topic to multiple domain-specific output topics, each configured as an Iceberg table for analytics.

A Go producer sends complete JSON batches to an input topic. The transform dynamically registers Avro schemas in Schema Registry during initialization, then parses each batch, converts the data to Avro format with Schema Registry wire format encoding, and fans out messages to the appropriate Iceberg-enabled topics. Redpanda automatically validates messages against the Avro schemas and writes to Iceberg tables.
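The Schema Registry wire format mentioned above is a fixed 5-byte header in front of the encoded Avro payload: a magic byte `0x00` followed by the schema ID as a 4-byte big-endian integer. A minimal Go sketch of the framing (the `Frame` helper is illustrative, not the lab's code):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Frame prepends the Confluent Schema Registry wire format header to an
// already-encoded Avro payload: magic byte 0x00, then the schema ID as a
// 4-byte big-endian integer, then the payload itself.
func Frame(schemaID int32, avroPayload []byte) []byte {
	buf := make([]byte, 5+len(avroPayload))
	buf[0] = 0x00 // magic byte
	binary.BigEndian.PutUint32(buf[1:5], uint32(schemaID))
	copy(buf[5:], avroPayload)
	return buf
}

func main() {
	msg := Frame(42, []byte{2, 4}) // 42 is a made-up schema ID
	fmt.Println(msg)               // prints "[0 0 0 0 42 2 4]"
}
```

Consumers (and Redpanda's Iceberg schema resolution) read those first five bytes to look up the schema before decoding the rest of the message.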
Contributor:

Not clear why we need Avro. Worth adding a sentence on why we use both Avro and JSON; otherwise it is confusing. Fine to just say "to demo capabilities" if that's the intention.

Since the title is about fan-out, we could focus only on fan-out and not mix technologies (JSON vs. Avro). But again, fine if you want to show more things. Just make it explicit so that the reader can follow the train of thought.

This approach offers several advantages:

* No external ETL: routing and encoding happen inside Redpanda brokers using Wasm transforms.
* Messages include schema IDs for automatic validation and deserialization.
Contributor:

nit: what does "automatic validation" mean? We don't validate broker-side. Do you mean that we do it in transforms? I'd be explicit.

* Operational simplicity: a single Redpanda cluster handles routing, encoding, validation, and storage.
* Iceberg tables are immediately queryable by Spark.

Consider this approach for use cases like:
Contributor:

Consider lifting this higher, well before you start describing the solution.

spark.sql("DESCRIBE lab.redpanda.customers").show()
----

NOTE: It may take a few seconds for data to appear in Iceberg tables after producing. Redpanda writes to Iceberg based on the `iceberg_target_lag_ms` setting (5 seconds in this lab).
Contributor:

and `iceberg_catalog_commit_interval_ms`

┌──────────────────────────────┐
│ Spark / Jupyter │
Contributor:

Needs an extra box for iceberg rest catalog itself. I'd put it on the same line as object storage as both iceberg-enabled topics (above) and spark/jupyter (below) communicate with both: object storage and iceberg catalog.

@@ -0,0 +1,20 @@
{
Contributor:

what are these used for?

@@ -0,0 +1,7 @@
jupyter==1.0.0
spylon-kernel==0.4.1
pyiceberg[pyarrow,duckdb,pandas]==0.7.1
Contributor:

what are pyiceberg and duckdb used for?

"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mAnalysisException\u001b[0m Traceback (most recent call last)",
Contributor:

This shouldn't contain exceptions when committed, imho.
Either commit a clean execution so there is something to render on GitHub (which supports rendering .ipynb files), or strip outputs with https://github.com/kynan/nbstripout before committing.

rockwotj (Contributor) left a comment:

can look after Nicolae's feedback is addressed
