Add avro schema passthrough by jogrogan · Pull Request #211 · linkedin/Hoptimator

jogrogan · 2026-04-22T21:50:26Z

Summary

For the record, I don't love this solution, but I explain below why it seems necessary.

This change exposes the source-of-truth Avro schemas from Hoptimator tables, replacing lossy RelDataType round-tripping for the consumers that care about Avro-level fidelity.

Adds a new AvroSchemaProvider interface that Calcite Tables can implement to surface their key and value Avro schemas directly. Three consumers are wired up to use it, each falling back to the existing synthesis behavior when a provider isn't available:

- K8sConnector: new {{avroValueSchema}} template variable for connector payload options (e.g. a Flink connector's default.mode.payload). Renders the upstream's native value schema when present; synthesizes from the flat row type otherwise.
- HoptimatorConnection.resolve() / !resolve CLI: returns the merged view (value + KEY_-prefixed key fields) so SQL queries and the Flink catalog see keys as columns.
- HoptimatorJdbcTable: implements the interface by peeking through the upstream JDBC DataSource to the underlying Calcite table.

VeniceStore is the first provider implementation: valueSchema() returns the raw value schema from StoreSchemaFetcher, keySchema() returns the raw key schema.

Why

The round trip AvroConverter.rel → DataTypeUtils.flatten → JDBC → DataTypeUtils.unflatten → AvroConverter.avro strips too much: namespaces get synthesized from field paths, nested record identities are invented per call site, reused record definitions get duplicated, and field-level props, defaults, and aliases are dropped. None of that survives for consumers that want to hand the Avro schema to another Avro-aware system. Internally, we run into schema incompatibility issues with Flink: Flink connectors and the Proteus Flink catalog need to know the table's real Avro schema. This change gives those consumers a direct path to the source schema while leaving the RelDataType round trip in place for the SQL query layer, which depends on the flattening.

Now, why I don't love this: the change has to cross the JDBC boundary, which now provides a backdoor into the underlying drivers. That raises the question: do we really need those drivers?

Changes

New — hoptimator-avro

- AvroSchemaProvider interface with valueSchema() (the payload) and keySchema() (the key, or null when the table has no distinct key concept).
- AvroSchemas utility class with:
  - KEY_PREFIX / PRIMITIVE_KEY_NAME constants (the Hoptimator-wide convention for merging keys into value schemas).
  - cloneField(name, Schema.Field): produces an unowned Schema.Field clone preserving name, type reference, doc, default, sort order, aliases, and custom properties (Avro rejects Fields already owned by another record via its internal position guard).
  - mergeKeyIntoValue(keySchema, valueSchema, keyPrefix, primitiveKeyName): produces a merged record inheriting the value schema's identity (namespace, name, doc, aliases, record-level props). Struct keys contribute prefixed fields; primitive keys contribute a single named field.
  - mergedAvroSchemaFor(AvroSchemaProvider): applies the Hoptimator convention (KEY_ prefix / KEY field). Used by resolve().
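For readers who want a feel for the merge convention, here is a minimal sketch. It is not the code in this PR: it uses toy `Schema` and `Field` classes instead of Avro's, keeps only names and type names, and assumes value fields come first in the merged record. The `KEY_` prefix and `KEY` field name come from the description above; everything else (class names, `mergedSchemaFor`, the example record names) is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MergeKeySketch {
    // Values taken from the PR text ("KEY_ prefix / KEY field").
    static final String KEY_PREFIX = "KEY_";
    static final String PRIMITIVE_KEY_NAME = "KEY";

    /** Toy stand-in for an Avro field: a name plus a type name. */
    static final class Field {
        final String name;
        final String type;
        Field(String name, String type) { this.name = name; this.type = type; }
    }

    /** Toy stand-in for an Avro schema. A primitive schema has fields == null. */
    static final class Schema {
        final String fullName;    // e.g. "com.example.Member" or "string"
        final List<Field> fields; // null for primitives
        Schema(String fullName, List<Field> fields) {
            this.fullName = fullName;
            this.fields = fields;
        }
        boolean isRecord() { return fields != null; }
        List<String> fieldNames() {
            List<String> names = new ArrayList<>();
            for (Field f : fields) names.add(f.name);
            return names;
        }
    }

    /** Merges key fields into a copy of the value record, keeping the value's identity. */
    static Schema mergeKeyIntoValue(Schema key, Schema value, String keyPrefix, String primitiveKeyName) {
        if (!value.isRecord()) {
            throw new IllegalArgumentException("value schema must be a record: " + value.fullName);
        }
        // Copy each field instead of sharing instances, loosely mirroring cloneField.
        List<Field> merged = new ArrayList<>();
        for (Field f : value.fields) merged.add(new Field(f.name, f.type));
        if (key.isRecord()) {
            // Struct key: every key field is added under the prefix.
            for (Field f : key.fields) merged.add(new Field(keyPrefix + f.name, f.type));
        } else {
            // Primitive key: a single field with the fixed name.
            merged.add(new Field(primitiveKeyName, key.fullName));
        }
        return new Schema(value.fullName, merged);
    }

    /** Applies the KEY_ / KEY convention; a null key short-circuits to the value schema. */
    static Schema mergedSchemaFor(Schema key, Schema value) {
        return key == null ? value : mergeKeyIntoValue(key, value, KEY_PREFIX, PRIMITIVE_KEY_NAME);
    }

    public static void main(String[] args) {
        Schema value = new Schema("com.example.Member", Arrays.asList(new Field("name", "string")));
        Schema structKey = new Schema("com.example.MemberKey", Arrays.asList(new Field("id", "long")));
        Schema primitiveKey = new Schema("string", null);

        System.out.println(mergedSchemaFor(structKey, value).fieldNames());    // prints [name, KEY_id]
        System.out.println(mergedSchemaFor(primitiveKey, value).fieldNames()); // prints [name, KEY]
        System.out.println(mergedSchemaFor(null, value) == value);             // prints true
    }
}
```

The merged record keeps the value schema's full name, which is the part of the design that lets downstream Avro-aware systems see the source identity rather than a synthesized one.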

Implementations

VeniceStore implements AvroSchemaProvider.

Consumers

- K8sConnector: new {{avroValueSchema}} template variable. Prefers provider-supplied value schema; falls back to AvroConverter.avro(rowType).
- HoptimatorConnection.resolve(): uses AvroSchemas.mergedAvroSchemaFor when a provider is available; existing synthesis otherwise.
- HoptimatorJdbcTable: implements AvroSchemaProvider by unwrapping the upstream JDBC DataSource to a CalciteConnection, walking to the real upstream table, and delegating.
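The provider-or-fallback pattern the consumers share could look roughly like the sketch below. This is an illustration under assumptions, not the PR's code: the interface is a simplified stand-in that returns schema JSON as strings, the `avroValueSchema` and `render` helpers are hypothetical, and treating a null value schema as "no provider" is a guess at how a delegating table with no upstream would behave.

```java
import java.util.function.Supplier;

public class SchemaFallbackSketch {
    /** Simplified stand-in for the PR's interface; schemas are JSON strings here. */
    interface AvroSchemaProvider {
        String valueSchema();
        String keySchema(); // null when the table has no distinct key
    }

    /** Prefers the provider's value schema; otherwise calls the synthesis fallback. */
    static String avroValueSchema(Object table, Supplier<String> synthesized) {
        if (table instanceof AvroSchemaProvider) {
            String schema = ((AvroSchemaProvider) table).valueSchema();
            if (schema != null) {
                return schema;
            }
        }
        // Assumption: non-providers and null results use the synthesized schema.
        return synthesized.get();
    }

    /** Substitutes the {{avroValueSchema}} variable in a template string. */
    static String render(String template, String schema) {
        return template.replace("{{avroValueSchema}}", schema);
    }

    public static void main(String[] args) {
        AvroSchemaProvider provider = new AvroSchemaProvider() {
            @Override public String valueSchema() { return "{\"type\":\"string\"}"; }
            @Override public String keySchema() { return null; }
        };
        Supplier<String> fallback = () -> "{\"type\":\"bytes\"}";

        // Provider path: the source schema is rendered as-is.
        System.out.println(render("payload={{avroValueSchema}}", avroValueSchema(provider, fallback)));
        // Fallback path: a plain Object is not a provider, so the synthesized schema is used.
        System.out.println(render("payload={{avroValueSchema}}", avroValueSchema(new Object(), fallback)));
    }
}
```

Passing the fallback as a `Supplier` means the synthesis step only runs when no provider schema is available.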

Behavior changes

- {{avroValueSchema}} template variable: new. No existing templates referenced it.
- HoptimatorConnection.resolve() / !resolve CLI: when the resolved table implements AvroSchemaProvider, the returned schema is the merged view from the source (with source-level namespaces, nested record identities, and properties preserved) rather than the synthesized one. The shape is the same (value fields + KEY_-prefixed key fields); only the fidelity improves. Falls back to existing behavior for non-providers.

Test plan

- ./gradlew build green locally (all 195 tasks).
- AvroSchemasTest (10 tests): covers cloneField (position guard, doc/default/order/aliases/props preservation), mergeKeyIntoValue (struct key, primitive key, nested-record namespace preservation, reused-record single-definition serialization, record-level identity/props inheritance, parse round-trip, non-record-value error), and mergedAvroSchemaFor (no-key short-circuit, key-present merge).
- VeniceStoreTest: new tests for valueSchema() returning the raw payload, keySchema() returning raw key (struct and primitive), valueSchemaId path, end-to-end mergedAvroSchemaFor integration.
- HoptimatorConnectionTest: 4 new tests for providerSchemaAt (unknown path, non-provider, no-key provider, struct-key and primitive-key merged output).
- HoptimatorJdbcTableTest: delegation tests for valueSchema / keySchema (null upstream, non-provider upstream, dual delegation).
- K8sConnectorTest: provider-path and fallback-path tests for {{avroValueSchema}} template rendering.

github-actions · 2026-04-22T21:58:43Z

Code Coverage

Overall Project	84.28% `-0.22%`	🟢
Files changed	84.3%	🟢

File	Coverage
AvroSchemas.java	100%	🟢
VeniceStore.java	100%	🟢
KafkaTopic.java	100%	🟢
AvroConverter.java	97.41%	🟢
K8sConnector.java	97.03% `-1.08%`	🟢
MySqlDeployer.java	96.84%	🟢
HoptimatorConnection.java	89.59% `-0.55%`	🟢
HoptimatorJdbcTable.java	42.48% `-49.67%`	❌
AvroSchemaSource.java	0%	❌

ryannedolan · 2026-04-30T18:42:57Z

Don't love "Provider" being overloaded here, but otherwise lgtm.

jogrogan · 2026-05-01T14:38:05Z

> Don't love "Provider" being overloaded here, but otherwise lgtm.

@ryannedolan Renamed AvroSchemaProvider -> AvroSchemaSource, let me know what you think

Add avro schema passthrough

827706f

Merge branch 'main' into jogrogan/avroSchemaGeneration

16ef233

Rename AvroSchemaProvider

7bb3583

Merge branch 'main' into jogrogan/avroSchemaGeneration

fcc49fd

ryannedolan approved these changes May 1, 2026


jogrogan merged commit dd0cdaf into main May 1, 2026
1 check passed

jogrogan deleted the jogrogan/avroSchemaGeneration branch May 1, 2026 15:27
