Add avro schema passthrough #211

Merged
jogrogan merged 4 commits into main from jogrogan/avroSchemaGeneration on May 1, 2026

@jogrogan
Collaborator

Summary

For the record I don't love this solution... but I'll explain why it seems necessary down below...

This change exposes the source-of-truth Avro schemas from Hoptimator tables, replacing lossy RelDataType round-tripping for the consumers that care about Avro-level fidelity.

Adds a new AvroSchemaProvider interface that Calcite Tables can implement to surface their key and value Avro schemas directly. Three consumers are wired up to use it, each falling back to the existing synthesis behavior when a provider isn't available:

  • K8sConnector — new {{avroValueSchema}} template variable for connector payload options (e.g. a Flink connector's default.mode.payload). Renders the upstream's native value schema when present; synthesizes from the flat row type otherwise.
  • HoptimatorConnection.resolve() / !resolve CLI — returns the merged view (value + KEY_-prefixed key fields) so SQL queries and the Flink catalog see keys as columns.
  • HoptimatorJdbcTable implements the interface by peeking through the upstream JDBC DataSource to the underlying Calcite table.

VeniceStore is the first provider implementation: valueSchema() returns the raw value schema from StoreSchemaFetcher, keySchema() returns the raw key schema.
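
A rough sketch of the provider-with-fallback pattern described above, not the actual Hoptimator code. Schemas are modeled as JSON strings rather than Avro Schema objects so the example runs on the JDK alone. `ProviderSketch`, `StoreTableSketch`, `FallbackSketch`, and `resolveValueSchema` are hypothetical names, and treating a null schema as "fall back" is an assumption.

```java
import java.util.function.Supplier;

// Stand-in for the AvroSchemaProvider interface described above.
// Assumption: the real methods return Avro Schema objects; JSON strings keep this JDK-only.
interface ProviderSketch {
    String valueSchema(); // the payload schema
    String keySchema();   // the key schema, or null when the table has no distinct key
}

// Hypothetical table that exposes its source schemas, loosely analogous to VeniceStore.
final class StoreTableSketch implements ProviderSketch {
    private final String key;
    private final String value;

    StoreTableSketch(String key, String value) {
        this.key = key;
        this.value = value;
    }

    @Override public String valueSchema() { return value; }
    @Override public String keySchema() { return key; }
}

final class FallbackSketch {
    // Prefer the provider's schema; otherwise run the caller-supplied synthesis.
    // Assumption: a null valueSchema() is treated the same as "no provider".
    static String resolveValueSchema(Object table, Supplier<String> synthesize) {
        if (table instanceof ProviderSketch) {
            String fromSource = ((ProviderSketch) table).valueSchema();
            if (fromSource != null) {
                return fromSource;
            }
        }
        return synthesize.get();
    }

    public static void main(String[] args) {
        Supplier<String> synthesized =
            () -> "{\"type\":\"record\",\"name\":\"Synthesized\",\"fields\":[]}";
        Object provider = new StoreTableSketch(
            "\"string\"",
            "{\"type\":\"record\",\"name\":\"Member\",\"namespace\":\"com.example\",\"fields\":[]}");
        Object plain = new Object();

        System.out.println(resolveValueSchema(provider, synthesized)); // source schema
        System.out.println(resolveValueSchema(plain, synthesized));    // synthesized schema
    }
}
```

The point of the shape is that consumers never need to know which tables are providers: the check and the fallback live in one place.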

Why

The round trip AvroConverter.rel → DataTypeUtils.flatten → JDBC → DataTypeUtils.unflatten → AvroConverter.avro strips too much: namespaces get synthesized from field paths, nested record identities are invented per call site, reused record definitions get duplicated, and field-level props/defaults/aliases are dropped. For consumers that want to hand the Avro schema to another Avro-aware system, none of that survives.

Internally we run into schema incompatibility issues when dealing with Flink: Flink connectors and the Proteus Flink catalog need to be aware of the real table Avro schema. This change gives those consumers a direct path to the source schema while leaving the RelDataType round-trip in place for the SQL query layer that depends on the flattening.

Now, why I don't love this: strictly because we need to cross the JDBC boundary, which now provides a backdoor into the underlying drivers. That raises the question: do we really need those drivers?

Changes

New — hoptimator-avro

  • AvroSchemaProvider interface with valueSchema() (the payload) and keySchema() (the key, or null when the table has no distinct key concept).
  • AvroSchemas utility class with:
    • KEY_PREFIX / PRIMITIVE_KEY_NAME constants (the Hoptimator-wide convention for merging keys into value schemas).
    • cloneField(name, Schema.Field) — produces an unowned Schema.Field clone preserving name, type reference, doc, default, sort order, aliases, and custom properties (Avro rejects Fields already owned by another record via its internal position guard).
    • mergeKeyIntoValue(keySchema, valueSchema, keyPrefix, primitiveKeyName) — produces a merged record inheriting the value schema's identity (namespace, name, doc, aliases, record-level props). Struct keys contribute prefixed fields; primitive keys contribute a single named field.
    • mergedAvroSchemaFor(AvroSchemaProvider) — applies the Hoptimator convention (KEY_ prefix / KEY field). Used by resolve().
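
To make the merge convention above concrete, here is a simplified sketch. It is not the AvroSchemas implementation: records are modeled as ordered name-to-type maps instead of Avro schemas, so identity, docs, defaults, and props are out of scope. `MergeSketch` and its methods are hypothetical. Placing key fields after value fields and rejecting name collisions are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MergeSketch {
    // Convention named in the description: KEY_ prefix for struct keys, KEY for primitive keys.
    static final String KEY_PREFIX = "KEY_";
    static final String PRIMITIVE_KEY_NAME = "KEY";

    // Struct key: every key field is added to the value fields under a prefixed name.
    static Map<String, String> mergeStructKey(Map<String, String> keyFields,
                                              Map<String, String> valueFields) {
        Map<String, String> merged = new LinkedHashMap<>(valueFields);
        for (Map.Entry<String, String> e : keyFields.entrySet()) {
            addField(merged, KEY_PREFIX + e.getKey(), e.getValue());
        }
        return merged;
    }

    // Primitive key: the key contributes a single named field.
    static Map<String, String> mergePrimitiveKey(String keyType,
                                                 Map<String, String> valueFields) {
        Map<String, String> merged = new LinkedHashMap<>(valueFields);
        addField(merged, PRIMITIVE_KEY_NAME, keyType);
        return merged;
    }

    // Assumption: a collision with an existing value field is an error in this sketch.
    private static void addField(Map<String, String> fields, String name, String type) {
        if (fields.putIfAbsent(name, type) != null) {
            throw new IllegalArgumentException("Duplicate field name: " + name);
        }
    }

    public static void main(String[] args) {
        Map<String, String> value = new LinkedHashMap<>();
        value.put("name", "string");
        value.put("age", "int");
        Map<String, String> key = new LinkedHashMap<>();
        key.put("id", "long");

        System.out.println(mergeStructKey(key, value).keySet());      // [name, age, KEY_id]
        System.out.println(mergePrimitiveKey("string", value).keySet()); // [name, age, KEY]
    }
}
```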

Implementations

  • VeniceStore implements AvroSchemaProvider.

Consumers

  • K8sConnector: new {{avroValueSchema}} template variable. Prefers provider-supplied value schema; falls back to AvroConverter.avro(rowType).
  • HoptimatorConnection.resolve(): uses AvroSchemas.mergedAvroSchemaFor when a provider is available; existing synthesis otherwise.
  • HoptimatorJdbcTable: implements AvroSchemaProvider by unwrapping the upstream JDBC DataSource to a CalciteConnection, walking to the real upstream table, and delegating.
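
The JDBC "peek through" can be sketched with the standard java.sql.Wrapper contract, which both javax.sql.DataSource and java.sql.Connection extend. This is an illustration of the unwrap-and-delegate idea only. `UpstreamSketch`, `WrappingSourceSketch`, and `UnwrapSketch` are hypothetical stand-ins, and returning null when the upstream is missing or is not a provider is an assumption.

```java
import java.sql.SQLException;
import java.sql.Wrapper;

// Hypothetical upstream table that knows its own value schema (as a JSON string).
final class UpstreamSketch {
    final String valueSchemaJson;

    UpstreamSketch(String valueSchemaJson) {
        this.valueSchemaJson = valueSchemaJson;
    }
}

// Hypothetical JDBC-side handle; real code would unwrap a DataSource or Connection.
final class WrappingSourceSketch implements Wrapper {
    private final Object delegate;

    WrappingSourceSketch(Object delegate) {
        this.delegate = delegate;
    }

    @Override
    public <T> T unwrap(Class<T> iface) throws SQLException {
        if (iface.isInstance(delegate)) {
            return iface.cast(delegate);
        }
        throw new SQLException("Not a wrapper for " + iface.getName());
    }

    @Override
    public boolean isWrapperFor(Class<?> iface) {
        return iface.isInstance(delegate);
    }
}

final class UnwrapSketch {
    // Returns the upstream schema, or null so the caller can fall back to synthesis.
    static String valueSchemaOrNull(Wrapper source) {
        try {
            if (source == null || !source.isWrapperFor(UpstreamSketch.class)) {
                return null;
            }
            return source.unwrap(UpstreamSketch.class).valueSchemaJson;
        } catch (SQLException e) {
            return null; // assumption: unwrap failures degrade to the fallback path
        }
    }

    public static void main(String[] args) {
        Wrapper provider = new WrappingSourceSketch(new UpstreamSketch("{\"type\":\"string\"}"));
        Wrapper other = new WrappingSourceSketch("not a table");
        System.out.println(valueSchemaOrNull(provider));
        System.out.println(valueSchemaOrNull(other));
    }
}
```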

Behavior changes

  • {{avroValueSchema}} template variable: new. No existing templates referenced it.
  • HoptimatorConnection.resolve() / !resolve CLI: when the resolved table implements AvroSchemaProvider, the returned schema is the merged view from the source (with source-level namespaces, nested record identities, and properties preserved) rather than the synthesized one. Shape is the same (value fields + KEY_-prefixed key fields), only the fidelity improves. Falls back to existing behavior for non-providers.

Test plan

  • ./gradlew build green locally (all 195 tasks).
  • AvroSchemasTest (10 tests) — covers cloneField (position guard, doc/default/order/aliases/props preservation), mergeKeyIntoValue (struct key, primitive key, nested-record namespace preservation, reused-record single-definition serialization, record-level identity/props inheritance, parse round-trip, non-record-value error), and mergedAvroSchemaFor (no-key short-circuit, key-present merge).
  • VeniceStoreTest — new tests for valueSchema() returning the raw payload, keySchema() returning raw key (struct and primitive), valueSchemaId path, end-to-end mergedAvroSchemaFor integration.
  • HoptimatorConnectionTest — 4 new tests for providerSchemaAt (unknown path, non-provider, no-key provider, struct-key and primitive-key merged output).
  • HoptimatorJdbcTableTest — delegation tests for valueSchema / keySchema (null upstream, non-provider upstream, dual delegation).
  • K8sConnectorTest — provider-path and fallback-path tests for {{avroValueSchema}} template rendering.

@github-actions

github-actions Bot commented Apr 22, 2026

Code Coverage

Overall Project 84.28% -0.22% 🟢
Files changed 84.3% 🟢

File Coverage
AvroSchemas.java 100% 🟢
VeniceStore.java 100% 🟢
KafkaTopic.java 100% 🟢
AvroConverter.java 97.41% 🟢
K8sConnector.java 97.03% -1.08% 🟢
MySqlDeployer.java 96.84% 🟢
HoptimatorConnection.java 89.59% -0.55% 🟢
HoptimatorJdbcTable.java 42.48% -49.67%
AvroSchemaSource.java 0%

@ryannedolan
Collaborator

Don't love "Provider" being overloaded here, but otherwise lgtm.

@jogrogan
Collaborator Author

jogrogan commented May 1, 2026

> Don't love "Provider" being overloaded here, but otherwise lgtm.

@ryannedolan Renamed AvroSchemaProvider -> AvroSchemaSource, let me know what you think

@jogrogan jogrogan merged commit dd0cdaf into main May 1, 2026
1 check passed
@jogrogan jogrogan deleted the jogrogan/avroSchemaGeneration branch May 1, 2026 15:27

2 participants