Normalize write schema casing to table casing at catalog layer (Make OH reads/writes case-insensitive) #558
Merged
cbb330 merged 4 commits into linkedin:main on May 2, 2026
Conversation
Writers (DaliSpark, Spark SQL, Trino DML, Java Iceberg API) may submit
column names with different casing than what the table stores (e.g.
"id" vs "ID"). Because validateWriteSchema is case-sensitive, these
writes were rejected with InvalidSchemaEvolutionException even when the
field IDs matched, making case-insensitive writes impossible without
changing the table's existing column names.
Fix: in doUpdateSchemaIfNeeded, normalize the write schema's top-level
column names to match the table schema's casing (matched by Iceberg
field ID, not by name) before any comparison or storage. The table's
existing casing is never mutated, and writers do not need to know or
match the exact casing stored in the catalog.
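A minimal sketch of that top-level normalization, assuming only Iceberg's public Schema/Types API; the method name normalizeTopLevelCasing is illustrative, not the actual helper inside doUpdateSchemaIfNeeded:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

public class CasingNormalizationSketch {
  // Rename top-level write columns to the table's stored casing, matching by field ID.
  static Schema normalizeTopLevelCasing(Schema writeSchema, Schema tableSchema) {
    List<Types.NestedField> normalized = new ArrayList<>();
    for (Types.NestedField writeField : writeSchema.columns()) {
      // Match by Iceberg field ID, never by name.
      Types.NestedField tableField = tableSchema.findField(writeField.fieldId());
      if (tableField == null) {
        // New column: pass the writer's casing through unchanged.
        normalized.add(writeField);
      } else {
        // Existing column: adopt the table's casing; keep id, type, optionality, and doc.
        normalized.add(
            Types.NestedField.of(
                writeField.fieldId(),
                writeField.isOptional(),
                tableField.name(),
                writeField.type(),
                writeField.doc()));
      }
    }
    return new Schema(normalized);
  }
}
```

The later commits in this PR replace this top-level loop with a TypeUtil.SchemaVisitor so that nested structs, list elements, and map values are normalized as well.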
Tables where two or more top-level columns share the same case-folded
name (e.g. "id" and "ID") are excluded from normalization because the
target column would be ambiguous, so writes to such tables must still
use exact casing.
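The ambiguity guard can be as simple as a case-folded set check over the top-level columns; a hedged sketch (helper name hypothetical):

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

public class CaseDuplicateGuardSketch {
  // True if two or more top-level columns collapse to the same name once case-folded.
  static boolean hasTopLevelCaseDuplicates(Schema schema) {
    Set<String> seen = new HashSet<>();
    for (Types.NestedField field : schema.columns()) {
      // Locale.ROOT keeps the folding locale-independent (e.g. the Turkish dotless i).
      if (!seen.add(field.name().toLowerCase(Locale.ROOT))) {
        return true; // e.g. "id" and "ID" collide, so normalization is skipped
      }
    }
    return false;
  }
}
```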
Testing:
- BaseIcebergSchemaValidatorTest: 10 unit tests covering
hasCaseDuplicateFields and normalizeSchemaCasingToTable in isolation,
including new-column passthrough and field attribute preservation.
- RepositoryTest (H2 Spring integration): 2 new tests exercising the
full save() path through the Spring context:
- testCaseInsensitiveWrite_succeeds_andPreservesTableCasing: creates
a table with "ID", saves an update with "id", asserts no exception
and that the stored schema still shows "ID".
- testCaseInsensitiveWrite_blockedForCaseDuplicateTable: creates a
table with case-duplicate columns, asserts that a mismatched-casing
save still throws InvalidSchemaEvolutionException.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
cbb330 (Collaborator) requested changes on Apr 24, 2026 and left a comment:
Previously some tests were run which showed that writes with different casing already succeed, so I want to validate the existing behavior before applying a fix. To do that, can you reproduce the issue in a separate PR under the Spark e2e tests, similar to this:
cbb330 (Contributor, Author) reviewed on Apr 27, 2026
…isit

Replace the top-level-only loop implementations of hasCaseDuplicateFields and normalizeSchemaCasingToTable with TypeUtil.SchemaVisitor-based implementations that recurse through structs, list elements, and map key/value types at any depth.

Key changes:
- hasCaseDuplicateFields: uses TypeUtil.visit to detect case-duplicate sibling fields at any nesting level; correctly treats same-named fields in different structs as non-duplicates; adds Locale.ROOT to toLowerCase
- normalizeSchemaCasingToTable: uses TypeUtil.indexById to build a single flat Map<Integer, NestedField> for the entire table schema in one O(n) walk, then TypeUtil.visit to rewrite field names at any depth; uses a reference-equality short-circuit to avoid allocation for unchanged fields
- Both methods use Iceberg's visitor contract so new Type variants cause a compile error rather than a silent passthrough

Add tests covering nested struct, list element, and map value normalization as well as the hasCaseDuplicateFields depth cases.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
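A rough sketch of what such a visitor-based duplicate check might look like, assuming only Iceberg's public TypeUtil.SchemaVisitor contract; the class name and structure here are illustrative, not the PR's actual code:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Type;
import org.apache.iceberg.types.TypeUtil;
import org.apache.iceberg.types.Types;

// Flags case-duplicate *sibling* fields at any depth: fields collide only if
// they share the same parent struct, so "a.id" and "b.ID" are not duplicates.
public class CaseDuplicateVisitorSketch extends TypeUtil.SchemaVisitor<Boolean> {

  @Override
  public Boolean schema(Schema schema, Boolean structResult) {
    return structResult;
  }

  @Override
  public Boolean struct(Types.StructType struct, List<Boolean> fieldResults) {
    Set<String> seen = new HashSet<>();
    for (Types.NestedField field : struct.fields()) {
      if (!seen.add(field.name().toLowerCase(Locale.ROOT))) {
        return true; // two siblings collapse to the same case-folded name
      }
    }
    // No collision among these siblings; report any collision found deeper down.
    return fieldResults.stream().anyMatch(Boolean.TRUE::equals);
  }

  @Override
  public Boolean field(Types.NestedField field, Boolean fieldResult) {
    return fieldResult;
  }

  @Override
  public Boolean list(Types.ListType list, Boolean elementResult) {
    return elementResult;
  }

  @Override
  public Boolean map(Types.MapType map, Boolean keyResult, Boolean valueResult) {
    return Boolean.TRUE.equals(keyResult) || Boolean.TRUE.equals(valueResult);
  }

  @Override
  public Boolean primitive(Type.PrimitiveType primitive) {
    return false;
  }
}
```

A caller would invoke it along the lines of TypeUtil.visit(schema, new CaseDuplicateVisitorSketch()).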
Introduce SchemaValidationUtil in the open-source repo so both open-source BaseIcebergSchemaValidator and li-openhouse share one duplicate-detection implementation instead of maintaining divergent subsets.

Changes:
- New SchemaValidationUtil with findDuplicateCaseInsensitiveColumnNames (returns conflict paths) and hasDuplicateCaseInsensitiveColumnNames (boolean), covering struct fields at any depth, list element types, and map key/value types; uses Locale.ROOT to fix the Turkish-i locale bug
- Remove hasCaseDuplicateFields from BaseIcebergSchemaValidator; the normalization guard in OpenHouseInternalRepositoryImpl now calls SchemaValidationUtil.hasDuplicateCaseInsensitiveColumnNames directly
- Move hasCaseDuplicateFields tests to SchemaValidationUtilTest; update the two integration-test precondition assertions in BaseIcebergSchemaValidatorTest to use the new utility

The two callers retain opposite semantics from the same predicate: li-openhouse rejects writes when duplicates exist, while the normalization guard merely skips normalization (exact-casing writes may still succeed).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three gaps identified in the E2E and unit test suites:

- Path B (RepositoryTest): writer submits wrong casing on an existing column AND adds a new column in the same write. Normalization fixes the existing column; validateWriteSchema then sees a valid evolution and must accept it. This is the realistic migration scenario where a Spark/Trino client sends a schema update alongside a column addition.
- Path D (RepositoryTest): table has case-duplicate columns (guard skips normalization). A write with exactly matching casing must still succeed (sameSchema = true). Verifies the guard does not break legitimate writes to legacy case-duplicate tables.
- Unit (BaseIcebergSchemaValidatorTest): same as Path B but at the unit level: normalize "id" → "ID", then confirm validateWriteSchema accepts the resulting evolution (new column appended, existing IDs intact).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
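For readers less familiar with Iceberg's schema API, the Path B fixtures might be constructed roughly as below. The field IDs, column names, and types are invented for illustration; the real fixtures live in RepositoryTest and BaseIcebergSchemaValidatorTest.

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

public class PathBFixtureSketch {
  // Existing table: a single column stored with upper-case "ID".
  static final Schema TABLE_SCHEMA =
      new Schema(Types.NestedField.required(1, "ID", Types.LongType.get()));

  // Incoming write: wrong casing on the existing column (same field ID 1)
  // plus a brand-new column appended in the same write.
  static final Schema WRITE_SCHEMA =
      new Schema(
          Types.NestedField.required(1, "id", Types.LongType.get()),
          Types.NestedField.optional(2, "newCol", Types.StringType.get()));

  // What validateWriteSchema should effectively see after normalization:
  // the existing column back in the table's casing, the new column untouched.
  static final Schema EXPECTED_AFTER_NORMALIZATION =
      new Schema(
          Types.NestedField.required(1, "ID", Types.LongType.get()),
          Types.NestedField.optional(2, "newCol", Types.StringType.get()));
}
```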
cbb330 approved these changes on May 2, 2026