Add ms.hdfs.reader.parse.json.strings for inlining JSON strings #93
Merged
avinas-kumar merged 6 commits into linkedin:master on Apr 30, 2026
HdfsReader.selectFieldsFromGenericRecord serializes Avro string fields as JSON string primitives, which double-encodes pre-serialized JSON payloads (e.g. JSON-LD whose @-prefixed names are not valid Avro identifiers). When the new property is true, string values that parse as JSON objects or arrays are inlined via JsonParser.parseString; parse failures fall back to the existing primitive behavior. Default is false, so existing jobs are unaffected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DIL pins Gson 2.6.2 (gradle/scripts/dependencyDefinitions.gradle:31), which doesn't have the static JsonParser.parseString(String) introduced in Gson 2.8.6. Switch to the instance method new JsonParser().parse(s), which has identical semantics on this version and throws the same JsonSyntaxException on malformed input. Caught by local test run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace new JsonParser().parse(decrypted) with gson.fromJson(decrypted, JsonElement.class). Both calls go through the same Gson parser and throw the same JsonSyntaxException on malformed input, but fromJson matches the style of the adjacent ARRAY/RECORD branches, reuses the existing gson field instead of allocating a JsonParser per record, and avoids the @deprecated marker on JsonParser's instance method in newer Gson versions. All 8 HdfsReaderTest cases continue to pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gautamshanu
reviewed
Apr 27, 2026
gautamshanu
reviewed
Apr 27, 2026
…hema path
Two review fixes from gautamshanu:
1. Add MSTAGE_HDFS_READER_PARSE_JSON_STRINGS to the allProperties list in PropertyCollection so it participates in startup validation alongside every other MSTAGE_* property.
2. Add a test that uses a nullable string schema (Avro UNION ["string", "null"]) to exercise the UNION branch of selectFieldsFromGenericRecord. Existing tests covered Schema.Type.STRING; production traffic for the target use case is UNION-typed, so this widens coverage to that path.
All 9 HdfsReaderTest cases pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gautamshanu
approved these changes
Apr 27, 2026
avinas-kumar
approved these changes
Apr 29, 2026
- Rename `looksLikeJson` to `isValidJson` for clarity.
- Rename parameter `s` to `value` and local `t` to `trimmed`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
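As a rough illustration of the kind of shape check this helper performs, the sketch below tests whether a string is wrapped in `{…}` or `[…]` after trimming. Only the names `isValidJson`, `value`, and `trimmed` come from this commit; the class name `JsonShapeCheck` and the method body are assumptions, and the actual helper in `HdfsReader` may differ.

```java
// Hypothetical sketch: the body is an assumption, not the code merged in this PR.
final class JsonShapeCheck {
    private JsonShapeCheck() { }

    /** Cheap pre-check: true only when the trimmed value is wrapped in {...} or [...]. */
    static boolean isValidJson(String value) {
        if (value == null) {
            return false;
        }
        String trimmed = value.trim();
        if (trimmed.length() < 2) {
            return false;
        }
        char first = trimmed.charAt(0);
        char last = trimmed.charAt(trimmed.length() - 1);
        return (first == '{' && last == '}') || (first == '[' && last == ']');
    }

    public static void main(String[] args) {
        System.out.println(isValidJson("{\"@context\": \"https://schema.org\"}")); // true
        System.out.println(isValidJson("  [1, 2, 3]  "));                          // true
        System.out.println(isValidJson("plain text"));                             // false
        System.out.println(isValidJson("[{incomplete]"));                          // true (shape only)
    }
}
```

Note that `[{incomplete]` passes a shape-only check like this one; per the PR description, such input then fails parsing and falls back to a string primitive.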
The buildscript classpath used `latest.release`, which now resolves to 6.0.4. JFrog 6.0.0+ depends on jackson-databind:2.15.4, transitively pulling jackson-core:2.15.4, a multi-release JAR containing Java 17 (major version 61) class files. Gradle 6.8.1's classpath instrumenter walks every class via a bundled ASM that does not understand Java 17 bytecode and fails with:
    Failed to create Jar file ~/.gradle/caches/jars-8/<hash>/jackson-core-2.15.4.jar
    Caused by: java.lang.IllegalArgumentException: Unsupported class file major version 61
5.2.5 is the last release on jackson-databind:2.14.1, which is Java 8 compatible and configures cleanly under Gradle 6.8.1.
gautamshanu
reviewed
Apr 30, 2026
Problem
`HdfsReader.selectFieldsFromGenericRecord` serializes Avro `string` fields as JSON string primitives via `JsonObject.addProperty`, which double-encodes pre-serialized JSON payloads when the output is used as an outbound HTTP request body. The `ARRAY` and `RECORD` branches already inline via `gson.fromJson(...)` + `jsonObject.add(...)`; the `STRING`/`UNION` branch has no equivalent.
Use case
JSON-LD payloads use `@context`/`@type` field names, which are not valid Avro identifiers, so a structured Avro record cannot represent them. The workaround is to store a pre-serialized JSON-LD document as a `string` field in Avro, but DIL currently escapes such strings when emitting to an HTTP POST body, producing malformed payloads.
Change
Adds a new boolean property `ms.hdfs.reader.parse.json.strings` (default `false`). When enabled, the `STRING`/`UNION` branch of `selectFieldsFromGenericRecord` attempts to parse string values that look like JSON (`{…}` or `[…]`) via `gson.fromJson(..., JsonElement.class)` and inlines the result as a `JsonElement`. Parse failures fall back to the existing string-primitive behavior.
The implementation matches the style of the adjacent `ARRAY` and `RECORD` branches, which already use `gson.fromJson(..., JsonArray.class)` and `gson.fromJson(..., JsonObject.class)`. Using the polymorphic `JsonElement.class` here lets the new branch handle both object-rooted and array-rooted JSON content without pre-classifying the input.
Back-compatibility
Default is `false`; existing jobs are unaffected.
Tests
Added 8 TestNG cases in `HdfsReaderTest` (`cdi-core/src/test/java/com/linkedin/cdi/util/HdfsReaderTest.java`) using `Whitebox.invokeMethod` for the private method, matching the pattern in `AvroExtractorTest`:
Flag on:
- … `JsonObject`
- … `@context`) → inlined as …
- … `JsonArray`
- … `JsonPrimitive`
- … `[{incomplete]`) → falls back to `JsonPrimitive`
- `null` value → `JsonNull`
- … `JsonPrimitive`
Flag off (back-compat gate):
- … `JsonPrimitive` (the regression the flag must not cause)
All 8 tests pass locally on JDK 1.8 + Gradle 6.8.1 (matches CI environment).
Files changed
- `cdi-core/src/main/java/com/linkedin/cdi/configuration/PropertyCollection.java`: register `MSTAGE_HDFS_READER_PARSE_JSON_STRINGS`
- `cdi-core/src/main/java/com/linkedin/cdi/util/HdfsReader.java`: conditional inlining branch + `looksLikeJson` helper
- `cdi-core/src/test/java/com/linkedin/cdi/util/HdfsReaderTest.java`: new test class
- `docs/parameters/ms.hdfs.reader.parse.json.strings.md`: new property doc page
Documentation
New page at docs/parameters/ms.hdfs.reader.parse.json.strings.md describing behavior, default, and an example contrasting flag-off vs flag-on output.
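To make the flag-off vs flag-on contrast concrete, here is a dependency-free sketch of the two output shapes. It is not the DIL implementation: the class `InlineVsEscapeDemo`, the field name `body`, and the toy `quote` escaper are hypothetical, and the flag-on path here splices the string in directly instead of parsing it with Gson and falling back on failure.

```java
// Hypothetical demo of the two output shapes; not the code in HdfsReader.
final class InlineVsEscapeDemo {
    private InlineVsEscapeDemo() { }

    /** Toy JSON string escaper for this demo: handles only backslash and double quote. */
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Renders {"<field>":<value>} with the value escaped (flag off) or inlined (flag on). */
    static String render(String field, String value, boolean parseJsonStrings) {
        String rendered = parseJsonStrings ? value : quote(value);
        return "{" + quote(field) + ":" + rendered + "}";
    }

    public static void main(String[] args) {
        String payload = "{\"@type\":\"Person\"}";
        // Flag off: the payload is emitted as an escaped string primitive.
        System.out.println(render("body", payload, false)); // {"body":"{\"@type\":\"Person\"}"}
        // Flag on: the payload is emitted as a nested JSON object.
        System.out.println(render("body", payload, true));  // {"body":{"@type":"Person"}}
    }
}
```

The flag-off line shows the double-encoding described under Problem: a receiver sees a string that happens to contain JSON, not a JSON object.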
E2E test validation
Opened a PR in guest-workflows-spark for a snapshot test that points to the snapshot version of dil-internal and passes the new flag as true:
https://github.com/linkedin-multiproduct/guest-workflows-spark/pull/362
- Verified that the DAG ran successfully in guest-workflows-spark with the new flag set to true.
- Confirmed with the Google POC that they can see the JSON object from the snapshot test.