@manu-sj manu-sj commented Nov 7, 2025

Issue:
Spark fails with org.apache.spark.sql.avro.IncompatibleSchemaException: Attempting to treat union as a RECORD, but it was: UNION when trying to deserialize data with an Avro schema of the form '["null", {"type": "record", "name": "S_col_object_dict_", "fields": [{"name": "k0", "type": ["null", "long"]}, {"name": "k1", "type": ["null", "long"]}]}]'

Root Cause:
It seems that Spark always expects the top-level schema (jsonFormatSchema) passed to the from_avro function to be a record whenever it infers the type as a StructType.
From looking at the code, it seems that Spark always treats a union of null and a record as a StructType, and for a StructType Spark uses the getRecordWriter function, which expects the passed Avro type to be a record and throws the exception when it is not.

Fix Done:
Wrap the union containing the struct in a record so that Spark can deserialize it. This yields a deserialized struct nested inside a struct, which is then unnested to get the actual struct.
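
The wrapping step can be sketched roughly as below. This is a minimal illustration, not the code merged in this PR: the helper name `wrap_union_in_record`, the wrapper record name `wrapper`, and the field name `value` are all hypothetical. It relies on the fact that Avro encodes a record as the concatenation of its field encodings, so a record with a single union field should have the same binary form as the bare union.

```python
import json


def wrap_union_in_record(avro_schema_json, wrapper_name="wrapper", field_name="value"):
    """Wrap a top-level union that contains a record in a single-field record.

    Hypothetical helper for illustration only. Schemas that are not a union
    containing a record are returned unchanged.
    """
    schema = json.loads(avro_schema_json)

    # A top-level Avro union is represented as a JSON array of branch schemas.
    is_union_with_record = isinstance(schema, list) and any(
        isinstance(branch, dict) and branch.get("type") == "record"
        for branch in schema
    )
    if not is_union_with_record:
        return avro_schema_json

    # The original union becomes the type of the wrapper record's only field.
    wrapped = {
        "type": "record",
        "name": wrapper_name,
        "fields": [{"name": field_name, "type": schema}],
    }
    return json.dumps(wrapped)


# Sketch of how the wrapped schema might be used with PySpark (not executed here;
# assumes a DataFrame `df` with a binary column named "col"):
#
#   from pyspark.sql.avro.functions import from_avro
#   decoded = from_avro(df["col"], wrap_union_in_record(schema_json))
#   df = df.withColumn("col", decoded.getField("value"))  # unnest the inner struct
```

The `getField("value")` step in the commented sketch corresponds to the unnesting described above: it discards the outer wrapper struct and keeps only the original struct.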

JIRA Issue: https://hopsworks.atlassian.net/browse/FSTORE-1897

Priority for Review: -

Related PRs: -

How Has This Been Tested?

  • Unit Tests
  • Integration Tests
  • Manual Tests on VM

Checklist For The Assigned Reviewer:

- [ ] Checked if merge conflicts with master exist
- [ ] Checked if stylechecks for Java and Python pass
- [ ] Checked if all docstrings were added and/or updated appropriately
- [ ] Ran spellcheck on docstring
- [ ] Checked if guides & concepts need to be updated
- [ ] Checked if naming conventions for parameters and variables were followed
- [ ] Checked if private methods are properly declared and used
- [ ] Checked if hard-to-understand areas of code are commented
- [ ] Checked if tests are effective
- [ ] Built and deployed changes on dev VM and tested manually
- [x] (Checked if all type annotations were added and/or updated appropriately)

@manu-sj manu-sj merged commit 4d4a30b into logicalclocks:main Nov 10, 2025
17 checks passed
jimdowling pushed a commit to jimdowling/hopsworks-api that referenced this pull request Nov 15, 2025
…ntains a struct (logicalclocks#715)

* initial changes to fix deserilizing structs using spark

* cleaning up a little bit

* adding comments

* adding test for deserializing structs