Conversation

@alamb alamb commented Sep 19, 2025

Which issue does this PR close?

Rationale for this change

This is needed to read/write the Variant Parquet logical type and to interoperate with the rest of the Arrow ecosystem.

Note: this is broken out of the larger PR here:

We need a way to write Variant encoded data to/from Parquet, and the current implementation of VariantArray doesn't work (it panics when writing to Parquet). See the tickets above.

Instead of an `impl Array`, it seems the better way to do this is an Arrow extension type. See #8319 (comment) for more details.

What changes are included in this PR?

  1. remove the Array impl for VariantArray, which forces explicit conversions back/forth when reading/writing
  2. remove the Array impl for ShreddedVariantFieldArray, which forces explicit conversions back/forth when reading/writing
  3. Add VariantType extension type

Are these changes tested?

Yes, with new code and tests

Are there any user-facing changes?

Yes, but this is not yet stable / released, so these changes have no impact on the releasability of this code

@github-actions bot added labels parquet (Changes to the parquet crate) and parquet-variant (parquet-variant* crates) Sep 19, 2025
use std::any::Any;
use std::sync::Arc;

/// Arrow Variant [`ExtensionType`].
Contributor Author (alamb):

Here is the new canonical variant type

}

/// An array of Parquet [`Variant`] values
///
Contributor Author (alamb):

I tried to update these docstrings to explain what is going on here and how to use the extension types

pub fn try_new(inner: &dyn Array) -> Result<Self, ArrowError> {
// Workaround lack of support for Binary
// https://github.com/apache/arrow-rs/issues/8387
let inner = cast_to_binary_view_arrays(inner)?;
Contributor Author (alamb):

This cast is unfortunate, but I think it is the best (simplest) solution while we work on the real solution

}

/// Returns a new DataType representing this VariantArray's inner type
pub fn data_type(&self) -> &DataType {
Contributor Author (alamb):

these methods are just moved from the Array impl

};

// Extract value and typed_value fields (metadata is not expected in ShreddedVariantFieldArray)
let value = inner_struct
Contributor Author (alamb):

this code is moved to ShreddingState

/// let struct_array: StructArray = get_struct_array();
/// let shredding_state = ShreddingState::try_from(&struct_array).unwrap();
/// ```
pub fn new(value: Option<BinaryViewArray>, typed_value: Option<ArrayRef>) -> Self {
Contributor Author (alamb):

This is now infallible per @scovich's suggestion #8365 (comment)

}
}

impl Array for VariantArray {
Contributor Author (alamb):

this was moved up in the file

pub(crate) enum ShreddedPathStep {
    /// Path step succeeded, return the new shredding state
-   Success(&'a ShreddingState),
+   Success(ShreddingState),
Contributor Author (alamb):

this is the only part I am a little sad about -- it now requires owning the ShreddingState, which requires cloning the state. While this is typically just a few Arcs, it is still unfortunate

However, unless we change VariantArray to be reference based (e.g. VariantArray<'a>) I can't think of any way around it

Contributor:

I mean, if somebody is concerned about cloning ArrayRef like it's free... arrow-rs might be the wrong project for them to look very closely at?

Some potential ideas:

  1. Have the caller of variant_get pass an owned VariantArray, and then deconstruct it using some kind of into_parts method? Caller anyway had to clone the various innards of the &dyn Array it would have started out with? But that has the distinct disadvantage of producing a logically read-only operation that consumes its input.
  2. If we're ok with VariantArray being an exclusively local thing (i.e. never returned from functions), we could make both it and ShreddingState take references to their various arrays. That would produce VariantArray<'a>, ShreddingState<'a>, etc.
  3. Make ShreddingState track Option<Cow<...>> internally, which might solve most of the ergonomics issues of 2/ without overly polluting the public API. The only downside is a generic lifetime parameter: VariantArray<'a>, ShreddingState<'a>, etc.

Contributor Author (alamb):

I think BinaryViewArrays also have a Vec that will get cloned. However, I think it is ok to clone for now, as the path / cloning happens once per array and so is likely amortized over a batch.

I tried option 1 briefly but I couldn't get it to work.

I'll let benchmarking be our guide and if we see problems with this clone, then perhaps we can pursue options 2 or 3.

-        "expected a VariantArray as the input for variant_get".to_owned(),
-    )
- })?;
+ let variant_array = VariantArray::try_new(input)?;
Contributor Author (alamb):

I will admit the new try_new is quite a bit more ergonomic

.column_by_name("var")
.unwrap_or_else(|| panic!("No 'var' column found in parquet file {path:?}"));

// the values are read as
Contributor Author (alamb):

this code is moved to VariantArray::try_new

@alamb alamb marked this pull request as ready for review September 19, 2025 17:50
@alamb alamb requested a review from Copilot September 19, 2025 17:50
alamb commented Sep 19, 2025

@scovich @klion26 and @codephage2020 this PR is ready for review

@Copilot Copilot AI left a comment:

Pull Request Overview

This PR removes the Array implementation from VariantArray and ShreddedVariantFieldArray and introduces a new VariantType extension type to properly represent Variant data in the Arrow ecosystem. This change enables proper integration with Parquet I/O operations which previously failed due to implementation issues.

  • Removes Array trait implementations that were causing Parquet integration problems
  • Introduces VariantType extension type following Arrow conventions for canonical representation
  • Updates all code to use explicit conversions between VariantArray and ArrayRef

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

File | Description
parquet-variant-compute/src/variant_array.rs | Implements VariantType extension type, removes Array trait impls, adds conversion methods
parquet-variant-compute/src/variant_get.rs | Updates to use explicit VariantArray conversions instead of downcasting
parquet-variant-compute/src/variant_to_arrow.rs | Updates return type construction for VariantArray
parquet-variant-compute/src/variant_array_builder.rs | Updates VariantArray construction call
parquet-variant-compute/src/lib.rs | Exports new VariantType
parquet/tests/variant_integration.rs | Removes casting workaround code that's now handled internally
parquet-variant-compute/benches/variant_kernels.rs | Updates array conversion


  inner: inner.clone(),
  metadata: metadata.clone(),
- shredding_state: ShreddingState::try_new(value, typed_value)?,
+ shredding_state: ShreddingState::new(value, typed_value),
Copilot AI commented Sep 19, 2025:

The ShreddingState::new method is called here but the old try_new method was removed. Consider adding error handling if ShreddingState::new can potentially fail, or document why this is safe by construction.


alamb commented Sep 19, 2025

After merging up from #8366, a new shred_object test is now failing:

called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Found unmasked nulls for non-nullable StructArray field \"score\"")

I don't fully understand what this means and haven't had a chance to debug it. @scovich any ideas?

scovich commented Sep 19, 2025

After merging up from #8366, a new shred_object test is now failing:

called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Found unmasked nulls for non-nullable StructArray field \"score\"")

I don't fully understand what this means and haven't had a chance to debug it. @scovich any ideas?

The error means that some non-nullable column has null mask bits that are not covered by any ancestor column.

The fields of a shredded variant object are not nullable -- NULL is encoded by the child value and typed_value both being NULL. So it's possible that something is wrongly marking a row of some shredded object field as NULL, even though the path to that value is non-NULL?

It's also possible that some struct (or variant array) is being wrongly constructed with some non-nullable fields that should actually be nullable?

scovich commented Sep 19, 2025

The offending array is below.

I notice a couple weird things:

  1. Looking at the test itself, the top-level null mask should only have one NULL, at index 6. I don't know how indexes 5, 8, and 9 are also NULL??
  2. The null masks for score are also weird.
validity: [
  valid,
  valid,
  valid,
  valid,
  null,
  null,
  valid,
  null,
  null,
]

score: StructArray
-- validity:
[
  valid,
  valid,
  null,
  valid,
  null,
  null,
  null,
  null,
  valid,
]
[
-- child 0: "value" (BinaryView)
BinaryViewArray
[
  null,
  null,
  null,
  [45, 110, 105, 110, 101, 116, 121, 45, 102, 105, 118, 101],
  null,
  null,
  null,
  null,
  null,
]
-- child 1: "typed_value" (Float64)
PrimitiveArray<Float64>
[
  95.5,
  87.2,
  null,
  null,
  null,
  null,
  null,
  null,
  66.67,
]
]

age: StructArray
-- validity:
[
  valid,
  valid,
  valid,
  valid,
  null,
  null,
  null,
  null,
  null,
]
[
-- child 0: "value" (BinaryView)
BinaryViewArray
[
  null,
  null,
  null,
  [25, 116, 104, 105, 114, 116, 121],
  null,
  null,
  null,
  null,
  null,
]
-- child 1: "typed_value" (Int64)
PrimitiveArray<Int64>
[
  30,
  25,
  35,
  null,
  null,
  null,
  null,
  null,
  null,
]
]

scovich commented Sep 20, 2025

@alamb -- Somehow you've managed to trigger a latent bug in my variant shredding code. I have no idea how the test ever passed before, nor why your PR should have anything to do with anything. But here's the fix that I confirmed unbreaks your PR when I stacked them both locally:

alamb pushed a commit that referenced this pull request Sep 20, 2025
# Which issue does this PR close?

- Fast-follow for #8366
- Related to #8392

# Rationale for this change

Somehow, #8392 exposes a latent
bug in #8366, which has improper
NULL handling for shredded object fields. The shredding PR originally
attempted to handle this case, but somehow the test did not trigger the
bug and so the (admittedly incomplete) code was removed. See
#8366 (comment). To
be honest, I have no idea how the original ever worked correctly, nor
why the new PR is able to expose the problem.

# What changes are included in this PR?

When used as a top-level builder,
`VariantToShreddedVariantRowBuilder::append_null` must append NULL to
its own `NullBufferBuilder`; but when used as a shredded object field
builder, it must append non-NULL. Plumb a new `top_level` parameter
through the various functions and into the two sub-builders it relies
on, so they can implement the correct semantics.

# Are these changes tested?

In theory, yes (I don't know how the object shredding test ever passed).
And it fixes the breakage in
#8392.

# Are there any user-facing changes?

No
alamb commented Sep 20, 2025

@alamb -- Somehow you've managed to trigger a latent bug in my variant shredding code. I have no idea how the test ever passed before, nor why your PR should have anything to do with anything. But here's the fix that I confirmed unbreaks your PR when I stacked them both locally:

Thanks! I merged that in and merged up from main

@scovich scovich left a comment:

LGTM. The only open question was about shredding state cloning, but that doesn't affect correctness AFAIK

Comment on lines 59 to 60
// Note don't check for metadata/value fields here because they may be
// absent in shredded variants
Contributor:

This doesn't sound quite right -- AFAIK the extension type should only annotate the top-level variant. Any shredded object fields or array elements nested inside should be identified according to spec, because at that point we know we're traversing a shredded variant (by virtue of having stepped into the typed_value column of a VariantArray). If anything looks out of place when trying to step into a shredded struct field or array element, that's just plain an error. No annotations needed to figure that out.

Even if, for some reason, we decide it's important not to require metadata, we should still enforce that the only fields present are some subset of metadata, value and typed_value, where the first two are binary. TBD whether we recurse into typed_value to verify the rest of the structure -- we probably should, but that could become expensive if people are converting from ArrayRef to VariantArray to ArrayRef a lot.

Contributor Author (alamb):

sounds good -- removed the comments

}
/// Is the element at index null?
pub fn is_null(&self, index: usize) -> bool {
self.nulls().map(|n| n.is_null(index)).unwrap_or_default()
Contributor:

Isn't this just

Suggested change:
- self.nulls().map(|n| n.is_null(index)).unwrap_or_default()
+ self.nulls().map_or(false, |n| n.is_null(index))

Contributor Author (alamb):

Yes, and in fact when I did that clippy told me it could be simplified further to

        self.nulls().is_some_and( |n| n.is_null(index))

alamb and others added 2 commits September 20, 2025 05:53
Labels: parquet (Changes to the parquet crate), parquet-variant (parquet-variant* crates)
Projects: None yet
Development

Successfully merging this pull request may close these issues.

[Variant] VariantArray::data_type returns StructType, causing Array::as_struct to panic
[Variant] writing a VariantArray to parquet panics
2 participants