Add Arrow Variant Extension Type, remove Array impl for VariantArray and ShreddedVariantFieldArray #8392
Conversation
use std::any::Any;
use std::sync::Arc;

/// Arrow Variant [`ExtensionType`].
Here is the new canonical variant type
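As a rough illustration of what "canonical variant type" means in practice, here is a minimal sketch of how an Arrow extension type is attached to a `Field` via the spec-defined `ARROW:extension:name` metadata key. The extension name string and the field layout below are placeholders, not the definitions from this PR (see `VariantType` in variant_array.rs for the real ones).

```rust
use std::collections::HashMap;

use arrow_schema::{DataType, Field, Fields};

/// Sketch only: build a Field annotated as a variant extension type.
/// The extension name and the unshredded layout shown here are assumptions;
/// the PR's `VariantType` defines the canonical values.
fn variant_field(name: &str, nullable: bool) -> Field {
    // Illustrative unshredded layout: binary metadata + binary value columns.
    let data_type = DataType::Struct(Fields::from(vec![
        Field::new("metadata", DataType::BinaryView, false),
        Field::new("value", DataType::BinaryView, true),
    ]));
    Field::new(name, data_type, nullable).with_metadata(HashMap::from([(
        "ARROW:extension:name".to_string(),
        "arrow.parquet.variant".to_string(), // placeholder extension name
    )]))
}
```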
}

/// An array of Parquet [`Variant`] values
///
I tried to update these docstrings to explain what is going on here and how to use the extension types
pub fn try_new(inner: &dyn Array) -> Result<Self, ArrowError> {
    // Workaround lack of support for Binary
    // https://github.com/apache/arrow-rs/issues/8387
    let inner = cast_to_binary_view_arrays(inner)?;
This cast is unfortunate, but I think it is the best (simplest) solution while we work on the real solution
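For readers unfamiliar with the workaround, a minimal sketch of the idea, assuming the standard `arrow::compute::cast` kernel: normalize plain `Binary` columns to `BinaryView` so the rest of the code only deals with one encoding. The PR's `cast_to_binary_view_arrays` helper presumably applies the same idea to the struct's child columns.

```rust
use arrow::array::{make_array, Array, ArrayRef};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

/// Sketch only: cast a single Binary/LargeBinary array to BinaryView,
/// passing everything else through untouched.
fn to_binary_view(input: &dyn Array) -> Result<ArrayRef, ArrowError> {
    match input.data_type() {
        DataType::Binary | DataType::LargeBinary => cast(input, &DataType::BinaryView),
        // Already BinaryView (or some other type): rebuild the ArrayRef as-is.
        _ => Ok(make_array(input.to_data())),
    }
}
```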
}

/// Returns a new DataType representing this VariantArray's inner type
pub fn data_type(&self) -> &DataType {
these methods are just moved from the Array impl
};

// Extract value and typed_value fields (metadata is not expected in ShreddedVariantFieldArray)
let value = inner_struct
this code is moved to ShreddingState
/// let struct_array: StructArray = get_struct_array();
/// let shredding_state = ShreddingState::try_from(&struct_array).unwrap();
/// ```
pub fn new(value: Option<BinaryViewArray>, typed_value: Option<ArrayRef>) -> Self {
This is now infallible per @scovich's suggestion #8365 (comment)
}
}

impl Array for VariantArray {
this was moved up in the file
pub(crate) enum ShreddedPathStep {
    /// Path step succeeded, return the new shredding state
-   Success(&'a ShreddingState),
+   Success(ShreddingState),
this is the only part I am a little sad about -- that it now requires owning the ShreddingState, which requires cloning the state. While this is typically just a few `Arc`s, it is still unfortunate.

However, unless we change VariantArray to be reference based (e.g. `VariantArray<'a>`) I can't think of any way around it.
I mean, if somebody is concerned about cloning `ArrayRef` like it's free... arrow-rs might be the wrong project for them to look very closely at?

Some potential ideas:

1. Have the caller of `variant_get` pass an owned `VariantArray`, and then deconstruct it using some kind of `into_parts` method? The caller anyway had to clone the various innards of the `&dyn Array` it would have started out with? But that has the distinct disadvantage of producing a logically read-only operation that consumes its input.
2. If we're ok with `VariantArray` being an exclusively local thing (i.e. never returned from functions), we could make both it and `ShreddingState` take references to their various arrays. That would produce `VariantArray<'a>`, `ShreddingState<'a>`, etc.
3. Make `ShreddingState` track `Option<Cow<...>>` internally, which might solve most of the ergonomics issues of 2/ without overly polluting the public API. The only downside is a generic lifetime parameter: `VariantArray<'a>`, `ShreddingState<'a>`, etc.
I think `BinaryViewArray`s also have a `Vec` that will get cloned. However, I think it is ok to clone for now, as the path / cloning happens once per array and so is likely amortized over a batch.

I tried option 1 briefly but I couldn't get it to work.

I'll let benchmarking be our guide and if we see problems with this clone, then perhaps we can pursue options 2 or 3.
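A simplified sketch of the trade-off being discussed (names and fields are illustrative, not the exact definitions in this PR): returning an owned state from a path step forces a clone, but that clone is shallow because the wrapped arrays are backed by ref-counted buffers.

```rust
use arrow::array::{ArrayRef, BinaryViewArray};

/// Illustrative stand-in for the PR's ShreddingState: two optional columns.
#[derive(Clone)]
struct ShreddingStateSketch {
    value: Option<BinaryViewArray>,
    typed_value: Option<ArrayRef>,
}

/// Illustrative stand-in for ShreddedPathStep: the Success variant now owns
/// its state instead of borrowing it from the parent VariantArray.
enum ShreddedPathStepSketch {
    Success(ShreddingStateSketch),
    Missing,
}

fn step_into_field(state: &ShreddingStateSketch, _field: &str) -> ShreddedPathStepSketch {
    // The real code would locate the child field's state; the point here is
    // the owned return value, which forces a (shallow, Arc-backed) clone.
    ShreddedPathStepSketch::Success(state.clone())
}
```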
"expected a VariantArray as the input for variant_get".to_owned(), | ||
) | ||
})?; | ||
let variant_array = VariantArray::try_new(input)?; |
I will admit the new `try_new` is quite a bit more ergonomic
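To make the ergonomics point concrete, a sketch assuming `input: &dyn Array` as in the diff above; the "old shape" described in comments is a rough reconstruction, not the exact removed code.

```rust
use arrow::array::Array;
use arrow::error::ArrowError;
use parquet_variant_compute::VariantArray;

fn get_variant_array(input: &dyn Array) -> Result<VariantArray, ArrowError> {
    // Old shape (roughly): downcast to StructArray by hand, build the
    // "expected a VariantArray as the input for variant_get" error yourself,
    // then construct the VariantArray from the struct.
    //
    // New shape: one fallible constructor does the checking.
    VariantArray::try_new(input)
}
```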
.column_by_name("var") | ||
.unwrap_or_else(|| panic!("No 'var' column found in parquet file {path:?}")); | ||
|
||
// the values are read as |
this code is moved to VariantArray::try_new
@scovich @klion26 and @codephage2020 this PR is ready for review
Pull Request Overview
This PR removes the `Array` implementation from `VariantArray` and `ShreddedVariantFieldArray` and introduces a new `VariantType` extension type to properly represent Variant data in the Arrow ecosystem. This change enables proper integration with Parquet I/O operations, which previously failed due to implementation issues.
- Removes `Array` trait implementations that were causing Parquet integration problems
- Introduces `VariantType` extension type following Arrow conventions for canonical representation
- Updates all code to use explicit conversions between `VariantArray` and `ArrayRef`
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| parquet-variant-compute/src/variant_array.rs | Implements VariantType extension type, removes Array trait impls, adds conversion methods |
| parquet-variant-compute/src/variant_get.rs | Updates to use explicit VariantArray conversions instead of downcasting |
| parquet-variant-compute/src/variant_to_arrow.rs | Updates return type construction for VariantArray |
| parquet-variant-compute/src/variant_array_builder.rs | Updates VariantArray construction call |
| parquet-variant-compute/src/lib.rs | Exports new VariantType |
| parquet/tests/variant_integration.rs | Removes casting workaround code that's now handled internally |
| parquet-variant-compute/benches/variant_kernels.rs | Updates array conversion |
    inner: inner.clone(),
    metadata: metadata.clone(),
-   shredding_state: ShreddingState::try_new(value, typed_value)?,
+   shredding_state: ShreddingState::new(value, typed_value),
The `ShreddingState::new` method is called here but the old `try_new` method was removed. Consider adding error handling if `ShreddingState::new` can potentially fail, or document why this is safe by construction.
After merging up from #8366, a new shred_object test is now failing:

I don't fully understand what this means and haven't had a chance to debug it. @scovich any ideas?
The error means that some non-nullable column has null mask bits that are not covered by any ancestor column. The fields of a shredded variant object are not nullable -- NULL is encoded by the child. It's also possible that some struct (or variant array) is being wrongly constructed with some non-nullable fields that should actually be nullable?
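An illustrative, self-contained example of the failure mode described here (not the shredding code itself): a child declared non-nullable contains a null that no ancestor's null mask covers, so building the `StructArray` fails validation; masking the same slot at the parent level makes it valid.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, StructArray};
use arrow::buffer::NullBuffer;
use arrow::datatypes::{DataType, Field, Fields};

fn unmasked_null_example() {
    let fields = Fields::from(vec![Field::new("typed_value", DataType::Int64, false)]);
    let child: ArrayRef = Arc::new(Int64Array::from(vec![Some(1), None, Some(3)]));

    // No parent null mask: the child's null at index 1 is "unmasked" -> error.
    assert!(StructArray::try_new(fields.clone(), vec![child.clone()], None).is_err());

    // Parent marks index 1 as null, covering the child's null -> ok.
    let parent_nulls = NullBuffer::from(vec![true, false, true]);
    assert!(StructArray::try_new(fields, vec![child], Some(parent_nulls)).is_ok());
}
```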
The offending array is below. I notice a couple weird things:
@alamb -- Somehow you've managed to trigger a latent bug in my variant shredding code. I have no idea how the test ever passed before, nor why your PR should have anything to do with anything. But here's the fix that I confirmed unbreaks your PR when I stacked them both locally:
# Which issue does this PR close?

- Fast-follow for #8366
- Related to #8392

# Rationale for this change

Somehow, #8392 exposes a latent bug in #8366, which has improper NULL handling for shredded object fields. The shredding PR originally attempted to handle this case, but somehow the test did not trigger the bug and so the (admittedly incomplete) code was removed. See #8366 (comment). To be honest, I have no idea how the original ever worked correctly, nor why the new PR is able to expose the problem.

# What changes are included in this PR?

When used as a top-level builder, `VariantToShreddedVariantRowBuilder::append_null` must append NULL to its own `NullBufferBuilder`; but when used as a shredded object field builder, it must append non-NULL. Plumb a new `top_level` parameter through the various functions and into the two sub-builders it relies on, so they can implement the correct semantics.

# Are these changes tested?

In theory, yes (I don't know how the object shredding test ever passed). And it fixes the breakage in #8392.

# Are there any user-facing changes?

No
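A hedged sketch of the fix described above, with simplified names (only the `top_level` distinction is taken from the description): a NULL top-level variant row sets its own validity bit to null, while a shredded object field stays present at the struct level and encodes NULL inside the field instead.

```rust
use arrow_buffer::NullBufferBuilder;

struct RowBuilderSketch {
    nulls: NullBufferBuilder,
    top_level: bool,
}

impl RowBuilderSketch {
    fn new(capacity: usize, top_level: bool) -> Self {
        Self {
            nulls: NullBufferBuilder::new(capacity),
            top_level,
        }
    }

    fn append_null(&mut self) {
        if self.top_level {
            // Top-level variant row: the row itself is NULL.
            self.nulls.append_null();
        } else {
            // Shredded object field: the struct slot stays non-null; the
            // field's own value/typed_value columns encode the NULL.
            self.nulls.append_non_null();
        }
    }
}
```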
Thanks! I merged that in and merged up from main
LGTM. The only open question was about shredding state cloning, but that doesn't affect correctness AFAIK
// Note don't check for metadata/value fields here because they may be
// absent in shredded variants
This doesn't sound quite right -- AFAIK the extension type should only annotate the top-level variant. Any shredded object fields or array elements nested inside should be identified according to spec, because at that point we know we're traversing a shredded variant (by virtue of having stepped into the `typed_value` column of a `VariantArray`). If anything looks out of place when trying to step into a shredded struct field or array element, that's just plain an error. No annotations needed to figure that out.

Even if, for some reason, we decide it's important not to require `metadata`, we should still enforce that the only fields present are some subset of `metadata`, `value` and `typed_value`, where the first two are binary. TBD whether we recurse into `typed_value` to verify the rest of the structure -- we probably should, but that could become expensive if people are converting from ArrayRef to VariantArray to ArrayRef a lot.
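A sketch of the kind of structural check suggested here (not code from the PR): accept only a subset of `metadata`, `value`, and `typed_value`, and require the first two to be a binary type.

```rust
use arrow::array::StructArray;
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

fn validate_variant_struct(inner: &StructArray) -> Result<(), ArrowError> {
    for field in inner.fields().iter() {
        match field.name().as_str() {
            "metadata" | "value" => match field.data_type() {
                DataType::Binary | DataType::LargeBinary | DataType::BinaryView => {}
                other => {
                    return Err(ArrowError::InvalidArgumentError(format!(
                        "field '{}' must be binary, got {other}",
                        field.name()
                    )))
                }
            },
            // Whether to recurse into typed_value is the open question above.
            "typed_value" => {}
            other => {
                return Err(ArrowError::InvalidArgumentError(format!(
                    "unexpected field '{other}' in shredded variant struct"
                )))
            }
        }
    }
    Ok(())
}
```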
sounds good -- removed the comments
}

/// Is the element at index null?
pub fn is_null(&self, index: usize) -> bool {
    self.nulls().map(|n| n.is_null(index)).unwrap_or_default()
Isn't this just
-    self.nulls().map(|n| n.is_null(index)).unwrap_or_default()
+    self.nulls().map_or(false, |n| n.is_null(index))
Yes, and in fact when I did that clippy told me it could be simplified further to

`self.nulls().is_some_and(|n| n.is_null(index))`
Co-authored-by: Ryan Johnson <scovich@users.noreply.github.com>
Which issue does this PR close?
- `VariantArray::data_type` returns `StructType`, causing `Array::as_struct` to panic #8319

Rationale for this change
This is needed to read/write the Variant Parquet logical type and work with the rest of the Arrow Ecosystem
Note, this is broken out of the larger PR here:

- `VariantArray` to parquet with `Variant` LogicalType #8365

We need a way to write Variant encoded data to/from parquet, and the current way the VariantArray is implemented doesn't work (panics when writing to parquet). See tickets above.
Instead of an `impl Array`, it seems the better way to do this is using an Arrow Extension Type. See #8319 (comment) for more details.

What changes are included in this PR?
- Remove the `Array` impl for `VariantArray`, which forces explicit conversions back/forth when reading/writing (see the conversion sketch after this list)
- Remove the `Array` impl for `ShreddedVariantFieldArray`, which forces explicit conversions back/forth when reading/writing
- Add the `VariantType` extension type
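As a self-contained illustration of the explicit conversions mentioned above (this is a sketch, not the PR's actual API; see `VariantArray::try_new` and the conversion methods in variant_array.rs for the real interface):

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, AsArray, StructArray};
use arrow::error::ArrowError;

/// Sketch only: a wrapper that is *not* an Array, so conversions to and from
/// plain Arrow arrays happen explicitly at the read/write boundary.
struct VariantArraySketch {
    inner: StructArray,
}

impl VariantArraySketch {
    /// &dyn Array -> wrapper: validated, fallible conversion.
    fn try_new(input: &dyn Array) -> Result<Self, ArrowError> {
        let inner = input
            .as_struct_opt()
            .ok_or_else(|| ArrowError::InvalidArgumentError("expected a StructArray".to_owned()))?
            .clone();
        Ok(Self { inner })
    }

    /// Wrapper -> ArrayRef: explicit conversion back for Arrow/Parquet APIs.
    fn into_array_ref(self) -> ArrayRef {
        Arc::new(self.inner)
    }
}
```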
Are these changes tested?

Yes, with new code and tests
Are there any user-facing changes?
Yes, but this is not yet stable / released, so these changes have no impact on the releasability of this code