Skip to content

Conversation

ding-young
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

When we spill RecordBatch on external sorting, we call batch.get_sliced_size to report max_record_batch_size in each spill file. Thereby we can safely read the batch back from spill file with correct memory estimation. However, it is found (in #17029) that get_sliced_size underestimated the memory size for StringViewArray. The diff was quite large i.e. when correct batch size was 1392730 bytes, it returned 163840 bytes, which means the actual memory usage may far exceed the configured memory_limit, especially when merging spill file.

This underestimation happens because arrow api get_slice_memory_size counts for only views buffer for StringViewArray while ignoring the capacity for data (payload) buffer.

See
https://github.com/apache/arrow-rs/blob/a620957bc98b7aa14faec10635bb798932f00bf9/arrow-data/src/data.rs#L464-L476
https://github.com/apache/arrow-rs/blob/a620957bc98b7aa14faec10635bb798932f00bf9/arrow-data/src/data.rs#L1743-L1752

What changes are included in this PR?

This PR corrects the memory estimation for batch.get_sliced_size for StringViewArray by manually adding up the capacity for data buffers. Since we call this function after StringViewArray.gc, it is safe to add the buffer capacity directly.

We may need to upstream this patch if needed.

Are these changes tested?

Yes.

Are there any user-facing changes?

Some external sort queries (including those that use StringViewArray) may fail more frequently under tight memory limits, because we fix the underestimation. However, since this reflects actual memory usage and we're still working on the fix, so I think it's okay to proceed.

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Aug 25, 2025
Copy link
Contributor

@2010YOUY01 2010YOUY01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me, thank you.

I suggest to open an issue in arrow-rs, and link the issue in the implementation comment.

// Therefore, we manually add the sum of the lengths used by all non inlined views
// on top of the sliced size for views buffer. This matches the intended semantics of
// "bytes needed if we materialized exactly this slice into fresh buffers".
// Note: if multiple arrays share the same data buffers, we may double count each StringViewArray.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to move this 'Note' to function comment, this is important to know when using the function.

This double counting buffer issue seems not possible, inside the same StringViewArray the buffers should be distinct, sharing is only possible among different arrays. Maybe we should only keep the 'remeber to gc()' part in the note.

Copy link
Contributor

@ctsk ctsk Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not guaranteed that the buffers inside a StringViewArray are distinct :/

Or to be more precise: It does definitely occur that a StringViewArray holds multiple references to the same buffer due, because the concat kernel does not ensure that buffers are unique. One common case is when datafusion processes a join, and then coalesces the batches that are emitted from the join.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, for concat kernel it does't necessarily do gc(). It's really tricky to get StringView used correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would adjusting the comment location be sufficient in this PR, or do you think we should revise the memory calculation to account for potential buffer sharing within a single array?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh we also need to revise the memory calculation. I'll bring up with the fix that considers shared buffers inside single array.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, now we always gc() on StringView columns before spilling, and it's the only use case for this function. So there is no need to account duplicated buffers inside this function. Adding a comment is enough for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh we also need to revise the memory calculation. I'll bring up with the fix that considers shared buffers inside single array.

I think get_record_batch_memory_size() has already addressed such case 🤔 The only remaining issue is if there are buffer sharing among different batches, it overestimates, but this seems not to be a big issue for now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would not be an issue for most cases, so I only updated the comment.

IIRC, now we always gc() on StringView columns before spilling

However it seems like we don't gc() on StringView if it spills on multi level merge. Maybe we should find out whether we need to gc for specific cases (i.e. when in-memory streams are involved) or update this calculation logic similar to get_record_batch_memory_size

Copy link
Contributor

@2010YOUY01 2010YOUY01 Aug 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However it seems like we don't gc() on StringView if it spills on multi level merge.

I remember it caused spill file size to explode in some cases, so perhaps we should move the gc() logic inside the spill utility, and this can ensure gc() on string views before spilling.

@ding-young ding-young force-pushed the fix-string-view-sliced-size branch from 227bd12 to 3f33d3e Compare August 31, 2025 06:41
@2010YOUY01 2010YOUY01 merged commit cf71d89 into apache:main Sep 2, 2025
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-plan Changes to the physical-plan crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants