-
Notifications
You must be signed in to change notification settings - Fork 1.7k
fix: sort should always output batches with batch_size
rows
#17244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: sort should always output batches with batch_size
rows
#17244
Conversation
output = Box::pin(BatchSplitStream::new( | ||
output, | ||
self.batch_size, | ||
self.metrics.split_metrics.clone(), | ||
)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem I have with BatchSplitStream
is that it is not accounting for memory that it hold when it hold the batches
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here caller is already accounting the memory, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in our case yes
batch_size
rows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. It's well written and well tested.
We found an issue in #17163, and this PR is the direct fix.
output = Box::pin(BatchSplitStream::new( | ||
output, | ||
self.batch_size, | ||
self.metrics.split_metrics.clone(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we construct a dummy metric here? If we add it into operator's metrics set, it will display inside EXPLAIN ANALYZE
, and I think this metric is not useful in SortExec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not, split can also happen before spill so it can be called more than one time
output = Box::pin(BatchSplitStream::new( | ||
output, | ||
self.batch_size, | ||
self.metrics.split_metrics.clone(), | ||
)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here caller is already accounting the memory, right?
Thank you for the fix! I actually encountered the same issue, and really glad to see this PR. BTW, have you considered directly creating a stream that sorts a batch and slices it into multiple Something like this: https://gist.github.com/ding-young/25ce38cd6449a3c35bc0df430e6a2eab I'm just sharing the idea since you mentioned the issue with Plus, after this PR gets merged, I'll be glad to hear your review on #17163 |
Yes but decided not to for:
|
Which issue does this PR close?
N/A
Rationale for this change
In
ExternalSorter
there are cases where thebatch_size
config is not respectedWhat changes are included in this PR?
Wrap the stream with
BatchSplitStream
when we only have a single batch to sort and when we can sort in place (the paths that did not respect the batch size)Are these changes tested?
Yes
Are there any user-facing changes?
Not an API change