
Conversation

@ding-young (Contributor) commented Aug 13, 2025

Which issue does this PR close?

Rationale for this change

DataFusion's sort implementation converts batches into a row-based format when there are two or more sort columns, in order to implement efficient comparisons using cursors. Currently, the memory reservation for each batch is estimated as roughly the size of the original batch, to account for both the original batch and the cursor batch (an effective 2x estimate). However, the actual memory usage of the cursor can differ significantly from the batch size depending on the mix of payload and sort columns: a batch with many payload columns and one small sort key produces a tiny cursor (over-estimation), while wide sort keys can produce a cursor larger than the batch itself (under-estimation).

What changes are included in this PR?

This PR addresses two memory accounting issues:

  1. Do the col->row conversion first and use the measured ratio for memory reservation.
    We now convert the first batch to a row-based cursor, compute the ratio of the cursor batch's memory to the original batch's memory, and record it. This ratio is then used in the `get_reserved_byte_for_record_batch_size` function (see the sketch below this list).

TODO
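
A minimal sketch of how the ratio could be measured from the first batch. It uses the public `arrow::row` API; the helper name `calculate_ratio` and the `sort_key_indices` parameter are illustrative, not necessarily the PR's exact code:

```rust
use arrow::array::Array;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;
use arrow::row::{RowConverter, SortField};

/// Hypothetical helper: measure the cursor/batch memory ratio from one batch.
fn calculate_ratio(batch: &RecordBatch, sort_key_indices: &[usize]) -> Result<f64, ArrowError> {
    // Build a row converter for just the sort-key columns.
    let fields: Vec<SortField> = sort_key_indices
        .iter()
        .map(|&i| SortField::new(batch.column(i).data_type().clone()))
        .collect();
    let converter = RowConverter::new(fields)?;

    // Convert the sort keys to the row format used by the cursor.
    let columns: Vec<_> = sort_key_indices
        .iter()
        .map(|&i| batch.column(i).clone())
        .collect();
    let rows = converter.convert_columns(&columns)?;

    // Ratio of cursor memory to original batch memory.
    Ok(rows.size() as f64 / batch.get_array_memory_size() as f64)
}
```

The measured ratio then replaces the fixed assumption that the cursor is about as large as the original batch.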

Are these changes tested?

I ran sort-tpch with a memory limit, and more queries are now able to complete successfully.
While external sort performance may have improved thanks to more accurate memory reservation, there is a possible regression for small or fast queries, since the first batch is converted and then discarded solely for estimation purposes.
(This hasn't been tested yet, but it's something to be aware of.)

  • Running sort-tpch with 1 partition and a 100M memory limit; Q5 now succeeds:
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        main ┃ calculate-mem-row ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1           │  1669.47 ms │        1519.03 ms │ +1.10x faster │
│ Q2           │  1681.37 ms │        1646.89 ms │     no change │
│ Q3           │  5577.40 ms │        5458.33 ms │     no change │
│ Q4           │  1831.07 ms │        1905.28 ms │     no change │
│ Q5           │        FAIL │        2779.55 ms │  incomparable │
│ Q6           │  3909.66 ms │        4108.29 ms │  1.05x slower │
│ Q7           │ 15915.65 ms │        8958.87 ms │ +1.78x faster │
│ Q8           │  5489.77 ms │        4835.37 ms │ +1.14x faster │
│ Q9           │  5649.99 ms │        5297.03 ms │ +1.07x faster │
│ Q10          │ 23783.17 ms │       11217.44 ms │ +2.12x faster │
│ Q11          │  4251.77 ms │        4162.19 ms │     no change │
└──────────────┴─────────────┴───────────────────┴───────────────┘
  • Running sort-tpch with 4 partitions and a 100M memory limit; Q8 and Q11 now succeed.
    Update: no longer true. After using the correct memory size for StringViewArray, this PR still suffers from these query failures.

cargo run --profile release-nonlto --bin dfbench -- sort-tpch --path benchmarks/data/tpch_sf1 --partitions 4 --memory-limit 100M

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ calculate-mem-row ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ Q1           │  984.22 ms │         994.77 ms │    no change │
│ Q2           │  961.59 ms │         991.49 ms │    no change │
│ Q3           │       FAIL │              FAIL │ incomparable │
│ Q4           │ 1395.46 ms │        1492.62 ms │ 1.07x slower │
│ Q5           │ 1658.36 ms │        1581.73 ms │    no change │
│ Q6           │ 1682.21 ms │        1910.95 ms │ 1.14x slower │
│ Q7           │       FAIL │              FAIL │ incomparable │
│ Q8           │       FAIL │        3271.85 ms │ incomparable │
│ Q9           │       FAIL │              FAIL │ incomparable │
│ Q10          │       FAIL │              FAIL │ incomparable │
│ Q11          │       FAIL │        2309.35 ms │ incomparable │
└──────────────┴────────────┴───────────────────┴──────────────┘

Are there any user-facing changes?

@github-actions github-actions bot added the core (Core DataFusion crate) and physical-plan (Changes to the physical-plan crate) labels on Aug 13, 2025
Comment on lines +316 to +318
// TODO: since we do the col->row conversion first, the sort succeeds without limiting sort_spill_reservation_bytes
test.clone()
    .with_expected_errors(vec![
        "Resources exhausted: Additional allocation failed with top memory consumers (across reservations) as:",
        "B for ExternalSorterMerge",
    ])
    .with_expected_success()
ding-young (Contributor Author):

I'm wondering whether the sort_spill_reservation_bytes configuration will still be necessary after this change. Since we're now estimating the memory required for the merge phase based on the actual ratio measured from the first batch, the manual override may become less important in practice.
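
For reference, the config under discussion can be set per session. A small illustration, assuming the `SessionConfig::with_sort_spill_reservation_bytes` setter available in recent DataFusion versions (verify against your version):

```rust
use datafusion::prelude::SessionConfig;

fn main() {
    // Manually pre-reserve 16 MiB for the merge phase of external sorts.
    let _config = SessionConfig::new().with_sort_spill_reservation_bytes(16 * 1024 * 1024);
}
```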

@ding-young ding-young changed the title [Draft] More accurate memory accounting in external sort More accurate memory accounting in external sort Aug 20, 2025
@ding-young ding-young changed the title More accurate memory accounting in external sort Improve memory accounting for cursor in external sort Aug 20, 2025
@ding-young ding-young marked this pull request as ready for review August 20, 2025 03:13
@2010YOUY01 (Contributor) left a comment:


Thanks for the work. The idea looks good to me; I'm trying to figure out why the last batch can be too large.

BTW, what's the reason for the remaining sort_tpch query failures?

@@ -88,6 +88,10 @@ pub struct StreamingMergeBuilder<'a> {
    fetch: Option<usize>,
    reservation: Option<MemoryReservation>,
    enable_round_robin_tie_breaker: bool,
    /// Ratio of memory used by the cursor batch to the original input RecordBatch.
    /// Used in `get_reserved_byte_for_record_batch_size` to estimate the memory required for the merge phase.
    /// Only passed when constructing MultiLevelMergeBuilder
Suggested change:
-/// Only passed when constructing MultiLevelMergeBuilder
+/// Only passed when constructing MultiLevelMergeBuilder
+///
+/// A cursor is an interface for comparing entries between two batches.
+/// It includes the original batch and the sort keys converted to Arrow Row format
+/// for faster comparison. See `cursor.rs` for more details.
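
For readers unfamiliar with the row format the suggestion refers to, here is a self-contained illustration using the public `arrow::row` API (not DataFusion's internal cursor types) of why row comparisons are cheap:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::datatypes::DataType;
use arrow::error::ArrowError;
use arrow::row::{RowConverter, SortField};

fn main() -> Result<(), ArrowError> {
    // Convert a sort-key column into the Arrow row format.
    let converter = RowConverter::new(vec![SortField::new(DataType::Int32)])?;
    let col: ArrayRef = Arc::new(Int32Array::from(vec![3, 1, 2]));
    let rows = converter.convert_columns(&[col])?;

    // Comparing rows is a plain byte comparison, with no per-type dispatch.
    assert!(rows.row(1) < rows.row(0)); // value 1 sorts before value 3
    Ok(())
}
```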

Comment on lines +361 to +377
// Only for the first batch
if self.cursor_batch_ratio.is_none() {
    let ratio = self.calculate_ratio(&input)?;
    self.cursor_batch_ratio = Some(ratio);
}
Contributor:

Can we defer this until we actually need it? If each sort only gets a single batch, or we sort in place, we will convert it to rows without need.

@ding-young (Contributor Author):

> BTW, what's the reason for the remaining sort_tpch query failures?

There are several different failure points, but here are the main types I've observed so far:

(1) During insert_batch, we hit a memory limit and attempt to call sort_and_spill_in_mem_batches, which then fails.

(2) During sort, when there have already been previous spills in insert_batch, we try to call sort_and_spill_in_mem_batches again before merging via StreamingMergeBuilder, and it fails.

(3) In multi-level merge, when we can no longer reduce the number of spill files or buffer size to fit the memory limit.

For (1) and (2), the core issue is that in order to spill in-memory batches, we need to sort them via streaming merge using a loser tree. But since all in-memory batches are already buffered at that point, attempting to merge them all as loser tree inputs further increases memory pressure, leading to failure.

I’m currently experimenting with limiting the fanout (merge degree) of the loser tree to see if this can help mitigate the problem.
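
Illustrative only (not code from this PR): capping the merge degree means spill runs are merged in groups of at most `max_fanout`, cascading over multiple passes instead of buffering every run at once. A hypothetical helper for intuition:

```rust
/// Number of cascaded merge passes needed when at most `max_fanout`
/// runs are merged at a time.
fn merge_passes(mut runs: usize, max_fanout: usize) -> usize {
    assert!(max_fanout >= 2);
    let mut passes = 0;
    while runs > 1 {
        // Each pass merges groups of up to `max_fanout` runs into one run each.
        runs = runs.div_ceil(max_fanout);
        passes += 1;
    }
    passes
}

fn main() {
    // 100 spill runs with fanout 8 need three passes (100 -> 13 -> 2 -> 1),
    // but each pass only holds up to 8 runs in memory at once.
    assert_eq!(merge_passes(100, 8), 3);
}
```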

let sort_res = self.reservation.try_grow(batch_size);

// if the cursor is smaller than half of the original batch, we may say that 2x the batch is enough for both the sort and merge phases
let cursor_small = self
Contributor:

Here it's doing: if rows/original_batch < 1.0, round the row-size estimation up to 1.0 instead of using the actual ratio. Why this round-up?

BTW, is this PR ready to review now?

ding-young (Contributor Author):


The round-up here is still a heuristic. It's hard to provide a strict guarantee that StreamingMerge won't fail, even if we use the estimated ratio to increase merge_reservation. So for now, I kept the conservative approach of using the original 2x ratio when the estimated ratio is < 1.0. It's a temporary workaround and might be revised later.
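
A minimal sketch of this heuristic (illustrative names, not the PR's exact code): fall back to the conservative 2x estimate whenever the measured ratio is below 1.0.

```rust
/// Hypothetical reservation helper: reserve for the original batch plus
/// its cursor, rounding the ratio up to 1.0 (i.e. a 2x total) when the
/// measured cursor/batch ratio is small or not yet known.
fn reserved_bytes_for_batch(batch_size: usize, cursor_batch_ratio: Option<f64>) -> usize {
    match cursor_batch_ratio {
        // Cursor measured to be at least as large as the batch: use the real ratio.
        Some(ratio) if ratio >= 1.0 => (batch_size as f64 * (1.0 + ratio)).ceil() as usize,
        // Small or unknown ratio: keep the conservative 2x estimate.
        _ => 2 * batch_size,
    }
}
```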

I still need to clean up some of the comments and test code, so it'd be great if you could take a look at the core logic for now.

Side note: this PR does fix the sort-tpch Q5 issue mentioned above, but based on my recent investigation into the StringViewArray memory usage problem, it doesn't seem to resolve the failures in the other queries (Q3, Q7-Q11, which include VARCHAR columns) once the memory estimation for StringViewArray is made correct.

Labels: core (Core DataFusion crate), physical-plan (Changes to the physical-plan crate)
Development

Successfully merging this pull request may close these issues.

Sort-tpch Q5 fails when memory-limit is intermediate while succeeds with smaller memory