
Conversation

@ding-young (Contributor) commented Aug 13, 2025

Which issue does this PR close?

Rationale for this change

DataFusion's sort implementation converts batches into a row-based format when there are two or more sort columns, in order to implement efficient comparisons using cursors. Currently, the memory reservation for each batch is estimated as roughly the size of the original batch, to account for both the original batch and the cursor batch (an effective 2x estimate). However, the actual memory usage of the cursor can differ significantly from the batch size depending on the mix of payload and sort columns: a batch with many payload columns and one small sort key produces a tiny cursor (over-estimation), while wide sort keys can produce a cursor larger than the batch itself (under-estimation).

What changes are included in this PR?

This PR addresses two memory accounting issues:

  1. Do the col->row conversion first and use the measured ratio for memory reservation.
    We now convert the first batch to a row-based cursor, compute the ratio of the cursor batch's memory to the original batch's memory, and record it. This ratio is then used in the `get_reserved_byte_for_record_batch_size` function (see the sketch below this list).

TODO
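
A minimal sketch of how the ratio could be measured from the first batch. It uses the public `arrow::row` API; the helper name `calculate_ratio` and the `sort_key_indices` parameter are illustrative, not necessarily the PR's exact code:

```rust
use arrow::array::Array;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;
use arrow::row::{RowConverter, SortField};

/// Hypothetical helper: measure the cursor/batch memory ratio from one batch.
fn calculate_ratio(batch: &RecordBatch, sort_key_indices: &[usize]) -> Result<f64, ArrowError> {
    // Build a row converter for just the sort-key columns.
    let fields: Vec<SortField> = sort_key_indices
        .iter()
        .map(|&i| SortField::new(batch.column(i).data_type().clone()))
        .collect();
    let converter = RowConverter::new(fields)?;

    // Convert the sort keys to the row format used by the cursor.
    let columns: Vec<_> = sort_key_indices
        .iter()
        .map(|&i| batch.column(i).clone())
        .collect();
    let rows = converter.convert_columns(&columns)?;

    // Ratio of cursor memory to original batch memory.
    Ok(rows.size() as f64 / batch.get_array_memory_size() as f64)
}
```

The measured ratio then replaces the fixed assumption that the cursor is about as large as the original batch.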

Are these changes tested?

I ran sort-tpch with a memory limit, and more queries are now able to complete successfully.
While external sort performance may have improved thanks to more accurate memory reservation, there is a possible regression for small or fast queries, since the first batch is converted and then discarded solely for estimation purposes.
(This hasn't been tested yet, but it's something to be aware of.)

  • Running sort-tpch with 1 partition and a 100M memory limit; Q5 now succeeds:
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        main ┃ calculate-mem-row ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1           │  1669.47 ms │        1519.03 ms │ +1.10x faster │
│ Q2           │  1681.37 ms │        1646.89 ms │     no change │
│ Q3           │  5577.40 ms │        5458.33 ms │     no change │
│ Q4           │  1831.07 ms │        1905.28 ms │     no change │
│ Q5           │        FAIL │        2779.55 ms │  incomparable │
│ Q6           │  3909.66 ms │        4108.29 ms │  1.05x slower │
│ Q7           │ 15915.65 ms │        8958.87 ms │ +1.78x faster │
│ Q8           │  5489.77 ms │        4835.37 ms │ +1.14x faster │
│ Q9           │  5649.99 ms │        5297.03 ms │ +1.07x faster │
│ Q10          │ 23783.17 ms │       11217.44 ms │ +2.12x faster │
│ Q11          │  4251.77 ms │        4162.19 ms │     no change │
└──────────────┴─────────────┴───────────────────┴───────────────┘
  • Running sort-tpch with 4 partitions and a 100M memory limit; Q8 and Q11 now succeed.
    Update: no longer true. After using the correct memory size for StringViewArray, this PR still suffers from these query failures.

cargo run --profile release-nonlto --bin dfbench -- sort-tpch --path benchmarks/data/tpch_sf1 --partitions 4 --memory-limit 100M

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ calculate-mem-row ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ Q1           │  984.22 ms │         994.77 ms │    no change │
│ Q2           │  961.59 ms │         991.49 ms │    no change │
│ Q3           │       FAIL │              FAIL │ incomparable │
│ Q4           │ 1395.46 ms │        1492.62 ms │ 1.07x slower │
│ Q5           │ 1658.36 ms │        1581.73 ms │    no change │
│ Q6           │ 1682.21 ms │        1910.95 ms │ 1.14x slower │
│ Q7           │       FAIL │              FAIL │ incomparable │
│ Q8           │       FAIL │        3271.85 ms │ incomparable │
│ Q9           │       FAIL │              FAIL │ incomparable │
│ Q10          │       FAIL │              FAIL │ incomparable │
│ Q11          │       FAIL │        2309.35 ms │ incomparable │
└──────────────┴────────────┴───────────────────┴──────────────┘

Are there any user-facing changes?

@github-actions github-actions bot added the core (Core DataFusion crate) and physical-plan (Changes to the physical-plan crate) labels on Aug 13, 2025
Comment on lines +316 to +318
// TODO: since we do the col->row conversion first, the sort succeeds without limiting sort_spill_reservation_bytes
test.clone()
    .with_expected_errors(vec![
        "Resources exhausted: Additional allocation failed with top memory consumers (across reservations) as:",
        "B for ExternalSorterMerge",
    ])
    .with_expected_success()
ding-young (Contributor Author):

I'm wondering whether the sort_spill_reservation_bytes configuration will still be necessary after this change. Since we're now estimating the memory required for the merge phase based on the actual ratio measured from the first batch, the manual override may become less important in practice.
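
For reference, the config under discussion can be set per session. A small illustration, assuming the `SessionConfig::with_sort_spill_reservation_bytes` setter available in recent DataFusion versions (verify against your version):

```rust
use datafusion::prelude::SessionConfig;

fn main() {
    // Manually pre-reserve 16 MiB for the merge phase of external sorts.
    let _config = SessionConfig::new().with_sort_spill_reservation_bytes(16 * 1024 * 1024);
}
```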

@ding-young ding-young changed the title [Draft] More accurate memory accounting in external sort More accurate memory accounting in external sort Aug 20, 2025
@ding-young ding-young changed the title More accurate memory accounting in external sort Improve memory accounting for cursor in external sort Aug 20, 2025
@ding-young ding-young marked this pull request as ready for review August 20, 2025 03:13
@2010YOUY01 (Contributor) left a comment:


Thanks for the work. The idea looks good to me; I'm trying to figure out why the last batch can be too large.

BTW, what's the reason for the remaining sort_tpch query failures?

@@ -88,6 +88,10 @@ pub struct StreamingMergeBuilder<'a> {
    fetch: Option<usize>,
    reservation: Option<MemoryReservation>,
    enable_round_robin_tie_breaker: bool,
    /// Ratio of memory used by the cursor batch to the original input RecordBatch.
    /// Used in `get_reserved_byte_for_record_batch_size` to estimate the memory required for the merge phase.
    /// Only passed when constructing MultiLevelMergeBuilder
Suggested change:
-/// Only passed when constructing MultiLevelMergeBuilder
+/// Only passed when constructing MultiLevelMergeBuilder
+///
+/// A cursor is an interface for comparing entries between two batches.
+/// It includes the original batch and the sort keys converted to Arrow Row format
+/// for faster comparison. See `cursor.rs` for more details.
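
For readers unfamiliar with the row format the suggestion refers to, here is a self-contained illustration using the public `arrow::row` API (not DataFusion's internal cursor types) of why row comparisons are cheap:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::datatypes::DataType;
use arrow::error::ArrowError;
use arrow::row::{RowConverter, SortField};

fn main() -> Result<(), ArrowError> {
    // Convert a sort-key column into the Arrow row format.
    let converter = RowConverter::new(vec![SortField::new(DataType::Int32)])?;
    let col: ArrayRef = Arc::new(Int32Array::from(vec![3, 1, 2]));
    let rows = converter.convert_columns(&[col])?;

    // Comparing rows is a plain byte comparison, with no per-type dispatch.
    assert!(rows.row(1) < rows.row(0)); // value 1 sorts before value 3
    Ok(())
}
```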

Comment on lines +361 to +377
// Only for the first batch
if self.cursor_batch_ratio.is_none() {
    let ratio = self.calculate_ratio(&input)?;
    self.cursor_batch_ratio = Some(ratio);
}
Contributor:

Can we defer this until we actually need it? If each sort only gets a single batch, or we sort in place, we will convert it to rows without need.

@ding-young (Contributor Author):

> BTW, what's the reason for the remaining sort_tpch query failures?

There are several different failure points, but here are the main types I've observed so far:

(1) During insert_batch, we hit a memory limit and attempt to call sort_and_spill_in_mem_batches, which then fails.

(2) During sort, when there have already been previous spills in insert_batch, we try to call sort_and_spill_in_mem_batches again before merging via StreamingMergeBuilder, and it fails.

(3) In multi-level merge, when we can no longer reduce the number of spill files or buffer size to fit the memory limit.

For (1) and (2), the core issue is that in order to spill in-memory batches, we need to sort them via streaming merge using a loser tree. But since all in-memory batches are already buffered at that point, attempting to merge them all as loser tree inputs further increases memory pressure, leading to failure.

I’m currently experimenting with limiting the fanout (merge degree) of the loser tree to see if this can help mitigate the problem.
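
Illustrative only (not code from this PR): capping the merge degree means spill runs are merged in groups of at most `max_fanout`, cascading over multiple passes instead of buffering every run at once. A hypothetical helper for intuition:

```rust
/// Number of cascaded merge passes needed when at most `max_fanout`
/// runs are merged at a time.
fn merge_passes(mut runs: usize, max_fanout: usize) -> usize {
    assert!(max_fanout >= 2);
    let mut passes = 0;
    while runs > 1 {
        // Each pass merges groups of up to `max_fanout` runs into one run each.
        runs = runs.div_ceil(max_fanout);
        passes += 1;
    }
    passes
}

fn main() {
    // 100 spill runs with fanout 8 need three passes (100 -> 13 -> 2 -> 1),
    // but each pass only holds up to 8 runs in memory at once.
    assert_eq!(merge_passes(100, 8), 3);
}
```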

let sort_res = self.reservation.try_grow(batch_size);

// if the cursor is smaller than half of the original batch, we may say that 2x the batch is enough for both the sort and merge phases
let cursor_small = self
Contributor:

Here it's doing: if rows/original_batch < 1.0, round the row-size estimation up to 1.0 instead of using the actual ratio. Why this round-up?

BTW, is this PR ready to review now?

ding-young (Contributor Author):


The round-up here is still a heuristic. It's hard to provide a strict guarantee that StreamingMerge won't fail, even if we use the estimated ratio to increase merge_reservation. So for now, I kept the conservative approach of using the original 2x ratio when the estimated ratio is < 1.0. It's a temporary workaround and might be revised later.
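
A minimal sketch of this heuristic (illustrative names, not the PR's exact code): fall back to the conservative 2x estimate whenever the measured ratio is below 1.0.

```rust
/// Hypothetical reservation helper: reserve for the original batch plus
/// its cursor, rounding the ratio up to 1.0 (i.e. a 2x total) when the
/// measured cursor/batch ratio is small or not yet known.
fn reserved_bytes_for_batch(batch_size: usize, cursor_batch_ratio: Option<f64>) -> usize {
    match cursor_batch_ratio {
        // Cursor measured to be at least as large as the batch: use the real ratio.
        Some(ratio) if ratio >= 1.0 => (batch_size as f64 * (1.0 + ratio)).ceil() as usize,
        // Small or unknown ratio: keep the conservative 2x estimate.
        _ => 2 * batch_size,
    }
}
```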

I still need to clean up some of the comments and test code, so it'd be great if you could take a look at the core logic for now.

Side note: this PR does fix the sort-tpch Q5 issue mentioned above, but based on my recent investigation into the StringViewArray memory usage problem, it doesn't seem to resolve the failures in the other queries (Q3, Q7-Q11, which include VARCHAR columns) once the memory estimation for StringViewArray is made correct.

Labels: core (Core DataFusion crate), physical-plan (Changes to the physical-plan crate)
Development

Successfully merging this pull request may close these issues.

Sort-tpch Q5 fails when memory-limit is intermediate while succeeds with smaller memory