
Re-enable and extend chunked pack benchmarks#781

Open
nirandaperera wants to merge 18 commits into rapidsai:main from nirandaperera:reenable-bench-pack

Conversation


@nirandaperera nirandaperera commented Jan 12, 2026

Since NVIDIA/cccl#7006 is merged, we should be able to re-enable the chunked pack benchmarks that use pinned memory.

This PR also extends the pack benchmarks to copy the packed data to a destination HostBuffer (pageable or pinned host memory). This gives a more representative picture for spilling.

Latest results

https://docs.google.com/spreadsheets/d/1yXiB3aFZO8GUD4dAVnwh7o9zjzXKwaSvGYGZphf9jgQ/edit?usp=sharing

Previous results

Workstation RTX A6000, driver 580.105.08, CUDA 13.0

[benchmark results image]

PDX H100, driver 535.216.03, CUDA 13.1 (using cuda-compat)

[benchmark results image]

Looking at these results, if we consider the spilling scenario where we pack and copy to host/pinned host memory, for a 1GB table:

| A6000 benchmark | Result | H100 benchmark | Result |
| --- | --- | --- | --- |
| BM_Pack_device_copy_to_pinned_host | 22,760.87 | BM_Pack_device_copy_to_pinned_host | 45,062.92 |
| BM_ChunkedPack_device_copy_to_pinned_host | 22,483.63 | BM_ChunkedPack_device_copy_to_pinned_host | 43,336.96 |
| BM_ChunkedPack_device_copy_to_host | 21,823.79 | BM_ChunkedPack_pinned_copy_to_pinned_host | 22,014.05 |
| BM_Pack_device_copy_to_host | 21,011.26 | BM_ChunkedPack_device_copy_to_host | 20,057.96 |
| BM_ChunkedPack_pinned_copy_to_pinned_host | 11,564.92 | BM_ChunkedPack_pinned_copy_to_host | 14,565.09 |
| BM_ChunkedPack_pinned_copy_to_host | 11,346.53 | BM_Pack_device_copy_to_host | 14,189.53 |
| BM_Pack_pinned_copy_to_pinned_host | 9,027.17 | BM_Pack_pinned_copy_to_pinned_host | 7,902.45 |
| BM_Pack_pinned_copy_to_host | 8,462.00 | BM_Pack_pinned_copy_to_host | 869.39 |

Signed-off-by: niranda perera <niranda.perera@gmail.com>
@nirandaperera nirandaperera requested a review from a team as a code owner January 12, 2026 21:54
@nirandaperera nirandaperera added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Jan 12, 2026
@nirandaperera nirandaperera changed the title Re-enable chunked pack benchmarks Re-enable and extend chunked pack benchmarks Jan 13, 2026

nirandaperera commented Jan 13, 2026

The latest results on my workstation. I need to verify the results on an H100 machine as well.

[benchmark results image]

```cpp
);
RAPIDSMPF_CUDA_TRY(cudaMemcpyAsync(
    static_cast<std::uint8_t*>(destination.data()) + offset,
    reinterpret_cast<std::uint8_t*>(destination.data()) + offset,
```
@madsbk madsbk (Member) Jan 13, 2026

isn't destination.data() a std::byte* already?
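If `data()` does return `std::byte*`, the casts above are unnecessary: plain pointer arithmetic works, and `cudaMemcpyAsync` takes `void*`, so the pointer converts implicitly at the call site. A minimal host-only sketch (the `HostBufferLike` type and `copy_into_at` helper are illustrative stand-ins, not rapidsmpf's API; `std::memcpy` stands in for `cudaMemcpyAsync`):

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a buffer whose data() returns std::byte*.
struct HostBufferLike {
    std::vector<std::byte> storage;
    std::byte* data() { return storage.data(); }
};

// With std::byte* there is no need for static_cast/reinterpret_cast before
// offsetting; std::byte* + offset converts implicitly to void* at the copy.
void copy_into_at(HostBufferLike& dst, std::size_t offset,
                  void const* src, std::size_t n) {
    std::memcpy(dst.data() + offset, src, n);  // stand-in for cudaMemcpyAsync
}
```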

@nirandaperera

Following are the results from PDX H100 nodes (with cuda-compat 13.1)

[benchmark results image]


nirandaperera commented Jan 14, 2026

I think I found the issue behind the poor cudf::pack performance with a pinned mr. It turns out that the pinned memory pool is VERY slow to start up, so the first bench iteration takes ~1s. Since I had not set any minimum limits, it stops there and only reports the first iteration's results 😢 When I set min_time=4s and warm_up_time=1s, this discrepancy (falling off a cliff) goes away.

Updated results

@wence- wence- (Contributor) left a comment

A few comments

```cpp
// Warm up
auto warm_up = cudf::pack(table.view(), stream, pack_mr);

rapidsmpf::HostBuffer dest(warm_up.gpu_data->size(), stream, dest_mr);
```

You probably want to ensure the dest is paged in by also memcopying into it from the packed warm_up buffer.

Comment on lines 191 to 193:

```cpp
rmm::mr::pool_memory_resource<rmm::mr::cuda_async_memory_resource> pool_mr{
    cuda_mr, rmm::percent_of_free_device_memory(40)
};
```

Throughout, this is a mad memory resource, and not one we ever use.

Just use the async memory resource.


@wence- Hmm? Aren't we using pool as the default mr in bench shuffle?
https://github.com/rapidsai/rapidsmpf/blob/main/cpp/benchmarks/bench_shuffle.cpp#L269

Comment on lines 366 to 368:

```cpp
// Bounce buffer size: max(1MB, table_size / 10)
auto const bounce_buffer_size = std::max(MB, table_size_bytes / 10);
```

I think this is a bad model for the bounce buffer size. I don't think we want to scale it with the table size, but rather have a fixed-size bounce buffer. That way, if we're using a putative fixed-size pinned host resource, each chunk neatly fits into a block from that host resource.
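The fixed-size staging loop being suggested can be sketched on the host as follows. This is illustrative only: `copy_via_bounce_buffer` is a hypothetical name, `src`/`dst` stand in for the chunked_pack output and the host destination, and `std::memcpy` stands in for the two `cudaMemcpyAsync` calls per chunk.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// A fixed-size bounce buffer (independent of the table size) shuttles the
// packed data chunk by chunk from src to dst. Returns the number of chunked
// copies issued, which grows with the table while each chunk stays block-sized.
std::size_t copy_via_bounce_buffer(std::vector<std::byte> const& src,
                                   std::vector<std::byte>& dst,
                                   std::size_t bounce_buffer_size) {
    std::vector<std::byte> bounce(bounce_buffer_size);
    std::size_t n_copies = 0;
    for (std::size_t offset = 0; offset < src.size(); offset += bounce_buffer_size) {
        std::size_t const n = std::min(bounce_buffer_size, src.size() - offset);
        std::memcpy(bounce.data(), src.data() + offset, n);  // stage into bounce
        std::memcpy(dst.data() + offset, bounce.data(), n);  // drain to destination
        ++n_copies;
    }
    return n_copies;
}
```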


Okay, let me run with a fixed-size bounce buffer. The reason I didn't go ahead with that previously is that the fixed-size buffer resource proposed in cucascade is 1MB, and I felt there would be too many calls to cudaMemcpyAsync. But I should have run a benchmark rather than assuming things.


I think this is still unresolved, or did the code change here? It still looks like a variable bounce buffer size based on the input size, where Lawrence suggests using a fixed buffer size, which I agree seems more realistic and easier to manage.

@nirandaperera

@wence- @madsbk I think I have added all the combinations (almost) now. Can you take another look?

@nirandaperera

Latest results.

My revised conclusions (based on destination reservation):

For variable-sized destination buffers:

1. device
   - Use pack directly. It's significantly faster than chunked_pack. Since the output reservation is provided, we can assume at least O(table size) device memory is available.
   - chunked_pack to destination buffer offsets reaches pack performance for larger tables, so maybe we can consider it for larger tables.
2. pinned host
   - pack using a pinned mr is faster for smaller tables (<100MB), but chunked_pack to destination pinned buffer offsets is faster and stable for larger tables (this may be because the pack destination buffer allocation time is included in the timings).
   - pack to device then copying to a pinned buffer, and chunked_pack via a device bounce buffer, have comparable performance, but not better.
3. host
   - pack to device then copying, and chunked_pack via a device bounce buffer, have very similar performance. Similarly, pack to pinned then copying, and chunked_pack via a pinned bounce buffer, have very similar performance, but are slower than device bounce buffers.
   - So we can always use chunked_pack and pick the bounce buffer based on availability.

The impact of the bounce buffer size in chunked_pack is shown here.

On H100, if the destination buffer is:

- pinned host: performance increases and saturates around 21GB/s for >8MB bounce buffers for a 1GB table
- device: performance increases and saturates around 600GB/s for ~512MB bounce buffers for a 1GB table (so this essentially reaches pack)

In both cases, smaller bounce buffer sizes yield subpar performance.

So my take here is that we cannot rely on chunked_pack to directly pack into small (1MB) preallocated fixed-size pinned pools. We could:

- Increase the pinned buffer sizes to ~4MB (this would be inefficient for smaller buffers)
- Pack to pinned/device memory (if available), and then async batch-copy to the smaller buffers.
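The "pick the bounce buffer based on availability" idea could be sketched like this. Everything here is an illustrative assumption, not rapidsmpf code: the enum, function names, and the 8 MiB floor (taken from the pinned-host saturation point observed above) are all hypothetical.

```cpp
#include <cstddef>

constexpr std::size_t MiB = std::size_t{1} << 20;

enum class MemKind { Device, Pinned, Host };

// Prefer the fastest staging memory that is actually available; the results
// above rank device bounce buffers first, then pinned, then pageable host.
MemKind pick_bounce_buffer_kind(bool device_available, bool pinned_available) {
    if (device_available) return MemKind::Device;
    if (pinned_available) return MemKind::Pinned;
    return MemKind::Host;
}

// Clamp the bounce buffer up toward the observed ~8 MiB saturation floor for
// pinned destinations, without exceeding what is available.
std::size_t pick_bounce_buffer_size(std::size_t available_bytes) {
    return available_bytes < 8 * MiB ? available_bytes : 8 * MiB;
}
```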

```cpp
* @param state The benchmark state
* @param table_size_mb The size of the table in MB
* @param table_mr The memory resource for the table
* @param pack_mr The memory resource for the packed data
```

Suggested change:

```diff
-* @param pack_mr The memory resource for the packed data
+* @param pack_mr The memory resource for the packed data
+* @param dest_mr The memory resource for the destination data
```


Comment on lines +380 to +381:

```cpp
// Bounce buffer size: max(1MB, table_size / 10)
auto const bounce_buffer_size = std::max(MB, table_size_bytes / 10);
```

We probably want a similar model for bounce buffers as above. Once the model is decided, maybe make it a function to compute the result or use a constant so the same is used everywhere.
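The "make it a function" suggestion could look like the sketch below. The helper name is hypothetical; it reproduces the current max(1MB, table_size / 10) model so that switching every benchmark to a fixed size later only touches this one function.

```cpp
#include <algorithm>
#include <cstddef>

constexpr std::size_t MB = std::size_t{1} << 20;

// Single source of truth for the bounce-buffer-size policy used by all
// benchmarks; currently the variable model, trivially replaceable by a
// fixed size if that model is adopted.
constexpr std::size_t bounce_buffer_size_for(std::size_t table_size_bytes) {
    return std::max(MB, table_size_bytes / 10);
}
```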

Comment on lines +407 to +408:

```cpp
// Bounce buffer size: max(1MB, table_size / 10)
auto const bounce_buffer_size = std::max(MB, table_size_bytes / 10);
```

Ditto.

```cpp
std::size_t table_size,
rmm::device_async_resource_ref table_mr,
rmm::device_async_resource_ref pack_mr,
auto& dest_mr,
```

Why auto and not explicit like others?

Comment on lines +492 to +493:

```cpp
state.counters["bounce_buffer_mb"] =
    static_cast<double>(bounce_buffer_size) / static_cast<double>(MB);
```

This function is named run_chunked_pack_without_bounce_buffer, but here's a bounce buffer size, why?

```cpp
* @param b The benchmark to configure with arguments.
*/
void PackArguments(benchmark::internal::Benchmark* b) {
    // Test different table sizes in MB (minimum 1MB as requested)
```

As requested by whom/where? I was going to ask if there's use for smaller sizes too to show cases where performance is bad and should be avoided (if that's the case).
