
Re-enable and extend chunked pack benchmarks#781

Open
nirandaperera wants to merge 18 commits into rapidsai:main from nirandaperera:reenable-bench-pack

Conversation


@nirandaperera nirandaperera commented Jan 12, 2026

Since NVIDIA/cccl#7006 is merged, we should be able to re-enable the chunked pack benchmarks that use pinned memory.

This PR also extends the pack benchmarks to copy the packed data to a destination HostBuffer (pageable or pinned host memory). This gives a more representative picture for spilling.

Latest results

https://docs.google.com/spreadsheets/d/1yXiB3aFZO8GUD4dAVnwh7o9zjzXKwaSvGYGZphf9jgQ/edit?usp=sharing

Previous results

Workstation RTX A6000, driver 580.105.08, CUDA 13.0

[benchmark results image]

PDX H100, driver 535.216.03, CUDA 13.1 (using cuda-compat)

[benchmark results image]

Looking at these results, if we consider the spilling scenario where we pack and copy to host/pinned host memory, for a 1GB table:

| A6000 benchmark | Result | H100 benchmark | Result |
| --- | --- | --- | --- |
| BM_Pack_device_copy_to_pinned_host | 22,760.87 | BM_Pack_device_copy_to_pinned_host | 45,062.92 |
| BM_ChunkedPack_device_copy_to_pinned_host | 22,483.63 | BM_ChunkedPack_device_copy_to_pinned_host | 43,336.96 |
| BM_ChunkedPack_device_copy_to_host | 21,823.79 | BM_ChunkedPack_pinned_copy_to_pinned_host | 22,014.05 |
| BM_Pack_device_copy_to_host | 21,011.26 | BM_ChunkedPack_device_copy_to_host | 20,057.96 |
| BM_ChunkedPack_pinned_copy_to_pinned_host | 11,564.92 | BM_ChunkedPack_pinned_copy_to_host | 14,565.09 |
| BM_ChunkedPack_pinned_copy_to_host | 11,346.53 | BM_Pack_device_copy_to_host | 14,189.53 |
| BM_Pack_pinned_copy_to_pinned_host | 9,027.17 | BM_Pack_pinned_copy_to_pinned_host | 7,902.45 |
| BM_Pack_pinned_copy_to_host | 8,462.00 | BM_Pack_pinned_copy_to_host | 869.39 |

Signed-off-by: niranda perera <niranda.perera@gmail.com>
@nirandaperera nirandaperera requested a review from a team as a code owner January 12, 2026 21:54
@nirandaperera nirandaperera added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Jan 12, 2026
@nirandaperera nirandaperera changed the title Re-enable chunked pack benchmarks Re-enable and extend chunked pack benchmarks Jan 13, 2026

nirandaperera commented Jan 13, 2026

The latest results on my workstation. I need to verify the results on an H100 machine as well.

[benchmark results image]

```cpp
);
RAPIDSMPF_CUDA_TRY(cudaMemcpyAsync(
    static_cast<std::uint8_t*>(destination.data()) + offset,
    reinterpret_cast<std::uint8_t*>(destination.data()) + offset,
```
@madsbk madsbk (Member) Jan 13, 2026

isn't destination.data() a std::byte* already?
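If `data()` does return `std::byte*`, the casts above are unnecessary: plain pointer arithmetic works, and `cudaMemcpyAsync` takes `void*`, so the pointer converts implicitly at the call site. A minimal host-only sketch (the `HostBufferLike` type and `copy_into_at` helper are illustrative stand-ins, not rapidsmpf's API; `std::memcpy` stands in for `cudaMemcpyAsync`):

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a buffer whose data() returns std::byte*.
struct HostBufferLike {
    std::vector<std::byte> storage;
    std::byte* data() { return storage.data(); }
};

// With std::byte* there is no need for static_cast/reinterpret_cast before
// offsetting; std::byte* + offset converts implicitly to void* at the copy.
void copy_into_at(HostBufferLike& dst, std::size_t offset,
                  void const* src, std::size_t n) {
    std::memcpy(dst.data() + offset, src, n);  // stand-in for cudaMemcpyAsync
}
```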

@nirandaperera

Following are the results from PDX H100 nodes (with cuda-compat 13.1)

[benchmark results image]


nirandaperera commented Jan 14, 2026

I think I found the issue behind the poor cudf::pack performance with a pinned mr. It turns out that the pinned memory pool is VERY slow to start up, so the first bench iteration takes ~1s. Since I had not set any minimum limits, it stops there and only reports the first iteration's results 😢 When I set min_time=4s and warm_up_time=1s, this discrepancy (falling off a cliff) goes away.

Updated results

@wence- wence- (Contributor) left a comment

A few comments

```cpp
// Warm up
auto warm_up = cudf::pack(table.view(), stream, pack_mr);

rapidsmpf::HostBuffer dest(warm_up.gpu_data->size(), stream, dest_mr);
```

You probably want to ensure the dest is paged in by also memcopying into it from the packed warm_up buffer.

Comment on lines 191 to 193:

```cpp
rmm::mr::pool_memory_resource<rmm::mr::cuda_async_memory_resource> pool_mr{
    cuda_mr, rmm::percent_of_free_device_memory(40)
};
```

Throughout, this is a mad memory resource, and not one we ever use.

Just use the async memory resource.


@wence- Hmm? Aren't we using pool as the default mr in bench shuffle?
https://github.com/rapidsai/rapidsmpf/blob/main/cpp/benchmarks/bench_shuffle.cpp#L269

Comment on lines 366 to 368:

```cpp
// Bounce buffer size: max(1MB, table_size / 10)
auto const bounce_buffer_size = std::max(MB, table_size_bytes / 10);
```

I think this is a bad model for the bounce buffer size. I don't think we want to scale it with the table size, but rather have a fixed-size bounce buffer. That way, if we're using a putative fixed-size pinned host resource, each chunk neatly fits into a block from that host resource.
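The fixed-size staging loop being suggested can be sketched on the host as follows. This is illustrative only: `copy_via_bounce_buffer` is a hypothetical name, `src`/`dst` stand in for the chunked_pack output and the host destination, and `std::memcpy` stands in for the two `cudaMemcpyAsync` calls per chunk.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// A fixed-size bounce buffer (independent of the table size) shuttles the
// packed data chunk by chunk from src to dst. Returns the number of chunked
// copies issued, which grows with the table while each chunk stays block-sized.
std::size_t copy_via_bounce_buffer(std::vector<std::byte> const& src,
                                   std::vector<std::byte>& dst,
                                   std::size_t bounce_buffer_size) {
    std::vector<std::byte> bounce(bounce_buffer_size);
    std::size_t n_copies = 0;
    for (std::size_t offset = 0; offset < src.size(); offset += bounce_buffer_size) {
        std::size_t const n = std::min(bounce_buffer_size, src.size() - offset);
        std::memcpy(bounce.data(), src.data() + offset, n);  // stage into bounce
        std::memcpy(dst.data() + offset, bounce.data(), n);  // drain to destination
        ++n_copies;
    }
    return n_copies;
}
```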


Okay, let me run with a fixed-size bounce buffer. The reason I didn't go ahead with that previously is that the fixed-size buffer resource proposed in cucascade is 1MB, and I felt there would be too many calls to cudaMemcpyAsync. But I should have run a benchmark rather than assuming things.


I think this is still unresolved, or did the code change here? It still looks like a variable bounce buffer size based on the input size, where Lawrence suggests using a fixed buffer size, which I agree seems more realistic and easier to manage.

@nirandaperera

@wence- @madsbk I think I have added all the combinations (almost) now. Can you take another look?

@nirandaperera

Latest results.

My revised conclusions (based on destination reservation):

For variable-sized destination buffers:

1. device
   - Use pack directly. It's significantly faster than chunked_pack. Since the output reservation is provided, we can assume at least O(table size) device memory is available.
   - chunked_pack to destination buffer offsets reaches pack performance for larger tables, so maybe we can consider it for larger tables.
2. pinned host
   - pack using a pinned mr is faster for smaller tables (<100MB), but chunked_pack to destination pinned buffer offsets is faster and stable for larger tables (this may be because the pack destination buffer allocation time is included in the timings).
   - pack to device then copying to a pinned buffer, and chunked_pack via a device bounce buffer, have comparable performance, but not better.
3. host
   - pack to device then copying, and chunked_pack via a device bounce buffer, have very similar performance. Similarly, pack to pinned then copying, and chunked_pack via a pinned bounce buffer, have very similar performance, but are slower than device bounce buffers.
   - So we can always use chunked_pack and pick the bounce buffer based on availability.

The impact of the bounce buffer size in chunked_pack is shown here.

On H100, if the destination buffer is:

- pinned host: performance increases and saturates around 21GB/s for >8MB bounce buffers for a 1GB table
- device: performance increases and saturates around 600GB/s for ~512MB bounce buffers for a 1GB table (so this essentially reaches pack)

In both cases, smaller bounce buffer sizes yield subpar performance.

So my take here is that we cannot rely on chunked_pack to directly pack into small (1MB) preallocated fixed-size pinned pools. We could:

- Increase the pinned buffer sizes to ~4MB (this would be inefficient for smaller buffers)
- Pack to pinned/device memory (if available), and then async batch-copy to the smaller buffers.
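The "pick the bounce buffer based on availability" idea could be sketched like this. Everything here is an illustrative assumption, not rapidsmpf code: the enum, function names, and the 8 MiB floor (taken from the pinned-host saturation point observed above) are all hypothetical.

```cpp
#include <cstddef>

constexpr std::size_t MiB = std::size_t{1} << 20;

enum class MemKind { Device, Pinned, Host };

// Prefer the fastest staging memory that is actually available; the results
// above rank device bounce buffers first, then pinned, then pageable host.
MemKind pick_bounce_buffer_kind(bool device_available, bool pinned_available) {
    if (device_available) return MemKind::Device;
    if (pinned_available) return MemKind::Pinned;
    return MemKind::Host;
}

// Clamp the bounce buffer up toward the observed ~8 MiB saturation floor for
// pinned destinations, without exceeding what is available.
std::size_t pick_bounce_buffer_size(std::size_t available_bytes) {
    return available_bytes < 8 * MiB ? available_bytes : 8 * MiB;
}
```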

```cpp
* @param state The benchmark state
* @param table_size_mb The size of the table in MB
* @param table_mr The memory resource for the table
* @param pack_mr The memory resource for the packed data
```

Suggested change:

```diff
-* @param pack_mr The memory resource for the packed data
+* @param pack_mr The memory resource for the packed data
+* @param dest_mr The memory resource for the destination data
```


Comment on lines +380 to +381:

```cpp
// Bounce buffer size: max(1MB, table_size / 10)
auto const bounce_buffer_size = std::max(MB, table_size_bytes / 10);
```

We probably want a similar model for bounce buffers as above. Once the model is decided, maybe make it a function to compute the result or use a constant so the same is used everywhere.
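The "make it a function" suggestion could look like the sketch below. The helper name is hypothetical; it reproduces the current max(1MB, table_size / 10) model so that switching every benchmark to a fixed size later only touches this one function.

```cpp
#include <algorithm>
#include <cstddef>

constexpr std::size_t MB = std::size_t{1} << 20;

// Single source of truth for the bounce-buffer-size policy used by all
// benchmarks; currently the variable model, trivially replaceable by a
// fixed size if that model is adopted.
constexpr std::size_t bounce_buffer_size_for(std::size_t table_size_bytes) {
    return std::max(MB, table_size_bytes / 10);
}
```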

Comment on lines +407 to +408:

```cpp
// Bounce buffer size: max(1MB, table_size / 10)
auto const bounce_buffer_size = std::max(MB, table_size_bytes / 10);
```

Ditto.

```cpp
std::size_t table_size,
rmm::device_async_resource_ref table_mr,
rmm::device_async_resource_ref pack_mr,
auto& dest_mr,
```

Why auto and not explicit like others?

Comment on lines +492 to +493:

```cpp
state.counters["bounce_buffer_mb"] =
    static_cast<double>(bounce_buffer_size) / static_cast<double>(MB);
```

This function is named run_chunked_pack_without_bounce_buffer, but here's a bounce buffer size, why?

```cpp
* @param b The benchmark to configure with arguments.
*/
void PackArguments(benchmark::internal::Benchmark* b) {
    // Test different table sizes in MB (minimum 1MB as requested)
```

As requested by whom/where? I was going to ask if there's use for smaller sizes too to show cases where performance is bad and should be avoided (if that's the case).
