Re-enable and extend chunked pack benchmarks#781
nirandaperera wants to merge 18 commits into rapidsai:main from reenable-bench-pack
Conversation
Signed-off-by: niranda perera <niranda.perera@gmail.com>
cpp/benchmarks/bench_pack.cpp (Outdated)
```diff
 );
 RAPIDSMPF_CUDA_TRY(cudaMemcpyAsync(
-    static_cast<std::uint8_t*>(destination.data()) + offset,
+    reinterpret_cast<std::uint8_t*>(destination.data()) + offset,
```
Isn't `destination.data()` a `std::byte*` already?
Signed-off-by: niranda perera <niranda.perera@gmail.com>
…mpf into reenable-bench-pack
I think I found the issue for the poor
```cpp
// Warm up
auto warm_up = cudf::pack(table.view(), stream, pack_mr);

rapidsmpf::HostBuffer dest(warm_up.gpu_data->size(), stream, dest_mr);
```
You probably want to ensure `dest` is paged in by also memcpying into it from the packed `warm_up` buffer.
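A sketch of that suggestion, reusing the `warm_up` and `dest` names from the diff above (untested; assumes `HostBuffer::data()` and a stream wrapper with `value()`/`synchronize()` as used elsewhere in the PR):

```cpp
// Pack once to warm up the pack path, then copy the packed bytes into the
// host destination so its pages are faulted in before the timed loop starts.
auto warm_up = cudf::pack(table.view(), stream, pack_mr);
rapidsmpf::HostBuffer dest(warm_up.gpu_data->size(), stream, dest_mr);
RAPIDSMPF_CUDA_TRY(cudaMemcpyAsync(
    dest.data(),
    warm_up.gpu_data->data(),
    warm_up.gpu_data->size(),
    cudaMemcpyDefault,
    stream.value()
));
stream.synchronize();  // warm-up copy must finish before timing begins
```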
cpp/benchmarks/bench_pack.cpp (Outdated)
```cpp
rmm::mr::pool_memory_resource<rmm::mr::cuda_async_memory_resource> pool_mr{
    cuda_mr, rmm::percent_of_free_device_memory(40)
};
```
Throughout, this is a mad memory resource, and not one we ever use.
Just use the async memory resource.
@wence- Hmm? Aren't we using pool as the default mr in bench shuffle?
https://github.com/rapidsai/rapidsmpf/blob/main/cpp/benchmarks/bench_shuffle.cpp#L269
```cpp
// Bounce buffer size: max(1MB, table_size / 10)
auto const bounce_buffer_size = std::max(MB, table_size_bytes / 10);
```
I think this is a bad model for the bounce buffer size. I don't think we want to scale it with the table size, but rather have a fixed-size bounce buffer. That way, if we're using a putative fixed-size pinned host resource, each chunk neatly fits into a block from that host resource.
Okay, let me run with a fixed-size bounce buffer. The reason I didn't go ahead with that previously was that the fixed-size buffer resource proposed in cucascade is 1MB. I felt there would be too many calls to cudaMemcpyAsync. But I should have run a benchmark rather than assuming things.
I think this is still unresolved, or did the code change here? It still looks like a variable bounce buffer size based on the input size, whereas Lawrence suggests using a fixed buffer size, which I agree seems more realistic and easier to manage.
Signed-off-by: niranda perera <niranda.perera@gmail.com>
Latest results. My revised conclusions (based on destination reservation), for variable-sized destination buffers:

The impact of the bounce buffer size in chunked_pack is shown here. On H100, if the destination buffer is,

So, my take here is that we cannot rely on chunked_pack to directly pack into small (1MB) preallocated fixed-size pinned pools.
```cpp
 * @param state The benchmark state
 * @param table_size_mb The size of the table in MB
 * @param table_mr The memory resource for the table
 * @param pack_mr The memory resource for the packed data
```
Suggested change:
```diff
  * @param pack_mr The memory resource for the packed data
+ * @param dest_mr The memory resource for the destination data
```
```cpp
// Bounce buffer size: max(1MB, table_size / 10)
auto const bounce_buffer_size = std::max(MB, table_size_bytes / 10);
```
We probably want a similar model for bounce buffers as above. Once the model is decided, maybe make it a function to compute the result, or use a constant, so the same is used everywhere.
```cpp
std::size_t table_size,
rmm::device_async_resource_ref table_mr,
rmm::device_async_resource_ref pack_mr,
auto& dest_mr,
```
Why `auto` and not explicit like the others?
```cpp
state.counters["bounce_buffer_mb"] =
    static_cast<double>(bounce_buffer_size) / static_cast<double>(MB);
```
This function is named `run_chunked_pack_without_bounce_buffer`, but there's a bounce buffer size here. Why?
```cpp
 * @param b The benchmark to configure with arguments.
 */
void PackArguments(benchmark::internal::Benchmark* b) {
    // Test different table sizes in MB (minimum 1MB as requested)
```
As requested by whom/where? I was going to ask if there's use for smaller sizes too, to show cases where performance is bad and should be avoided (if that's the case).


Since NVIDIA/cccl#7006 is merged, we should be able to re-enable the chunked pack benchmarks that use pinned memory.
This PR also extends the pack benchmarks to copy the packed data to a destination HostBuffer (host/pinned). This gives a more appropriate picture for spilling.
Latest results
https://docs.google.com/spreadsheets/d/1yXiB3aFZO8GUD4dAVnwh7o9zjzXKwaSvGYGZphf9jgQ/edit?usp=sharing
Previous results
Workstation RTX A6000 driver 580.105.08 CUDA 13.0

PDX H100 driver 535.216.03 CUDA 13.1 (using cuda-compat)

Looking at these results, if we consider the spilling scenario where we pack and copy to host/pinned host memory, for a 1GB table,