Adding chunked_pack to pack table chunks when device memory is insufficient #726
nirandaperera wants to merge 19 commits into rapidsai:main from
Conversation
Signed-off-by: niranda perera <niranda.perera@gmail.com>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Signed-off-by: niranda perera <niranda.perera@gmail.com>
// TODO: there is a possibility that the bounce buffer destructor is called before the
// async copies are completed. Should we synchronize the stream here?
bounce_buf->unlock();
The operations here are stream ordered. As long as the stream used by the table is the same as the stream used by the bounce buffer, everything is correct.
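For illustration, a minimal stand-alone CUDA sketch of that stream-ordering argument (not rapidsmpf code): the deallocation is enqueued on the same stream as the copy, so it cannot run before the copy completes.

    #include <cuda_runtime.h>
    #include <vector>

    int main() {
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        std::vector<char> host(1 << 20, 0);
        void* dev = nullptr;
        cudaMallocAsync(&dev, host.size(), stream);
        cudaMemcpyAsync(dev, host.data(), host.size(), cudaMemcpyHostToDevice, stream);

        // Stream-ordered: the free is queued after the copy on the same stream,
        // so no explicit synchronization is needed before releasing the buffer.
        cudaFreeAsync(dev, stream);

        // A *different* stream (or host code touching the data) would need
        // cudaStreamSynchronize(stream) first.
        cudaStreamSynchronize(stream);
        cudaStreamDestroy(stream);
        return 0;
    }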
Signed-off-by: niranda perera <niranda.perera@gmail.com>
madsbk left a comment
Thanks @nirandaperera, overall looks good.
Co-authored-by: Mads R. B. Kristensen <madsbk@gmail.com>
Signed-off-by: niranda perera <niranda.perera@gmail.com>
@madsbk Can you take another look 🙂
if (overbooking > 0) {
    // there is not enough memory to pack the table.
    size_t avail_dev_mem = pack_res.size() - overbooking;
    RAPIDSMPF_EXPECTS(
        avail_dev_mem > 1 << 20,
        "not enough device memory for the bounce buffer",
        std::runtime_error
    );
    auto bounce_buf = br->allocate(avail_dev_mem, stream(), pack_res);

    packed_data = std::make_unique<PackedData>(
        chunked_pack(table_view(), *bounce_buf, reservation)
    );
} else {
    // if there is enough memory to pack the table, use `cudf::pack`
    auto packed_columns =
        cudf::pack(table_view(), stream(), br->device_mr());
    // clear the reservation as we are done with it.
    pack_res.clear();
    packed_data = std::make_unique<PackedData>(
        std::move(packed_columns.metadata),
        br->move(std::move(packed_columns.gpu_data), stream())
    );

    // Handle the case where `cudf::pack` allocates slightly more than
    // the input size. This can occur because cudf uses aligned
    // allocations, which may exceed the requested size. To
    // accommodate this, we allow some wiggle room.
    if (packed_data->data->size > reservation.size()) {
        if (packed_data->data->size
            <= reservation.size() + total_packing_wiggle_room(table_view()))
        {
            reservation =
                br->reserve(MemoryType::HOST, packed_data->data->size, true)
                    .first;
        }
Can we move this to a helper function? I think the nesting becomes a problem.
It would also be worth exploring whether we even need the non-chunked version.
Yeah, unless there are real performance issues with chunked_pack we should try to always use that. Can we check?
Also agreed, let's make a helper function that encapsulates this (so that we can use it outside of the TableChunk copy as well); this pattern of cudf::pack appears in many places.
Okay, let me add a benchmark.
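For reference, a rough sketch of what such a helper could look like, reusing the BufferResource / MemoryReservation calls already visible in the diff above (the name and exact interface are hypothetical, not part of this PR):

    // Hypothetical helper: re-reserve host memory when cudf::pack produced a
    // slightly larger buffer than estimated, as long as the overshoot stays
    // within the allowed wiggle room. Types mirror the rapidsmpf code quoted above.
    void grow_host_reservation_if_needed(
        BufferResource* br,
        MemoryReservation& reservation,
        std::size_t packed_size,
        std::size_t wiggle_room
    ) {
        if (packed_size > reservation.size()
            && packed_size <= reservation.size() + wiggle_room)
        {
            reservation = br->reserve(MemoryType::HOST, packed_size, true).first;
        }
    }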
 * @return The total amount of extra memory to reserve for packing.
 */
inline size_t total_packing_wiggle_room(cudf::table_view const& table) {
    return packing_wiggle_room_per_column * static_cast<size_t>(table.num_columns());
note: This is likely not enough if the table has many nested columns.
Sure. I was just mimicking the current impl we have here
https://github.com/rapidsai/rapidsmpf/blob/main/cpp/src/streaming/cudf/table_chunk.cpp#L187-L189
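If the per-top-level-column estimate turns out to be too small for nested data, one possible alternative is to count leaf columns instead of top-level columns. A sketch only: count_leaf_columns and total_packing_wiggle_room_nested are hypothetical names, while packing_wiggle_room_per_column is the existing constant from the code above.

    #include <cudf/column/column_view.hpp>
    #include <cudf/table/table_view.hpp>
    #include <cstddef>

    // Recursively count leaf columns, descending into nested children.
    inline std::size_t count_leaf_columns(cudf::column_view const& col) {
        if (col.num_children() == 0) { return 1; }
        std::size_t n = 0;
        for (cudf::size_type i = 0; i < col.num_children(); ++i) {
            n += count_leaf_columns(col.child(i));
        }
        return n;
    }

    // Scale the wiggle room by the number of leaves rather than top-level columns.
    inline std::size_t total_packing_wiggle_room_nested(cudf::table_view const& table) {
        std::size_t leaves = 0;
        for (auto const& col : table) {
            leaves += count_leaf_columns(col);
        }
        return packing_wiggle_room_per_column * leaves;
    }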
// all copies are done on the same stream, so we can omit the stream parameter
cudf::device_span<uint8_t> buf_span(
    reinterpret_cast<uint8_t*>(bounce_buf_ptr), chunk_size
Let us use cuda::std::span.
Also, the comment is meaningless: device_span has no stream parameter in its ctor.
The comment refers to the stream_view arg in L299 😇
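For illustration, the suggested change could look roughly like this (hypothetical helper; cuda::std::span comes from libcu++'s <cuda/std/span>):

    #include <cuda/std/span>
    #include <cstddef>
    #include <cstdint>

    // A span is just a (pointer, size) view over the bounce buffer; there is no
    // stream involved in constructing it.
    inline cuda::std::span<std::uint8_t> as_byte_span(void* ptr, std::size_t size) {
        return {static_cast<std::uint8_t*>(ptr), size};
    }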
        .first;
if (overbooking > 0) {
    // there is not enough memory to pack the table.
    size_t avail_dev_mem = pack_res.size() - overbooking;
issue: Integer overflow. Suppose we can't make a reservation, so pack_res.size() is zero and overbooking is estimated_memory_usage(...). Then this will be (size_t)(-overbooking).
@wence- I'm not sure if this is true. The reservation is made with overbooking:
https://github.com/nirandaperera/rapidsmpf/blob/Make-unbounded-fanout-state-spillable/cpp/src/memory/buffer_resource.cpp#L58-L64
So, it will be
{MemoryReservation(mem_type, this, size), overbooking};
I think the case you are referring to is the one without overbooking.
Overbooking is positive. Can it happen that overbooking is larger than pack_res.size()? If yes, then you have integer overflow.
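A minimal sketch of an overflow-safe variant of that subtraction (hypothetical helper; the names are illustrative only):

    #include <cstddef>

    // Clamp to zero instead of letting the unsigned subtraction wrap around
    // when overbooking exceeds the reservation size.
    inline std::size_t available_device_memory(std::size_t reserved, std::size_t overbooking) {
        return reserved > overbooking ? reserved - overbooking : 0;
    }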
Co-authored-by: Mads R. B. Kristensen <madsbk@gmail.com>
Co-authored-by: Lawrence Mitchell <wence@gmx.li>
Signed-off-by: niranda perera <niranda.perera@gmail.com>
Signed-off-by: niranda perera <niranda.perera@gmail.com>
Signed-off-by: niranda perera <niranda.perera@gmail.com>
This PR enables using cudf::chunked_pack when copying a TableChunk to host memory, if there is not enough device memory for a cudf::pack operation.
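For readers unfamiliar with cudf::chunked_pack, below is a rough, stand-alone sketch of the bounce-buffer pattern (not the rapidsmpf implementation). It assumes the three-argument chunked_pack::create(input, user_buffer_size, temp_mr) overload; newer cudf releases also accept a stream. The bounce buffer must be at least 1 MiB, which is what the 1 << 20 check in the diff above guards. Metadata handling via build_metadata() is omitted for brevity.

    #include <cudf/contiguous_split.hpp>
    #include <cudf/table/table_view.hpp>
    #include <cudf/utilities/span.hpp>
    #include <rmm/cuda_stream_view.hpp>
    #include <rmm/device_buffer.hpp>
    #include <rmm/mr/device/per_device_resource.hpp>
    #include <cuda_runtime.h>
    #include <cstdint>
    #include <vector>

    std::vector<std::uint8_t> pack_to_host(
        cudf::table_view const& table,
        std::size_t bounce_size,  // chunked_pack requires at least 1 MiB
        rmm::cuda_stream_view stream
    ) {
        auto* temp_mr = rmm::mr::get_current_device_resource();
        rmm::device_buffer bounce(bounce_size, stream);
        auto packer = cudf::chunked_pack::create(table, bounce_size, temp_mr);

        std::vector<std::uint8_t> host(packer->get_total_contiguous_size());
        std::size_t offset = 0;
        while (packer->has_next()) {
            // Fill the device bounce buffer with the next packed chunk...
            auto const n = packer->next(cudf::device_span<std::uint8_t>(
                static_cast<std::uint8_t*>(bounce.data()), bounce_size
            ));
            // ...and stage it out to host memory on the same stream.
            cudaMemcpyAsync(host.data() + offset, bounce.data(), n,
                            cudaMemcpyDeviceToHost, stream.value());
            offset += n;
        }
        stream.synchronize();
        return host;
    }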