Conversation
rjzamora
commented
Dec 3, 2025
- Based on TPCH-derived Q9 #663
- Related to Manually Construct Multi-GPU C++ TPC-H Queries #693
[EDIT: This is no longer the network design]
Note that because you're shuffling the lineitem table, I think the fanout can be bounded rather than unbounded.
I used Cursor to generate the diagram, so it's a bit hard to tell, but I think we do need an unbounded fanout after the lineitem shuffle: the first consumer immediately pulls all of the output into a groupby, while the other consumer isn't used at all until after the output of that groupby operation is joined with the orders table. The problem is that the other consumer is joined to the output of that groupby-orders join. This effectively requires us to cache the entirety of the lineitem shuffle somewhere until after the groupby-orders join has started.
For sf3k on Viking (8xH100), the first two iterations take about 21s and 6s, respectively.
@rjzamora It's best to rebase this with main ASAP. Otherwise it could become merge hell later on.
    ctx, right_msg.release&lt;rapidsmpf::streaming::TableChunk&gt;()
);

auto stream = left_chunk.stream();
My guess is that we need to do something to ensure that right_table is valid on stream before performing the cudf::hash_join on stream. We do have rapidsmpf::cuda_stream_join, which might work.
(as an aside, I think that this is one of the issues compute-sanitizer is supposed to help with).
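For reference, a minimal sketch of the event-based ordering pattern that something like rapidsmpf::cuda_stream_join would presumably implement (the function below is hypothetical; only the CUDA runtime calls are real): make `stream` wait on all work already enqueued on the stream that materialized right_table, without blocking the host.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: order `stream` after everything currently
// enqueued on `other_stream` (e.g. the stream right_table was built on),
// so a later cudf::hash_join on `stream` sees valid data.
void stream_join(cudaStream_t stream, cudaStream_t other_stream) {
    cudaEvent_t ev;
    // Timing disabled: we only need the ordering dependency.
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
    // Mark the current tail of other_stream...
    cudaEventRecord(ev, other_stream);
    // ...and make `stream` wait for it before running later work.
    cudaStreamWaitEvent(stream, ev, 0);
    // Safe to destroy immediately: the wait is already enqueued.
    cudaEventDestroy(ev);
}
```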
Okay yeah, I added cuda_stream_join, but I'll probably need to use compute-sanitizer or something to check if the sequence of operations is "correct"
auto match = joiner.semi_join(
    table.select({key_column_idx}), chunk_stream, ctx->br()->device_mr()
);
Just an observation: this differs from Q4. In Q4 we have a relatively small left / probe table, and a relatively large right / build table.
We'll need to sync this (and the other by-hand implementations) with #731, which changes how BufferResources are created.
for (auto&& r : results) {
    std::ranges::move(r.results, std::back_inserter(cols));
}
local_result = std::make_unique&lt;cudf::table&gt;(std::move(cols));
should we release table at this point?
I tried to do this by moving the chunk in a local context after this. Let me know if there is a better way.
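The "move the chunk in a local context" idea can be sketched in plain C++ (the names below are illustrative, not from the PR): moving the owning object into a nested scope ends its lifetime at the closing brace, so its memory is released right after the columns are moved out.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Illustrative stand-in for releasing a large input early: `chunk` owns
// the data; we move it into a nested scope, move its pieces out, and let
// the scope's closing brace destroy the (now-hollow) owner immediately.
std::vector<std::unique_ptr<int>> take_columns() {
    std::vector<std::unique_ptr<int>> cols;
    auto chunk = std::make_unique<std::vector<std::unique_ptr<int>>>();
    chunk->push_back(std::make_unique<int>(1));
    {
        // Moving `chunk` here mirrors "moving the chunk in a local context".
        auto local = std::move(chunk);
        for (auto&& c : *local) {
            cols.push_back(std::move(c));
        }
    }  // `local` destroyed here; only the moved-out columns survive.
    return cols;
}
```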
auto sorted = cudf::sort_by_key(
    global_result->view(),
    global_result->view().select({4, 3}),  // o_totalprice DESC, o_orderdate ASC
    {cudf::order::DESCENDING, cudf::order::ASCENDING},
    {cudf::null_order::AFTER, cudf::null_order::AFTER},
    stream,
    ctx->br()->device_mr()
);
cudf::size_type limit = std::min(100, sorted->num_rows());
auto limited = cudf::slice(sorted->view(), {0, limit})[0];
Shouldn't we use cudf::top_k here? It would be more efficient, I think.
Do we need the top 100 to be in order?
I couldn't figure out a clean way to use cudf::top_k_order when we are sorting on multiple columns. We may be able to tackle it in two phases, but I'm pretty sure we are dealing with a relatively small number of columns at this point anyway (even at sf3000). Perhaps if further profiling suggests I'm wrong about this, we can revisit?
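For what the multi-column ordering means, here is a host-side sketch (illustrative only; `Row` and `top_k` are made-up names, not cudf APIs): the (o_totalprice DESC, o_orderdate ASC) key becomes a composite comparator, with std::partial_sort standing in for an order-preserving top-k.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Row {
    double totalprice;  // sorted descending (primary key)
    int orderdate;      // sorted ascending (tie-breaker)
};

// Order-preserving top-k under a composite (DESC, ASC) key.
std::vector<Row> top_k(std::vector<Row> rows, std::size_t k) {
    k = std::min(k, rows.size());
    std::partial_sort(rows.begin(), rows.begin() + k, rows.end(),
                      [](const Row& a, const Row& b) {
                          if (a.totalprice != b.totalprice) {
                              return a.totalprice > b.totalprice;  // DESC
                          }
                          return a.orderdate < b.orderdate;  // ASC tie-break
                      });
    rows.resize(k);  // keep only the top k, already in order
    return rows;
}
```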
@rjzamora rjzamora#1 has the changes to use some of the streaming groupby / join utilities that we have on