Conversation
rjzamora
commented
Dec 3, 2025
- Based on TPCH-derived Q9 #663
- Related to Manually Construct Multi-GPU C++ TPC-H Queries #693
[EDIT: This is no longer the network design]
Note that because you're shuffling the lineitem table, I think the fanout can be bounded rather than unbounded.
I used Cursor to generate the diagram, so it's a bit hard to tell, but I think we do need an unbounded fanout after the lineitem shuffle: the first consumer immediately pulls all of the output into a groupby, while the other consumer isn't used at all until after the output of that groupby operation is joined with the orders table. The problem is that the other consumer is joined to the output of that groupby-orders join. This effectively requires us to cache the entirety of the lineitem shuffle somewhere until after the groupby-orders join has started.
For sf3k on Viking (8xH100), the first two iterations take about 21s and 6s, respectively.
@rjzamora It's best to rebase this with main ASAP. Otherwise it could become merge hell later on.
    ctx, right_msg.release&lt;rapidsmpf::streaming::TableChunk&gt;()
);

auto stream = left_chunk.stream();
My guess is that we need to do something to ensure that right_table is valid on stream before performing the cudf::hash_join on stream. We do have rapidsmpf::cuda_stream_join, which might work.
(as an aside, I think that this is one of the issues compute-sanitizer is supposed to help with).
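For reference, a minimal sketch of the event-based ordering pattern that something like rapidsmpf::cuda_stream_join would presumably implement (the function below is hypothetical; only the CUDA runtime calls are real): make `stream` wait on all work already enqueued on the stream that materialized right_table, without blocking the host.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: order `stream` after everything currently
// enqueued on `other_stream` (e.g. the stream right_table was built on),
// so a later cudf::hash_join on `stream` sees valid data.
void stream_join(cudaStream_t stream, cudaStream_t other_stream) {
    cudaEvent_t ev;
    // Timing disabled: we only need the ordering dependency.
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
    // Mark the current tail of other_stream...
    cudaEventRecord(ev, other_stream);
    // ...and make `stream` wait for it before running later work.
    cudaStreamWaitEvent(stream, ev, 0);
    // Safe to destroy immediately: the wait is already enqueued.
    cudaEventDestroy(ev);
}
```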
Okay yeah, I added cuda_stream_join, but I'll probably need to use compute-sanitizer or something to check if the sequence of operations is "correct"
auto match = joiner.semi_join(
    table.select({key_column_idx}), chunk_stream, ctx->br()->device_mr()
);
Just an observation: this differs from Q4. In Q4 we have a relatively small left / probe table, and a relatively large right / build table.
We'll need to sync this (and the other by-hand implementations) with #731, which changes how BufferResources are created.
for (auto&& r : results) {
    std::ranges::move(r.results, std::back_inserter(cols));
}
local_result = std::make_unique&lt;cudf::table&gt;(std::move(cols));
should we release table at this point?
I tried to do this by moving the chunk in a local context after this. Let me know if there is a better way.
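The "move the chunk in a local context" idea can be sketched in plain C++ (the names below are illustrative, not from the PR): moving the owning object into a nested scope ends its lifetime at the closing brace, so its memory is released right after the columns are moved out.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Illustrative stand-in for releasing a large input early: `chunk` owns
// the data; we move it into a nested scope, move its pieces out, and let
// the scope's closing brace destroy the (now-hollow) owner immediately.
std::vector<std::unique_ptr<int>> take_columns() {
    std::vector<std::unique_ptr<int>> cols;
    auto chunk = std::make_unique<std::vector<std::unique_ptr<int>>>();
    chunk->push_back(std::make_unique<int>(1));
    {
        // Moving `chunk` here mirrors "moving the chunk in a local context".
        auto local = std::move(chunk);
        for (auto&& c : *local) {
            cols.push_back(std::move(c));
        }
    }  // `local` destroyed here; only the moved-out columns survive.
    return cols;
}
```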
auto sorted = cudf::sort_by_key(
    global_result->view(),
    global_result->view().select({4, 3}),  // o_totalprice DESC, o_orderdate ASC
    {cudf::order::DESCENDING, cudf::order::ASCENDING},
    {cudf::null_order::AFTER, cudf::null_order::AFTER},
    stream,
    ctx->br()->device_mr()
);
cudf::size_type limit = std::min(100, sorted->num_rows());
auto limited = cudf::slice(sorted->view(), {0, limit})[0];
Shouldn't we use cudf::top_k here? It would be more efficient, I think.
Do we need the top 100 to be in order?
I couldn't figure out a clean way to use cudf::top_k_order when we are sorting on multiple columns. We may be able to tackle it in two phases, but I'm pretty sure we are dealing with a relatively small number of columns at this point anyway (even at sf3000). Perhaps if further profiling suggests I'm wrong about this, we can revisit?
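For what the multi-column ordering means, here is a host-side sketch (illustrative only; `Row` and `top_k` are made-up names, not cudf APIs): the (o_totalprice DESC, o_orderdate ASC) key becomes a composite comparator, with std::partial_sort standing in for an order-preserving top-k.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Row {
    double totalprice;  // sorted descending (primary key)
    int orderdate;      // sorted ascending (tie-breaker)
};

// Order-preserving top-k under a composite (DESC, ASC) key.
std::vector<Row> top_k(std::vector<Row> rows, std::size_t k) {
    k = std::min(k, rows.size());
    std::partial_sort(rows.begin(), rows.begin() + k, rows.end(),
                      [](const Row& a, const Row& b) {
                          if (a.totalprice != b.totalprice) {
                              return a.totalprice > b.totalprice;  // DESC
                          }
                          return a.orderdate < b.orderdate;  // ASC tie-break
                      });
    rows.resize(k);  // keep only the top k, already in order
    return rows;
}
```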
@rjzamora rjzamora#1 has the changes to use some of the streaming groupby / join utilities that we have on