TPCH-derived Q18 #705

Open
rjzamora wants to merge 26 commits into rapidsai:main from rjzamora:q-18-manual

Conversation

@rjzamora
Member

@rjzamora rjzamora commented Dec 3, 2025

@rjzamora rjzamora self-assigned this Dec 3, 2025
@rjzamora rjzamora added feature request New feature or request non-breaking Introduces a non-breaking change labels Dec 3, 2025
@copy-pr-bot

copy-pr-bot bot commented Dec 3, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@rjzamora
Member Author

rjzamora commented Dec 3, 2025

[EDIT: This is no longer the network design]
Current network design (only tested on 1xV100 at scale factors 10 through 300):

┌──────────────────────────────────────────────────────────────────────────────┐
│                              DATA SOURCES                                    │
├────────────────────┬────────────────────────┬────────────────────────────────┤
│   Read Lineitem    │     Read Orders        │      Read Customer             │
│  (l_orderkey,      │    (o_orderkey,        │     (c_custkey,                │
│   l_quantity)      │     o_custkey, ...)    │      c_name)                   │
└─────────┬──────────┴───────────┬────────────┴──────────────┬─────────────────┘
          │                      │                           │
          ▼                      │                           │
┌─────────────────────┐          │                           │
│    Shuffle (1)      │          │                           │
│   by l_orderkey     │          │                           │
│   16 partitions     │          │                           │
└─────────┬───────────┘          │                           │
          │                      │                           │
          ▼                      │                           │
┌─────────────────────┐          │                           │
│   Fanout (UNBND)    │          │                           │
└────┬───────────┬────┘          │                           │
     │           │               │                           │
     │           ▼               │                           │
     │    ┌─────────────────┐    │                           │
     │    │spillable_buffer │    │                           │
     │    │(can be spilled) │    │                           │
     │    └────────┬────────┘    │                           │
     │             │             │                           │
     ▼             │             ▼                           │
┌──────────┐       │      ┌─────────────────────┐            │
│ Groupby  │       │      │    Shuffle (2)      │            │
│ +Filter  │       │      │   by o_orderkey     │            │
│ sum>300  │       │      │   16 partitions     │            │
└────┬─────┘       │      └──────────┬──────────┘            │
     │             │                 │                       │
     │  (A)        │  (B)            │  (C)                  │
     └─────────────┼─────────────────┘                       │
                   │                                         │
                   ▼                                         │
          ┌────────────────────┐                             │
          │ Shuffle Semi-Join  │                             │
          │   (A) ⋈ (C)        │                             │
          │ partition-by-part  │                             │
          └─────────┬──────────┘                             │
                    │                                        │
                    │  (D) orders_filtered                   │
                    │                                        │
                    └────────────────┐                       │
                                     │                       │
                                     ▼                       │
                          ┌──────────────────────┐           │
                          │ Shuffle Inner Join   │           │
                          │   (D) ⋈ (B)          │           │
                          │ partition-by-part    │           │
                          └──────────┬───────────┘           │
                                     │                       │
                                     │  (E) orders_x_lineitem│
                                     │                       │
                                     └───────────┬───────────┘
                                                 │
                                                 ▼
                                     ┌───────────────────────┐
                                     │ Broadcast Inner Join  │
                                     │   Customer ⋈ (E)      │
                                     └───────────┬───────────┘
                                                 │
                                                 ▼
                                     ┌───────────────────────┐
                                     │   Reorder Columns     │
                                     └───────────┬───────────┘
                                                 │
                                                 ▼
                                     ┌───────────────────────┐
                                     │   Chunkwise Groupby   │
                                     │   (partial agg)       │
                                     └───────────┬───────────┘
                                                 │
                                                 ▼
                                     ┌───────────────────────┐
                                     │     Concatenate       │
                                     └───────────┬───────────┘
                                                 │
                                                 ▼
                                     ┌───────────────────────┐
                                     │    Final Groupby      │
                                     │   (merge partials)    │
                                     └───────────┬───────────┘
                                                 │
                                                 ▼
                                     ┌───────────────────────┐
                                     │   Sort + Limit 100    │
                                     └───────────┬───────────┘
                                                 │
                                                 ▼
                                     ┌───────────────────────┐
                                     │    Write Parquet      │
                                     └───────────────────────┘

@wence-
Contributor

wence- commented Dec 4, 2025

Note that because you're shuffling the lineitem table, I think the fanout can be bounded rather than unbounded.

@rjzamora
Member Author

rjzamora commented Dec 4, 2025

Note that because you're shuffling the lineitem table, I think the fanout can be bounded rather than unbounded.

I used Cursor to generate the diagram, so it's a bit hard to tell: I think we do need an unbounded fanout after the lineitem shuffle, because the first consumer immediately pulls all of the output into a groupby, while the other consumer isn't used at all until after the output of that groupby operation is joined with the orders table. The problem is that the other consumer is joined to the output of that groupby-orders join. This effectively requires us to cache the entirety of the lineitem shuffle somewhere until after the groupby-orders join has started.
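To make the constraint concrete, here is a minimal sketch of that buffering requirement in plain C++ (`Partition`, `Q18Stage`, and the callback names are hypothetical stand-ins, not the rapidsmpf streaming API): the groupby consumer drains the shuffle output immediately, so the only thing the join consumer can read later is a cache that ends up holding the entire shuffled lineitem table.

```cpp
// Hypothetical illustration only -- plain C++ stand-ins, not rapidsmpf code.
#include <deque>
#include <utility>
#include <vector>

struct Partition { /* packed lineitem shuffle partition */ };

struct Q18Stage {
    std::deque<Partition> cache;  // spillable buffer in the real pipeline

    // Consumer A runs first and drains the whole shuffle into the groupby.
    template <typename GroupbyFn>
    void consume_for_groupby(std::vector<Partition> shuffled, GroupbyFn groupby) {
        for (auto& p : shuffled) {
            groupby(p);                      // sum(l_quantity) by l_orderkey, filter > 300
            cache.push_back(std::move(p));   // consumer B can't run yet, so keep everything
        }
        // When the groupby finishes, `cache` holds the full shuffle output,
        // which is why the fanout is effectively unbounded.
    }

    // Consumer B only runs once the groupby result has been joined with orders.
    template <typename JoinFn>
    void consume_for_join(JoinFn join_with_filtered_orders) {
        while (!cache.empty()) {
            join_with_filtered_orders(cache.front());
            cache.pop_front();
        }
    }
};
```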

@rjzamora rjzamora added improvement Improves an existing functionality and removed feature request New feature or request labels Dec 4, 2025
@rjzamora
Member Author

rjzamora commented Dec 4, 2025

Added a q18_prefilter variation that executes Query 18 in two sequential pipelines. The first performs a groupby aggregation on the lineitem table to find the set of allowed order keys. The second phase then applies this orderkey filter to every partition before shuffling or joining anything, dramatically reducing the amount of data we need to move around. This physical plan is probably difficult to generate automatically, but it should give us a good speed-of-light (SOL) reference.
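For reference, a minimal sketch of the two phases in terms of public libcudf calls (the function names here are hypothetical, and the shuffle plumbing plus the final sum > 300 merge are elided; this is not the benchmark's actual code):

```cpp
#include <cudf/aggregation.hpp>
#include <cudf/column/column.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/copying.hpp>
#include <cudf/groupby.hpp>
#include <cudf/join.hpp>
#include <cudf/table/table.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf/utilities/span.hpp>

#include <memory>
#include <utility>
#include <vector>

// Phase 1 (per chunk): partial sum(l_quantity) grouped by l_orderkey.
// The partials are shuffled and merged elsewhere; orderkeys whose merged
// total exceeds 300 form the qualifying set.
std::pair<std::unique_ptr<cudf::table>, std::unique_ptr<cudf::column>>
partial_quantity_sums(cudf::column_view l_orderkey, cudf::column_view l_quantity)
{
    cudf::groupby::groupby gb{cudf::table_view{{l_orderkey}}};
    std::vector<cudf::groupby::aggregation_request> requests(1);
    requests[0].values = l_quantity;
    requests[0].aggregations.push_back(
        cudf::make_sum_aggregation<cudf::groupby_aggregation>());
    auto [keys, results] = gb.aggregate(requests);
    return {std::move(keys), std::move(results[0].results[0])};
}

// Phase 2 (per partition): keep only rows whose orderkey is in the qualifying
// set, before anything is shuffled or joined.
std::unique_ptr<cudf::table> prefilter_partition(
    cudf::table_view partition,
    cudf::size_type key_col,
    cudf::column_view qualifying_orderkeys)
{
    auto gather_map = cudf::left_semi_join(
        partition.select({key_col}),
        cudf::table_view{{qualifying_orderkeys}},
        cudf::null_equality::EQUAL);
    return cudf::gather(
        partition,
        cudf::column_view{cudf::device_span<cudf::size_type const>{*gather_map}},
        cudf::out_of_bounds_policy::DONT_CHECK);
}
```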

For sf3k on Viking (8xH100), the first two iterations take about 21s and 6s, respectively:

FORCE_UCX_NET_DEVICES=unset LIBCUDF_NUM_HOST_WORKERS=4 KVIKIO_NTHREADS=4 UCX_TLS=^sm UCX_MAX_RNDV_RAILS=1 mpiexec -n 8 ./binder.sh ${BENCH_DIR}/q18_prefilter --input-directory $DATA_DIR --output-file q18-debug-sf3000.pq --spill-device-limit 0.9 --use-shuffle --num-partitions 256
Output
MPI Rank 5 host viking-prod-206 GPU=5 cores=70,71,72,73,74,75,76,77,78,79,80,81,82,83 membind=1 UCX_NET_DEVICES=
MPI Rank 4 host viking-prod-206 GPU=4 cores=56,57,58,59,60,61,62,63,64,65,66,67,68,69 membind=1 UCX_NET_DEVICES=
MPI Rank 0 host viking-prod-206 GPU=0 cores=0,1,2,3,4,5,6,7,8,9,10,11,12,13 membind=0 UCX_NET_DEVICES=
MPI Rank 6 host viking-prod-206 GPU=6 cores=84,85,86,87,88,89,90,91,92,93,94,95,96,97 membind=1 UCX_NET_DEVICES=
MPI Rank 7 host viking-prod-206 GPU=7 cores=98,99,100,101,102,103,104,105,106,107,108,109,110,111 membind=1 UCX_NET_DEVICES=
MPI Rank 3 host viking-prod-206 GPU=3 cores=42,43,44,45,46,47,48,49,50,51,52,53,54,55 membind=0 UCX_NET_DEVICES=
MPI Rank 2 host viking-prod-206 GPU=2 cores=28,29,30,31,32,33,34,35,36,37,38,39,40,41 membind=0 UCX_NET_DEVICES=
MPI Rank 1 host viking-prod-206 GPU=1 cores=14,15,16,17,18,19,20,21,22,23,24,25,26,27 membind=0 UCX_NET_DEVICES=
[PRINT:2:0] Q18 Pre-filter Benchmark
[PRINT:2:0] Executor has 4 threads
[PRINT:2:0] Executor has 8 ranks
[PRINT:2:0] Mode: shuffle, partitions: 256
[PRINT:2:0] Phase 1 (shuffle): Computing qualifying orderkeys with 256 partitions
[PRINT:3:0] Q18 Pre-filter Benchmark
[PRINT:3:0] Executor has 4 threads
[PRINT:3:0] Executor has 8 ranks
[PRINT:3:0] Mode: shuffle, partitions: 256
[PRINT:3:0] Phase 1 (shuffle): Computing qualifying orderkeys with 256 partitions
[PRINT:5:0] Q18 Pre-filter Benchmark
[PRINT:5:0] Executor has 4 threads
[PRINT:5:0] Executor has 8 ranks
[PRINT:5:0] Mode: shuffle, partitions: 256
[PRINT:5:0] Phase 1 (shuffle): Computing qualifying orderkeys with 256 partitions
[PRINT:7:0] Q18 Pre-filter Benchmark
[PRINT:7:0] Executor has 4 threads
[PRINT:7:0] Executor has 8 ranks
[PRINT:7:0] Mode: shuffle, partitions: 256
[PRINT:7:0] Phase 1 (shuffle): Computing qualifying orderkeys with 256 partitions
[PRINT:2:0] Concatenate
[PRINT:4:0] Q18 Pre-filter Benchmark
[PRINT:4:0] Executor has 4 threads
[PRINT:4:0] Executor has 8 ranks
[PRINT:4:0] Mode: shuffle, partitions: 256
[PRINT:4:0] Phase 1 (shuffle): Computing qualifying orderkeys with 256 partitions
[PRINT:2:1] Shuffle 100
[PRINT:1:0] Q18 Pre-filter Benchmark
[PRINT:1:0] Executor has 4 threads
[PRINT:1:0] Executor has 8 ranks
[PRINT:1:0] Mode: shuffle, partitions: 256
[PRINT:1:0] Phase 1 (shuffle): Computing qualifying orderkeys with 256 partitions
[PRINT:3:0] Concatenate
[PRINT:3:1] Shuffle 100
[PRINT:5:0] Concatenate
[PRINT:5:1] Shuffle 100
[PRINT:7:0] Concatenate
[PRINT:4:0] Concatenate
[PRINT:7:1] Shuffle 100
[PRINT:4:1] Shuffle 100
[PRINT:1:0] Concatenate
[PRINT:1:1] Shuffle 100
[PRINT:6:0] Q18 Pre-filter Benchmark
[PRINT:6:0] Executor has 4 threads
[PRINT:6:0] Executor has 8 ranks
[PRINT:6:0] Mode: shuffle, partitions: 256
[PRINT:6:0] Phase 1 (shuffle): Computing qualifying orderkeys with 256 partitions
[PRINT:0:0] Q18 Pre-filter Benchmark
[PRINT:0:0] Executor has 4 threads
[PRINT:0:0] Executor has 8 ranks
[PRINT:0:0] Mode: shuffle, partitions: 256
[PRINT:0:0] Phase 1 (shuffle): Computing qualifying orderkeys with 256 partitions
[PRINT:6:0] Concatenate
[PRINT:0:0] Concatenate
[PRINT:0:1] Shuffle 100
[PRINT:6:1] Shuffle 100
[PRINT:5:2] chunkwise_groupby: rank processed 8 chunks, 743997041 -> 186000000 rows
[PRINT:0:1] chunkwise_groupby: rank processed 8 chunks, 755944928 -> 189000000 rows
[PRINT:2:1] chunkwise_groupby: rank processed 8 chunks, 755998206 -> 189000000 rows
[PRINT:7:2] chunkwise_groupby: rank processed 8 chunks, 743997625 -> 186000000 rows
[PRINT:6:2] chunkwise_groupby: rank processed 8 chunks, 743997701 -> 186000000 rows
[PRINT:3:2] chunkwise_groupby: rank processed 8 chunks, 756024820 -> 189000000 rows
[PRINT:4:2] chunkwise_groupby: rank processed 8 chunks, 743979026 -> 186000000 rows
[PRINT:1:2] chunkwise_groupby: rank processed 8 chunks, 756050362 -> 189000000 rows
[PRINT:2:2] final_groupby_filter: rank processing 187489000 partial aggregates
[PRINT:2:2] final_groupby_filter: rank merged to 187489000 unique orderkeys
[PRINT:2:2] final_groupby_filter: 7939 qualifying orderkeys (sum > 300)
[PRINT:3:3] final_groupby_filter: rank processing 187488260 partial aggregates
[PRINT:0:2] final_groupby_filter: rank processing 187513727 partial aggregates
[PRINT:0:2] final_groupby_filter: rank merged to 187513727 unique orderkeys
[PRINT:3:3] final_groupby_filter: rank merged to 187488260 unique orderkeys
[PRINT:6:2] final_groupby_filter: rank processing 187529132 partial aggregates
[PRINT:0:2] final_groupby_filter: 8109 qualifying orderkeys (sum > 300)
[PRINT:3:3] final_groupby_filter: 7769 qualifying orderkeys (sum > 300)
[PRINT:5:3] final_groupby_filter: rank processing 187502573 partial aggregates
[PRINT:4:3] final_groupby_filter: rank processing 187492846 partial aggregates
[PRINT:1:3] final_groupby_filter: rank processing 187495546 partial aggregates
[PRINT:4:3] final_groupby_filter: rank merged to 187492846 unique orderkeys
[PRINT:6:2] final_groupby_filter: rank merged to 187529132 unique orderkeys
[PRINT:4:3] final_groupby_filter: 7878 qualifying orderkeys (sum > 300)
[PRINT:6:2] final_groupby_filter: 8028 qualifying orderkeys (sum > 300)
[PRINT:5:3] final_groupby_filter: rank merged to 187502573 unique orderkeys
[PRINT:7:1] final_groupby_filter: rank processing 187488916 partial aggregates
[PRINT:1:3] final_groupby_filter: rank merged to 187495546 unique orderkeys
[PRINT:5:3] final_groupby_filter: 7976 qualifying orderkeys (sum > 300)
[PRINT:1:3] final_groupby_filter: 7857 qualifying orderkeys (sum > 300)
[PRINT:7:1] final_groupby_filter: rank merged to 187488916 unique orderkeys
[PRINT:7:1] final_groupby_filter: 7874 qualifying orderkeys (sum > 300)
[PRINT:5:0] Phase 1 complete: 63430 qualifying orderkeys
[PRINT:7:0] Phase 1 complete: 63430 qualifying orderkeys
[PRINT:6:0] Phase 1 complete: 63430 qualifying orderkeys
[PRINT:0:0] Phase 1 complete: 63430 qualifying orderkeys
[PRINT:1:0] Phase 1 complete: 63430 qualifying orderkeys
[PRINT:2:0] Phase 1 complete: 63430 qualifying orderkeys
[PRINT:3:0] Phase 1 complete: 63430 qualifying orderkeys
[PRINT:4:0] Phase 1 complete: 63430 qualifying orderkeys
[PRINT:5:0] Inner shuffle join
[PRINT:5:0] Inner shuffle join
[PRINT:5:0] Concatenate
[PRINT:5:4] Shuffle 200
[PRINT:6:0] Inner shuffle join
[PRINT:6:0] Inner shuffle join
[PRINT:6:0] Concatenate
[PRINT:6:1] Shuffle 200
[PRINT:5:4] Shuffle 201
[PRINT:6:1] Shuffle 201
[PRINT:1:0] Inner shuffle join
[PRINT:1:0] Inner shuffle join
[PRINT:1:3] Shuffle 200
[PRINT:5:4] Shuffle 202
[PRINT:1:0] Concatenate
[PRINT:7:0] Inner shuffle join
[PRINT:7:0] Inner shuffle join
[PRINT:7:0] Concatenate
[PRINT:6:1] Shuffle 202
[PRINT:1:3] Shuffle 201
[PRINT:7:1] Shuffle 200
[PRINT:5:4] Shuffle 203
[PRINT:1:3] Shuffle 202
[PRINT:6:1] Shuffle 203
[PRINT:1:3] Shuffle 203
[PRINT:0:0] Inner shuffle join
[PRINT:0:0] Inner shuffle join
[PRINT:0:2] Shuffle 200
[PRINT:0:0] Concatenate
[PRINT:7:1] Shuffle 201
[PRINT:0:2] Shuffle 201
[PRINT:7:1] Shuffle 202
[PRINT:0:2] Shuffle 202
[PRINT:7:1] Shuffle 203
[PRINT:0:2] Shuffle 203
[PRINT:4:0] Inner shuffle join
[PRINT:4:1] Shuffle 200
[PRINT:4:0] Inner shuffle join
[PRINT:4:0] Concatenate
[PRINT:4:1] Shuffle 201
[PRINT:4:1] Shuffle 202
[PRINT:4:1] Shuffle 203
[PRINT:2:0] Inner shuffle join
[PRINT:2:0] Inner shuffle join
[PRINT:2:0] Concatenate
[PRINT:2:2] Shuffle 200
[PRINT:2:2] Shuffle 201
[PRINT:2:2] Shuffle 202
[PRINT:2:2] Shuffle 203
[PRINT:3:0] Inner shuffle join
[PRINT:3:0] Inner shuffle join
[PRINT:3:3] Shuffle 200
[PRINT:3:0] Concatenate
[PRINT:3:3] Shuffle 201
[PRINT:3:3] Shuffle 202
[PRINT:3:3] Shuffle 203
[PRINT:3:1] prefilter: rank processed 192000000 -> 8101 rows (0.00421927%)
[PRINT:0:3] prefilter: rank processed 192000000 -> 8202 rows (0.00427187%)
[PRINT:5:2] prefilter: rank processed 180000000 -> 7700 rows (0.00427778%)
[PRINT:3:3] prefilter: rank processed 756024820 -> 55881 rows (0.00739142%)
[PRINT:0:3] prefilter: rank processed 755944928 -> 55951 rows (0.00740147%)
[PRINT:1:4] prefilter: rank processed 192000000 -> 8151 rows (0.00424531%)
[PRINT:1:2] prefilter: rank processed 756050362 -> 55545 rows (0.00734673%)
[PRINT:5:4] prefilter: rank processed 743997041 -> 55258 rows (0.00742718%)
[PRINT:4:4] prefilter: rank processed 192000000 -> 8058 rows (0.00419688%)
[PRINT:4:2] prefilter: rank processed 743979026 -> 55433 rows (0.00745088%)
[PRINT:2:3] prefilter: rank processed 192000000 -> 8050 rows (0.00419271%)
[PRINT:2:2] prefilter: rank processed 755998206 -> 55685 rows (0.00736576%)
[PRINT:7:1] prefilter: rank processed 180000000 -> 7584 rows (0.00421333%)
[PRINT:6:3] prefilter: rank processed 180000000 -> 7584 rows (0.00421333%)
[PRINT:6:1] prefilter: rank processed 743997701 -> 55594 rows (0.00747233%)
[PRINT:7:1] prefilter: rank processed 743997625 -> 54663 rows (0.0073472%)
[PRINT:2:0] Iteration 0 Phase 1 (groupby+filter) [s]: 17.8914
[PRINT:2:0] Iteration 0 Phase 2 build [s]: 0.00196167
[PRINT:2:0] Iteration 0 Phase 2 run [s]: 3.81605
[PRINT:2:0] Iteration 0 TOTAL [s]: 21.7094
[PRINT:2:0] Statistics:
 - event-loop-check-future-finish:       249.18 ms (avg 962.32 ns)
 - event-loop-init-gpu-data-send:        874.20 ms (avg 3.38 us)
 - event-loop-metadata-recv:             285.68 ms (avg 1.10 us)
 - event-loop-metadata-send:             1.85 s (avg 7.16 us)
 - event-loop-post-incoming-chunk-recv:  3.28 s (avg 12.65 us)
 - event-loop-total:                     13.43 s (avg 25.95 us)
 - shuffle-payload-recv:                 3.32 GiB (avg 339.34 KiB)
 - shuffle-payload-send:                 3.34 GiB (avg 339.86 KiB)
 - spill-bytes-host-to-device:           0.00 B (avg 0.00 B)
 - spill-time-host-to-device:            9.34 ms (avg 58.02 us)

Memory Profiling
----------------
Legends:
  ncalls - number of times the scope was executed.
  peak   - peak memory usage by the scope.
  g-peak - global peak memory usage during the scope's execution.
  accum  - total accumulated memory allocations by the scope.

Ordered by: peak (descending)

  ncalls        peak      g-peak       accum  filename:lineno(name)
       1   15.12 GiB   15.12 GiB  101.71 GiB  main (all allocations using RmmResourceAdaptor)
      51    1.05 GiB    1.05 GiB    8.41 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:93(partition_and_pack)
      51  536.61 MiB  536.61 MiB    3.35 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:132(split_and_pack)
     161   89.48 MiB   89.48 MiB    3.33 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:163(unpack_and_concat)

Limitation:
  - A scope only tracks allocations made by the thread that entered it.

[PRINT:7:0] Iteration 0 Phase 1 (groupby+filter) [s]: 17.8889
[PRINT:7:0] Iteration 0 Phase 2 build [s]: 0.00240772
[PRINT:7:0] Iteration 0 Phase 2 run [s]: 3.81773
[PRINT:7:0] Iteration 0 TOTAL [s]: 21.709
[PRINT:1:0] Iteration 0 Phase 1 (groupby+filter) [s]: 17.8884
[PRINT:1:0] Iteration 0 Phase 2 build [s]: 0.00239001
[PRINT:1:0] Iteration 0 Phase 2 run [s]: 3.81712
[PRINT:1:0] Iteration 0 TOTAL [s]: 21.7079
[PRINT:7:0] Statistics:
 - event-loop-check-future-finish:       161.76 ms (avg 639.00 ns)
 - event-loop-init-gpu-data-send:        1.94 s (avg 7.67 us)
 - event-loop-metadata-recv:             879.34 ms (avg 3.47 us)
 - event-loop-metadata-send:             1.81 s (avg 7.16 us)
 - event-loop-post-incoming-chunk-recv:  2.64 s (avg 10.45 us)
 - event-loop-total:                     15.20 s (avg 30.89 us)
 - shuffle-payload-recv:                 3.32 GiB (avg 337.05 KiB)
 - shuffle-payload-send:                 3.30 GiB (avg 338.33 KiB)
 - spill-bytes-host-to-device:           0.00 B (avg 0.00 B)
 - spill-time-host-to-device:            6.52 ms (avg 40.49 us)

Memory Profiling
----------------
Legends:
  ncalls - number of times the scope was executed.
  peak   - peak memory usage by the scope.
  g-peak - global peak memory usage during the scope's execution.
  accum  - total accumulated memory allocations by the scope.

Ordered by: peak (descending)

  ncalls        peak      g-peak       accum  filename:lineno(name)
       1   12.36 GiB   12.36 GiB   99.73 GiB  main (all allocations using RmmResourceAdaptor)
      51    1.05 GiB    1.05 GiB    8.30 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:93(partition_and_pack)
      51  536.61 MiB  536.61 MiB    3.31 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:132(split_and_pack)
     161   89.48 MiB   89.48 MiB    3.33 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:163(unpack_and_concat)

Limitation:
  - A scope only tracks allocations made by the thread that entered it.

[PRINT:3:0] Iteration 0 Phase 1 (groupby+filter) [s]: 17.8916
[PRINT:3:0] Iteration 0 Phase 2 build [s]: 0.00244337
[PRINT:3:0] Iteration 0 Phase 2 run [s]: 3.81564
[PRINT:3:0] Iteration 0 TOTAL [s]: 21.7097
[PRINT:1:0] Statistics:
 - event-loop-check-future-finish:       161.23 ms (avg 473.39 ns)
 - event-loop-init-gpu-data-send:        2.31 s (avg 6.80 us)
 - event-loop-metadata-recv:             766.09 ms (avg 2.25 us)
 - event-loop-metadata-send:             684.58 ms (avg 2.01 us)
 - event-loop-post-incoming-chunk-recv:  3.31 s (avg 9.73 us)
 - event-loop-total:                     14.95 s (avg 25.21 us)
 - shuffle-payload-recv:                 3.32 GiB (avg 338.08 KiB)
 - shuffle-payload-send:                 3.34 GiB (avg 340.39 KiB)
 - spill-bytes-host-to-device:           0.00 B (avg 0.00 B)
 - spill-time-host-to-device:            6.26 ms (avg 38.86 us)

Memory Profiling
----------------
Legends:
  ncalls - number of times the scope was executed.
  peak   - peak memory usage by the scope.
  g-peak - global peak memory usage during the scope's execution.
  accum  - total accumulated memory allocations by the scope.

Ordered by: peak (descending)

  ncalls        peak      g-peak       accum  filename:lineno(name)
       1   12.83 GiB   12.83 GiB  101.72 GiB  main (all allocations using RmmResourceAdaptor)
      51    1.05 GiB    1.05 GiB    8.41 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:93(partition_and_pack)
      51  536.61 MiB  536.61 MiB    3.35 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:132(split_and_pack)
     161   89.45 MiB   89.45 MiB    3.33 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:163(unpack_and_concat)

Limitation:
  - A scope only tracks allocations made by the thread that entered it.

[PRINT:3:0] Statistics:
 - event-loop-check-future-finish:       160.08 ms (avg 444.45 ns)
 - event-loop-init-gpu-data-send:        1.75 s (avg 4.85 us)
 - event-loop-metadata-recv:             501.98 ms (avg 1.39 us)
 - event-loop-metadata-send:             2.71 s (avg 7.51 us)
 - event-loop-post-incoming-chunk-recv:  726.99 ms (avg 2.02 us)
 - event-loop-total:                     12.14 s (avg 19.29 us)
 - shuffle-payload-recv:                 3.32 GiB (avg 338.49 KiB)
 - shuffle-payload-send:                 3.34 GiB (avg 342.25 KiB)
 - spill-bytes-host-to-device:           0.00 B (avg 0.00 B)
 - spill-time-host-to-device:            9.26 ms (avg 57.52 us)

Memory Profiling
----------------
Legends:
  ncalls - number of times the scope was executed.
  peak   - peak memory usage by the scope.
  g-peak - global peak memory usage during the scope's execution.
  accum  - total accumulated memory allocations by the scope.

Ordered by: peak (descending)

  ncalls        peak      g-peak       accum  filename:lineno(name)
       1   11.06 GiB   11.06 GiB  101.71 GiB  main (all allocations using RmmResourceAdaptor)
      51    1.05 GiB    1.05 GiB    8.41 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:93(partition_and_pack)
      51  536.61 MiB  536.61 MiB    3.35 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:132(split_and_pack)
     161   89.48 MiB   89.48 MiB    3.33 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:163(unpack_and_concat)

Limitation:
  - A scope only tracks allocations made by the thread that entered it.

[PRINT:6:0] Iteration 0 Phase 1 (groupby+filter) [s]: 17.8773
[PRINT:6:0] Iteration 0 Phase 2 build [s]: 0.00248518
[PRINT:6:0] Iteration 0 Phase 2 run [s]: 3.81825
[PRINT:6:0] Iteration 0 TOTAL [s]: 21.698
[PRINT:4:0] Iteration 0 Phase 1 (groupby+filter) [s]: 17.8908
[PRINT:4:0] Iteration 0 Phase 2 build [s]: 0.00199225
[PRINT:4:0] Iteration 0 Phase 2 run [s]: 3.81637
[PRINT:4:0] Iteration 0 TOTAL [s]: 21.7092
[PRINT:5:0] Iteration 0 Phase 1 (groupby+filter) [s]: 17.8887
[PRINT:5:0] Iteration 0 Phase 2 build [s]: 0.00241324
[PRINT:5:0] Iteration 0 Phase 2 run [s]: 3.81878
[PRINT:5:0] Iteration 0 TOTAL [s]: 21.7099
[PRINT:6:0] Statistics:
 - event-loop-check-future-finish:       141.83 ms (avg 552.02 ns)
 - event-loop-init-gpu-data-send:        2.76 s (avg 10.76 us)
 - event-loop-metadata-recv:             866.96 ms (avg 3.37 us)
 - event-loop-metadata-send:             1.70 s (avg 6.60 us)
 - event-loop-post-incoming-chunk-recv:  1.45 s (avg 5.63 us)
 - event-loop-total:                     14.43 s (avg 28.22 us)
 - shuffle-payload-recv:                 3.32 GiB (avg 338.22 KiB)
 - shuffle-payload-send:                 3.30 GiB (avg 334.54 KiB)
 - spill-bytes-host-to-device:           0.00 B (avg 0.00 B)
 - spill-time-host-to-device:            7.29 ms (avg 45.25 us)

Memory Profiling
----------------
Legends:
  ncalls - number of times the scope was executed.
  peak   - peak memory usage by the scope.
  g-peak - global peak memory usage during the scope's execution.
  accum  - total accumulated memory allocations by the scope.

Ordered by: peak (descending)

  ncalls        peak      g-peak       accum  filename:lineno(name)
       1   13.33 GiB   13.33 GiB   99.73 GiB  main (all allocations using RmmResourceAdaptor)
      51    1.05 GiB    1.05 GiB    8.30 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:93(partition_and_pack)
      51  536.61 MiB  536.61 MiB    3.31 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:132(split_and_pack)
     161   89.48 MiB   89.48 MiB    3.33 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:163(unpack_and_concat)

Limitation:
  - A scope only tracks allocations made by the thread that entered it.

[PRINT:4:0] Statistics:
 - event-loop-check-future-finish:       198.81 ms (avg 673.50 ns)
 - event-loop-init-gpu-data-send:        1.51 s (avg 5.10 us)
 - event-loop-metadata-recv:             297.60 ms (avg 1.01 us)
 - event-loop-metadata-send:             1.84 s (avg 6.23 us)
 - event-loop-post-incoming-chunk-recv:  2.66 s (avg 9.01 us)
 - event-loop-total:                     13.46 s (avg 23.98 us)
 - shuffle-payload-recv:                 3.32 GiB (avg 338.67 KiB)
 - shuffle-payload-send:                 3.30 GiB (avg 337.34 KiB)
 - spill-bytes-host-to-device:           0.00 B (avg 0.00 B)
 - spill-time-host-to-device:            6.07 ms (avg 37.69 us)

Memory Profiling
----------------
Legends:
  ncalls - number of times the scope was executed.
  peak   - peak memory usage by the scope.
  g-peak - global peak memory usage during the scope's execution.
  accum  - total accumulated memory allocations by the scope.

Ordered by: peak (descending)

  ncalls        peak      g-peak       accum  filename:lineno(name)
       1   10.89 GiB   10.89 GiB  100.65 GiB  main (all allocations using RmmResourceAdaptor)
      51    1.05 GiB    1.05 GiB    8.30 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:93(partition_and_pack)
      51  536.61 MiB  536.61 MiB    3.31 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:132(split_and_pack)
     161   89.47 MiB   89.47 MiB    3.33 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:163(unpack_and_concat)

Limitation:
  - A scope only tracks allocations made by the thread that entered it.

[PRINT:5:0] Statistics:
 - event-loop-check-future-finish:       155.79 ms (avg 475.41 ns)
 - event-loop-init-gpu-data-send:        2.32 s (avg 7.07 us)
 - event-loop-metadata-recv:             991.63 ms (avg 3.03 us)
 - event-loop-metadata-send:             1.93 s (avg 5.90 us)
 - event-loop-post-incoming-chunk-recv:  617.86 ms (avg 1.89 us)
 - event-loop-total:                     12.47 s (avg 20.70 us)
 - shuffle-payload-recv:                 3.32 GiB (avg 339.03 KiB)
 - shuffle-payload-send:                 3.30 GiB (avg 335.48 KiB)
 - spill-bytes-host-to-device:           0.00 B (avg 0.00 B)
 - spill-time-host-to-device:            6.32 ms (avg 39.25 us)

Memory Profiling
----------------
Legends:
  ncalls - number of times the scope was executed.
  peak   - peak memory usage by the scope.
  g-peak - global peak memory usage during the scope's execution.
  accum  - total accumulated memory allocations by the scope.

Ordered by: peak (descending)

  ncalls        peak      g-peak       accum  filename:lineno(name)
       1    9.31 GiB    9.31 GiB   99.73 GiB  main (all allocations using RmmResourceAdaptor)
      51    1.05 GiB    1.05 GiB    8.30 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:93(partition_and_pack)
      51  536.61 MiB  536.61 MiB    3.31 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:132(split_and_pack)
     161   89.47 MiB   89.47 MiB    3.33 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:163(unpack_and_concat)

Limitation:
  - A scope only tracks allocations made by the thread that entered it.

[PRINT:0:2] Wrote 100 rows to q18-debug-sf1000.pq
[PRINT:0:0] Iteration 0 Phase 1 (groupby+filter) [s]: 17.8781
[PRINT:0:0] Iteration 0 Phase 2 build [s]: 0.00229348
[PRINT:0:0] Iteration 0 Phase 2 run [s]: 4.09129
[PRINT:0:0] Iteration 0 TOTAL [s]: 21.9717
[PRINT:0:0] Statistics:
 - event-loop-check-future-finish:       496.58 ms (avg 1.26 us)
 - event-loop-init-gpu-data-send:        2.83 s (avg 7.17 us)
 - event-loop-metadata-recv:             265.08 ms (avg 670.85 ns)
 - event-loop-metadata-send:             187.20 ms (avg 473.75 ns)
 - event-loop-post-incoming-chunk-recv:  374.14 ms (avg 946.85 ns)
 - event-loop-total:                     8.86 s (avg 12.61 us)
 - shuffle-payload-recv:                 3.32 GiB (avg 338.23 KiB)
 - shuffle-payload-send:                 3.34 GiB (avg 338.95 KiB)
 - spill-bytes-host-to-device:           0.00 B (avg 0.00 B)
 - spill-time-host-to-device:            10.95 ms (avg 67.56 us)

Memory Profiling
----------------
Legends:
  ncalls - number of times the scope was executed.
  peak   - peak memory usage by the scope.
  g-peak - global peak memory usage during the scope's execution.
  accum  - total accumulated memory allocations by the scope.

Ordered by: peak (descending)

  ncalls        peak      g-peak       accum  filename:lineno(name)
       1   11.35 GiB   11.35 GiB  101.72 GiB  main (all allocations using RmmResourceAdaptor)
      51    1.05 GiB    1.05 GiB    8.41 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:93(partition_and_pack)
      51  536.61 MiB  536.61 MiB    3.35 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:132(split_and_pack)
     162   89.47 MiB   89.47 MiB    3.33 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:163(unpack_and_concat)

Limitation:
  - A scope only tracks allocations made by the thread that entered it.

[PRINT:5:0] Phase 1 (shuffle): Computing qualifying orderkeys with 256 partitions
[PRINT:7:0] Phase 1 (shuffle): Computing qualifying orderkeys with 256 partitions
[PRINT:1:0] Phase 1 (shuffle): Computing qualifying orderkeys with 256 partitions
[PRINT:6:0] Phase 1 (shuffle): Computing qualifying orderkeys with 256 partitions
[PRINT:4:0] Phase 1 (shuffle): Computing qualifying orderkeys with 256 partitions
[PRINT:2:0] Phase 1 (shuffle): Computing qualifying orderkeys with 256 partitions
[PRINT:3:0] Phase 1 (shuffle): Computing qualifying orderkeys with 256 partitions
[PRINT:0:0] Phase 1 (shuffle): Computing qualifying orderkeys with 256 partitions
[PRINT:5:0] Concatenate
[PRINT:1:0] Concatenate
[PRINT:1:2] Shuffle 110
[PRINT:7:0] Concatenate
[PRINT:7:3] Shuffle 110
[PRINT:4:0] Concatenate
[PRINT:4:3] Shuffle 110
[PRINT:6:0] Concatenate
[PRINT:3:0] Concatenate
[PRINT:2:0] Concatenate
[PRINT:5:4] Shuffle 110
[PRINT:6:4] Shuffle 110
[PRINT:2:4] Shuffle 110
[PRINT:3:4] Shuffle 110
[PRINT:0:0] Concatenate
[PRINT:0:2] Shuffle 110
[PRINT:1:4] chunkwise_groupby: rank processed 8 chunks, 756050362 -> 189000000 rows
[PRINT:6:3] chunkwise_groupby: rank processed 8 chunks, 743997701 -> 186000000 rows
[PRINT:4:4] chunkwise_groupby: rank processed 8 chunks, 743979026 -> 186000000 rows
[PRINT:3:1] chunkwise_groupby: rank processed 8 chunks, 756024820 -> 189000000 rows
[PRINT:5:2] chunkwise_groupby: rank processed 8 chunks, 743997041 -> 186000000 rows
[PRINT:2:1] chunkwise_groupby: rank processed 8 chunks, 755998206 -> 189000000 rows
[PRINT:7:4] chunkwise_groupby: rank processed 8 chunks, 743997625 -> 186000000 rows
[PRINT:0:3] chunkwise_groupby: rank processed 8 chunks, 755944928 -> 189000000 rows
[PRINT:3:4] final_groupby_filter: rank processing 187488260 partial aggregates
[PRINT:2:4] final_groupby_filter: rank processing 187489000 partial aggregates
[PRINT:1:4] final_groupby_filter: rank processing 187495546 partial aggregates
[PRINT:1:4] final_groupby_filter: rank merged to 187495546 unique orderkeys
[PRINT:2:4] final_groupby_filter: rank merged to 187489000 unique orderkeys
[PRINT:1:4] final_groupby_filter: 7857 qualifying orderkeys (sum > 300)
[PRINT:2:4] final_groupby_filter: 7939 qualifying orderkeys (sum > 300)
[PRINT:7:4] final_groupby_filter: rank processing 187488916 partial aggregates
[PRINT:6:3] final_groupby_filter: rank processing 187529132 partial aggregates
[PRINT:4:4] final_groupby_filter: rank processing 187492846 partial aggregates
[PRINT:0:1] final_groupby_filter: rank processing 187513727 partial aggregates
[PRINT:7:4] final_groupby_filter: rank merged to 187488916 unique orderkeys
[PRINT:4:4] final_groupby_filter: rank merged to 187492846 unique orderkeys
[PRINT:6:3] final_groupby_filter: rank merged to 187529132 unique orderkeys
[PRINT:7:4] final_groupby_filter: 7874 qualifying orderkeys (sum > 300)
[PRINT:4:4] final_groupby_filter: 7878 qualifying orderkeys (sum > 300)
[PRINT:6:3] final_groupby_filter: 8028 qualifying orderkeys (sum > 300)
[PRINT:3:4] final_groupby_filter: rank merged to 187488260 unique orderkeys
[PRINT:3:4] final_groupby_filter: 7769 qualifying orderkeys (sum > 300)
[PRINT:0:1] final_groupby_filter: rank merged to 187513727 unique orderkeys
[PRINT:0:1] final_groupby_filter: 8109 qualifying orderkeys (sum > 300)
[PRINT:5:4] final_groupby_filter: rank processing 187502573 partial aggregates
[PRINT:5:4] final_groupby_filter: rank merged to 187502573 unique orderkeys
[PRINT:5:4] final_groupby_filter: 7976 qualifying orderkeys (sum > 300)
[PRINT:5:0] Phase 1 complete: 63430 qualifying orderkeys
[PRINT:6:0] Phase 1 complete: 63430 qualifying orderkeys
[PRINT:7:0] Phase 1 complete: 63430 qualifying orderkeys
[PRINT:6:0] Inner shuffle join
[PRINT:6:0] Inner shuffle join
[PRINT:6:0] Concatenate
[PRINT:6:3] Shuffle 210
[PRINT:6:3] Shuffle 211
[PRINT:0:0] Phase 1 complete: 63430 qualifying orderkeys
[PRINT:6:3] Shuffle 212
[PRINT:6:3] Shuffle 213
[PRINT:1:0] Phase 1 complete: 63430 qualifying orderkeys
[PRINT:5:0] Inner shuffle join
[PRINT:5:3] Shuffle 210
[PRINT:5:0] Inner shuffle join
[PRINT:5:0] Concatenate
[PRINT:5:3] Shuffle 211
[PRINT:5:3] Shuffle 212
[PRINT:5:3] Shuffle 213
[PRINT:2:0] Phase 1 complete: 63430 qualifying orderkeys
[PRINT:1:0] Inner shuffle join
[PRINT:1:0] Inner shuffle join
[PRINT:1:0] Concatenate
[PRINT:1:3] Shuffle 210
[PRINT:4:0] Phase 1 complete: 63430 qualifying orderkeys
[PRINT:1:3] Shuffle 211
[PRINT:3:0] Phase 1 complete: 63430 qualifying orderkeys
[PRINT:1:3] Shuffle 212
[PRINT:1:3] Shuffle 213
[PRINT:7:0] Inner shuffle join
[PRINT:7:0] Inner shuffle join
[PRINT:7:4] Shuffle 210
[PRINT:7:0] Concatenate
[PRINT:7:4] Shuffle 211
[PRINT:7:4] Shuffle 212
[PRINT:7:4] Shuffle 213
[PRINT:0:0] Inner shuffle join
[PRINT:0:0] Inner shuffle join
[PRINT:0:1] Shuffle 210
[PRINT:0:0] Concatenate
[PRINT:0:1] Shuffle 211
[PRINT:0:1] Shuffle 212
[PRINT:0:1] Shuffle 213
[PRINT:2:0] Inner shuffle join
[PRINT:2:0] Inner shuffle join
[PRINT:2:4] Shuffle 210
[PRINT:2:0] Concatenate
[PRINT:2:4] Shuffle 211
[PRINT:2:4] Shuffle 212
[PRINT:4:0] Inner shuffle join
[PRINT:4:0] Inner shuffle join
[PRINT:4:0] Concatenate
[PRINT:2:4] Shuffle 213
[PRINT:3:0] Inner shuffle join
[PRINT:3:0] Inner shuffle join
[PRINT:4:4] Shuffle 210
[PRINT:3:4] Shuffle 210
[PRINT:3:0] Concatenate
[PRINT:4:4] Shuffle 211
[PRINT:3:4] Shuffle 211
[PRINT:4:4] Shuffle 212
[PRINT:3:4] Shuffle 212
[PRINT:3:4] Shuffle 213
[PRINT:4:4] Shuffle 213
[PRINT:6:2] prefilter: rank processed 180000000 -> 7584 rows (0.00421333%)
[PRINT:3:1] prefilter: rank processed 192000000 -> 8101 rows (0.00421927%)
[PRINT:2:2] prefilter: rank processed 192000000 -> 8050 rows (0.00419271%)
[PRINT:1:1] prefilter: rank processed 192000000 -> 8151 rows (0.00424531%)
[PRINT:0:2] prefilter: rank processed 192000000 -> 8202 rows (0.00427187%)
[PRINT:4:4] prefilter: rank processed 192000000 -> 8058 rows (0.00419688%)
[PRINT:7:1] prefilter: rank processed 180000000 -> 7584 rows (0.00421333%)
[PRINT:6:2] prefilter: rank processed 743997701 -> 55594 rows (0.00747233%)
[PRINT:3:4] prefilter: rank processed 756024820 -> 55881 rows (0.00739142%)
[PRINT:1:1] prefilter: rank processed 756050362 -> 55545 rows (0.00734673%)
[PRINT:2:2] prefilter: rank processed 755998206 -> 55685 rows (0.00736576%)
[PRINT:0:4] prefilter: rank processed 755944928 -> 55951 rows (0.00740147%)
[PRINT:4:3] prefilter: rank processed 743979026 -> 55433 rows (0.00745088%)
[PRINT:7:3] prefilter: rank processed 743997625 -> 54663 rows (0.0073472%)
[PRINT:5:1] prefilter: rank processed 180000000 -> 7700 rows (0.00427778%)
[PRINT:5:4] prefilter: rank processed 743997041 -> 55258 rows (0.00742718%)
[PRINT:1:0] Iteration 1 Phase 1 (groupby+filter) [s]: 2.11504
[PRINT:1:0] Iteration 1 Phase 2 build [s]: 0.000948077
[PRINT:1:0] Iteration 1 Phase 2 run [s]: 3.68258
[PRINT:1:0] Iteration 1 TOTAL [s]: 5.79856
[PRINT:1:0] Statistics:
 - event-loop-check-future-finish:       286.63 ms (avg 537.21 ns)
 - event-loop-init-gpu-data-send:        2.76 s (avg 5.16 us)
 - event-loop-metadata-recv:             884.24 ms (avg 1.66 us)
 - event-loop-metadata-send:             883.72 ms (avg 1.66 us)
 - event-loop-post-incoming-chunk-recv:  3.74 s (avg 7.01 us)
 - event-loop-total:                     17.76 s (avg 20.56 us)
 - shuffle-payload-recv:                 6.64 GiB (avg 338.08 KiB)
 - shuffle-payload-send:                 6.69 GiB (avg 340.39 KiB)
 - spill-bytes-host-to-device:           0.00 B (avg 0.00 B)
 - spill-time-host-to-device:            12.63 ms (avg 39.21 us)

Memory Profiling
----------------
Legends:
  ncalls - number of times the scope was executed.
  peak   - peak memory usage by the scope.
  g-peak - global peak memory usage during the scope's execution.
  accum  - total accumulated memory allocations by the scope.

Ordered by: peak (descending)

  ncalls        peak      g-peak       accum  filename:lineno(name)
       1   12.83 GiB   12.83 GiB  203.43 GiB  main (all allocations using RmmResourceAdaptor)
     102    1.05 GiB    1.05 GiB   16.82 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:93(partition_and_pack)
     102  536.61 MiB  536.61 MiB    6.71 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:132(split_and_pack)
     322   89.45 MiB   89.45 MiB    6.65 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:163(unpack_and_concat)

Limitation:
  - A scope only tracks allocations made by the thread that entered it.

[PRINT:2:0] Iteration 1 Phase 1 (groupby+filter) [s]: 2.11591
[PRINT:2:0] Iteration 1 Phase 2 build [s]: 0.00233919
[PRINT:2:0] Iteration 1 Phase 2 run [s]: 3.68048
[PRINT:2:0] Iteration 1 TOTAL [s]: 5.79873
[PRINT:2:0] Statistics:
 - event-loop-check-future-finish:       358.01 ms (avg 808.08 ns)
 - event-loop-init-gpu-data-send:        1.43 s (avg 3.23 us)
 - event-loop-metadata-recv:             399.03 ms (avg 900.67 ns)
 - event-loop-metadata-send:             2.05 s (avg 4.63 us)
 - event-loop-post-incoming-chunk-recv:  3.83 s (avg 8.64 us)
 - event-loop-total:                     16.76 s (avg 21.64 us)
 - shuffle-payload-recv:                 6.64 GiB (avg 339.34 KiB)
 - shuffle-payload-send:                 6.69 GiB (avg 339.86 KiB)
 - spill-bytes-host-to-device:           0.00 B (avg 0.00 B)
 - spill-time-host-to-device:            15.20 ms (avg 47.22 us)

Memory Profiling
----------------
Legends:
  ncalls - number of times the scope was executed.
  peak   - peak memory usage by the scope.
  g-peak - global peak memory usage during the scope's execution.
  accum  - total accumulated memory allocations by the scope.

Ordered by: peak (descending)

  ncalls        peak      g-peak       accum  filename:lineno(name)
       1   15.12 GiB   15.12 GiB  203.42 GiB  main (all allocations using RmmResourceAdaptor)
     102    1.05 GiB    1.05 GiB   16.82 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:93(partition_and_pack)
     102  536.61 MiB  536.61 MiB    6.71 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:132(split_and_pack)
     322   89.48 MiB   89.48 MiB    6.65 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:163(unpack_and_concat)

Limitation:
  - A scope only tracks allocations made by the thread that entered it.

[PRINT:3:0] Iteration 1 Phase 1 (groupby+filter) [s]: 2.11639
[PRINT:3:0] Iteration 1 Phase 2 build [s]: 0.00233966
[PRINT:3:0] Iteration 1 Phase 2 run [s]: 3.68043
[PRINT:3:0] Iteration 1 TOTAL [s]: 5.79917
[PRINT:3:0] Statistics:
 - event-loop-check-future-finish:       274.42 ms (avg 503.04 ns)
 - event-loop-init-gpu-data-send:        2.20 s (avg 4.03 us)
 - event-loop-metadata-recv:             623.27 ms (avg 1.14 us)
 - event-loop-metadata-send:             2.93 s (avg 5.38 us)
 - event-loop-post-incoming-chunk-recv:  1.36 s (avg 2.49 us)
 - event-loop-total:                     15.50 s (avg 17.59 us)
 - shuffle-payload-recv:                 6.64 GiB (avg 338.49 KiB)
 - shuffle-payload-send:                 6.69 GiB (avg 342.25 KiB)
 - spill-bytes-host-to-device:           0.00 B (avg 0.00 B)
 - spill-time-host-to-device:            15.23 ms (avg 47.29 us)

Memory Profiling
----------------
Legends:
  ncalls - number of times the scope was executed.
  peak   - peak memory usage by the scope.
  g-peak - global peak memory usage during the scope's execution.
  accum  - total accumulated memory allocations by the scope.

Ordered by: peak (descending)

  ncalls        peak      g-peak       accum  filename:lineno(name)
       1   11.06 GiB   11.06 GiB  203.43 GiB  main (all allocations using RmmResourceAdaptor)
     102    1.05 GiB    1.05 GiB   16.82 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:93(partition_and_pack)
     102  536.61 MiB  536.61 MiB    6.71 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:132(split_and_pack)
     322   89.48 MiB   89.48 MiB    6.65 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:163(unpack_and_concat)

Limitation:
  - A scope only tracks allocations made by the thread that entered it.

[PRINT:4:0] Iteration 1 Phase 1 (groupby+filter) [s]: 2.11623
[PRINT:4:0] Iteration 1 Phase 2 build [s]: 0.00236146
[PRINT:4:0] Iteration 1 Phase 2 run [s]: 3.681
[PRINT:4:0] Iteration 1 TOTAL [s]: 5.7996
[PRINT:4:0] Statistics:
 - event-loop-check-future-finish:       315.24 ms (avg 661.16 ns)
 - event-loop-init-gpu-data-send:        2.08 s (avg 4.36 us)
 - event-loop-metadata-recv:             402.10 ms (avg 843.33 ns)
 - event-loop-metadata-send:             2.05 s (avg 4.30 us)
 - event-loop-post-incoming-chunk-recv:  3.22 s (avg 6.76 us)
 - event-loop-total:                     16.89 s (avg 20.72 us)
 - shuffle-payload-recv:                 6.64 GiB (avg 338.67 KiB)
 - shuffle-payload-send:                 6.60 GiB (avg 337.34 KiB)
 - spill-bytes-host-to-device:           0.00 B (avg 0.00 B)
 - spill-time-host-to-device:            12.84 ms (avg 39.88 us)

Memory Profiling
----------------
Legends:
  ncalls - number of times the scope was executed.
  peak   - peak memory usage by the scope.
  g-peak - global peak memory usage during the scope's execution.
  accum  - total accumulated memory allocations by the scope.

Ordered by: peak (descending)

  ncalls        peak      g-peak       accum  filename:lineno(name)
       1   10.94 GiB   10.94 GiB  201.30 GiB  main (all allocations using RmmResourceAdaptor)
     102    1.05 GiB    1.05 GiB   16.60 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:93(partition_and_pack)
     102  536.61 MiB  536.61 MiB    6.62 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:132(split_and_pack)
     322   89.47 MiB   89.47 MiB    6.65 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:163(unpack_and_concat)

Limitation:
  - A scope only tracks allocations made by the thread that entered it.

[PRINT:5:0] Iteration 1 Phase 1 (groupby+filter) [s]: 2.11273
[PRINT:5:0] Iteration 1 Phase 2 build [s]: 0.00240283
[PRINT:5:0] Iteration 1 Phase 2 run [s]: 3.6847
[PRINT:7:0] Iteration 1 Phase 1 (groupby+filter) [s]: 2.11394
[PRINT:5:0] Iteration 1 TOTAL [s]: 5.79983
[PRINT:7:0] Iteration 1 Phase 2 build [s]: 0.00240039
[PRINT:7:0] Iteration 1 Phase 2 run [s]: 3.68353
[PRINT:7:0] Iteration 1 TOTAL [s]: 5.79987
[PRINT:5:0] Statistics:
 - event-loop-check-future-finish:       275.69 ms (avg 760.63 ns)
 - event-loop-init-gpu-data-send:        2.97 s (avg 8.19 us)
 - event-loop-metadata-recv:             1.04 s (avg 2.88 us)
 - event-loop-metadata-send:             2.12 s (avg 5.86 us)
 - event-loop-post-incoming-chunk-recv:  3.48 s (avg 9.60 us)
 - event-loop-total:                     20.28 s (avg 30.78 us)
 - shuffle-payload-recv:                 6.64 GiB (avg 339.03 KiB)
 - shuffle-payload-send:                 6.60 GiB (avg 335.48 KiB)
 - spill-bytes-host-to-device:           0.00 B (avg 0.00 B)
 - spill-time-host-to-device:            12.69 ms (avg 39.42 us)

Memory Profiling
----------------
Legends:
  ncalls - number of times the scope was executed.
  peak   - peak memory usage by the scope.
  g-peak - global peak memory usage during the scope's execution.
  accum  - total accumulated memory allocations by the scope.

Ordered by: peak (descending)

  ncalls        peak      g-peak       accum  filename:lineno(name)
       1   12.31 GiB   12.31 GiB  199.46 GiB  main (all allocations using RmmResourceAdaptor)
     102    1.05 GiB    1.05 GiB   16.60 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:93(partition_and_pack)
     102  536.61 MiB  536.61 MiB    6.62 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:132(split_and_pack)
     322   89.47 MiB   89.47 MiB    6.65 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:163(unpack_and_concat)

Limitation:
  - A scope only tracks allocations made by the thread that entered it.

[PRINT:6:0] Iteration 1 Phase 1 (groupby+filter) [s]: 2.11326
[PRINT:6:0] Iteration 1 Phase 2 build [s]: 0.00113433
[PRINT:7:0] Statistics:
 - event-loop-check-future-finish:       292.63 ms (avg 668.17 ns)
 - event-loop-init-gpu-data-send:        2.45 s (avg 5.59 us)
 - event-loop-metadata-recv:             985.09 ms (avg 2.25 us)
 - event-loop-metadata-send:             2.02 s (avg 4.61 us)
 - event-loop-post-incoming-chunk-recv:  3.31 s (avg 7.56 us)
 - event-loop-total:                     18.67 s (avg 24.95 us)
 - shuffle-payload-recv:                 6.64 GiB (avg 337.05 KiB)
 - shuffle-payload-send:                 6.60 GiB (avg 338.33 KiB)
 - spill-bytes-host-to-device:           0.00 B (avg 0.00 B)
 - spill-time-host-to-device:            12.48 ms (avg 38.75 us)

Memory Profiling
----------------
Legends:
  ncalls - number of times the scope was executed.
  peak   - peak memory usage by the scope.
  g-peak - global peak memory usage during the scope's execution.
  accum  - total accumulated memory allocations by the scope.

Ordered by: peak (descending)

  ncalls        peak      g-peak       accum  filename:lineno(name)
       1   12.36 GiB   12.36 GiB  199.46 GiB  main (all allocations using RmmResourceAdaptor)
     102    1.05 GiB    1.05 GiB   16.60 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:93(partition_and_pack)
     102  536.61 MiB  536.61 MiB    6.62 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:132(split_and_pack)
     322   89.48 MiB   89.48 MiB    6.65 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:163(unpack_and_concat)

Limitation:
  - A scope only tracks allocations made by the thread that entered it.

[PRINT:6:0] Iteration 1 Phase 2 run [s]: 3.68562
[PRINT:6:0] Iteration 1 TOTAL [s]: 5.80001
[PRINT:6:0] Statistics:
 - event-loop-check-future-finish:       257.66 ms (avg 579.47 ns)
 - event-loop-init-gpu-data-send:        3.30 s (avg 7.43 us)
 - event-loop-metadata-recv:             965.16 ms (avg 2.17 us)
 - event-loop-metadata-send:             1.93 s (avg 4.34 us)
 - event-loop-post-incoming-chunk-recv:  2.06 s (avg 4.62 us)
 - event-loop-total:                     17.89 s (avg 23.20 us)
 - shuffle-payload-recv:                 6.64 GiB (avg 338.22 KiB)
 - shuffle-payload-send:                 6.60 GiB (avg 334.54 KiB)
 - spill-bytes-host-to-device:           0.00 B (avg 0.00 B)
 - spill-time-host-to-device:            14.10 ms (avg 43.79 us)

Memory Profiling
----------------
Legends:
  ncalls - number of times the scope was executed.
  peak   - peak memory usage by the scope.
  g-peak - global peak memory usage during the scope's execution.
  accum  - total accumulated memory allocations by the scope.

Ordered by: peak (descending)

  ncalls        peak      g-peak       accum  filename:lineno(name)
       1   13.33 GiB   13.33 GiB  199.47 GiB  main (all allocations using RmmResourceAdaptor)
     102    1.05 GiB    1.05 GiB   16.60 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:93(partition_and_pack)
     102  536.61 MiB  536.61 MiB    6.62 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:132(split_and_pack)
     322   89.48 MiB   89.48 MiB    6.65 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:163(unpack_and_concat)

Limitation:
  - A scope only tracks allocations made by the thread that entered it.

[PRINT:0:4] Wrote 100 rows to q18-debug-sf1000.pq
[PRINT:0:0] Iteration 1 Phase 1 (groupby+filter) [s]: 2.11477
[PRINT:0:0] Iteration 1 Phase 2 build [s]: 0.00239231
[PRINT:0:0] Iteration 1 Phase 2 run [s]: 3.68461
[PRINT:0:0] Iteration 1 TOTAL [s]: 5.80178
[PRINT:0:0] Statistics:
 - event-loop-check-future-finish:       611.96 ms (avg 1.06 us)
 - event-loop-init-gpu-data-send:        3.58 s (avg 6.21 us)
 - event-loop-metadata-recv:             371.44 ms (avg 643.43 ns)
 - event-loop-metadata-send:             392.29 ms (avg 679.55 ns)
 - event-loop-post-incoming-chunk-recv:  1.05 s (avg 1.82 us)
 - event-loop-total:                     12.81 s (avg 13.54 us)
 - shuffle-payload-recv:                 6.64 GiB (avg 338.23 KiB)
 - shuffle-payload-send:                 6.69 GiB (avg 338.95 KiB)
 - spill-bytes-host-to-device:           0.00 B (avg 0.00 B)
 - spill-time-host-to-device:            17.01 ms (avg 52.50 us)

Memory Profiling
----------------
Legends:
  ncalls - number of times the scope was executed.
  peak   - peak memory usage by the scope.
  g-peak - global peak memory usage during the scope's execution.
  accum  - total accumulated memory allocations by the scope.

Ordered by: peak (descending)

  ncalls        peak      g-peak       accum  filename:lineno(name)
       1   11.35 GiB   11.35 GiB  203.45 GiB  main (all allocations using RmmResourceAdaptor)
     102    1.05 GiB    1.05 GiB   16.82 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:93(partition_and_pack)
     102  536.61 MiB  536.61 MiB    6.71 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:132(split_and_pack)
     324   89.47 MiB   89.47 MiB    6.66 GiB  /raid/rzamora/rapids_2602/rapidsmpf/cpp/src/integrations/cudf/partition.cpp:163(unpack_and_concat)

Limitation:
  - A scope only tracks allocations made by the thread that entered it.

@nirandaperera
Contributor

@rjzamora It's best to rebase this onto main ASAP; otherwise it could turn into merge hell later on.

@rjzamora rjzamora marked this pull request as ready for review December 9, 2025 17:49
@rjzamora rjzamora requested review from a team as code owners December 9, 2025 17:49
@rjzamora rjzamora changed the title from "[WIP] TPCH-derived Q18" to "TPCH-derived Q18" Dec 9, 2025
ctx, right_msg.release<rapidsmpf::streaming::TableChunk>()
);

auto stream = left_chunk.stream();
Contributor

My guess is that we need to do something to ensure that right_table is valid on stream before performing the cudf::hash_join on stream. We do have rapidsmpf::cuda_stream_join, which might work.

(as an aside, I think that this is one of the issues compute-sanitizer is supposed to help with).

Member Author

Okay yeah, I added cuda_stream_join, but I'll probably need to use compute-sanitizer or something to check if the sequence of operations is "correct"
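For context, the generic pattern being discussed looks roughly like this (a hedged sketch using raw CUDA events; `join_streams` is a hypothetical name and not rapidsmpf::cuda_stream_join's actual implementation):

```cpp
#include <cuda_runtime_api.h>
#include <rmm/cuda_stream_view.hpp>

// Make `consumer` wait for all work already enqueued on `producer`, so that
// e.g. a hash_join launched on left_chunk's stream doesn't read right_table
// before the stream that produced it has finished materializing it.
inline void join_streams(rmm::cuda_stream_view consumer, rmm::cuda_stream_view producer)
{
    cudaEvent_t ev{};
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
    cudaEventRecord(ev, producer.value());
    cudaStreamWaitEvent(consumer.value(), ev, 0);
    cudaEventDestroy(ev);  // safe: the enqueued wait keeps the event alive until it fires
}
```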

Comment on lines +629 to +631
auto match = joiner.semi_join(
table.select({key_column_idx}), chunk_stream, ctx->br()->device_mr()
);
Contributor

Just an observation: this differs from Q4. In Q4 we have a relatively small left / probe table and a relatively large right / build table.

@TomAugspurger
Contributor

We'll need to sync this (and the other by-hand implementations) with #731, which changes how BufferResources are created.

for (auto&& r : results) {
std::ranges::move(r.results, std::back_inserter(cols));
}
local_result = std::make_unique<cudf::table>(std::move(cols));
Contributor

should we release table at this point?

Member Author

I tried to do this by moving the chunk in a local context after this. Let me know if there is a better way.

Comment on lines +938 to +947
auto sorted = cudf::sort_by_key(
global_result->view(),
global_result->view().select({4, 3}), // o_totalprice DESC, o_orderdate ASC
{cudf::order::DESCENDING, cudf::order::ASCENDING},
{cudf::null_order::AFTER, cudf::null_order::AFTER},
stream,
ctx->br()->device_mr()
);
cudf::size_type limit = std::min(100, sorted->num_rows());
auto limited = cudf::slice(sorted->view(), {0, limit})[0];
Contributor

Shouldn't we use cudf::top_k here? It would be more efficient, I think.
Do we need the top 100 to be in order?

Member Author

I couldn't figure out a clean way to use cudf::top_k_order when we are sorting on multiple columns. We may be able to tackle it in two phases, but I'm pretty sure we are dealing with a relatively small number of rows at this point anyway (even at sf3000). Perhaps if further profiling suggests I'm wrong about this, we can revisit?

@TomAugspurger
Contributor

TomAugspurger commented Jan 4, 2026

@rjzamora rjzamora#1 has the changes to use some of the streaming groupby / join utilities that we have on main now. Mind if I push that here?


Labels

improvement Improves an existing functionality non-breaking Introduces a non-breaking change
