First of all, thank you so much for open-sourcing the Parlay toolkit and for all the hard work you’ve put into making it so accessible and easy to work with.
I have a concrete question regarding the scan primitive and the definition of bandwidth in your benchmark suite. In particular, for scan, the bandwidth (BW) for input type T and length n is defined as
BW := n * (3 * sizeof(T) + 0.7 * sizeof(T))
see macro REPORT_STATS and bench_scan_add.
I would expect the bandwidth to be defined as BW := n * (sizeof(T) + sizeof(T)). Do you also measure the intermediate up/down sweep scan phases in your bandwidth calculation?
The concern with the above bandwidth calculation is that it could exceed the peak memory bandwidth of the device, see below for an AMD EPYC 9654 96-Core Processor processor example: bench_scan_add<long>/5000000 achieves 540.137 GB/s and peak bandwidth is 460.8 GB/s)
Steps to reproduce
Patch the file benchmark/bench_standard.cpp with
+BENCH(scan_add, long, 10000/PSIZE_FACTOR);
+BENCH(scan_add, long, 100000/PSIZE_FACTOR);
+BENCH(scan_add, long, 500000/PSIZE_FACTOR);
+BENCH(scan_add, long, 1000000/PSIZE_FACTOR);
+BENCH(scan_add, long, 5000000/PSIZE_FACTOR);
+BENCH(scan_add, long, 10000000/PSIZE_FACTOR);
+BENCH(scan_add, long, 50000000/PSIZE_FACTOR);
To compile,
mkdir -p build/Release && cd build/Release
cmake -DCMAKE_BUILD_TYPE=Release -DPARLAY_BENCHMARK=On ../..
To run (jemalloc is really important here),
LD_PRELOAD=/lib/x86_64-linux-gnu/libjemalloc.so numactl -i all ./benchmark/bench_standard --benchmark_repetitions=5 --benchmark_report_aggregates_only=true | tee parlay-bench-results.log
Environment used
Ubuntu 22.04.5 LTS
gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04.2)
Output of lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9654 96-Core Processor
CPU family: 25
Model: 17
Thread(s) per core: 2
Core(s) per socket: 96
Socket(s): 1
Stepping: 1
...
Memory bandwidth of AMD EPYC 9654 96-Core Processor is 460.8 GB/s : https://www.techpowerup.com/cpu-specs/epyc-9654.c2933
Benchmark Logs
Then, type cat parlay-bench-results.log |grep scan |grep mean to get
bench_scan_add<long>/10000/real_time_mean 0.034 ms 0.035 ms 5 Bandwidth=8.61179G/s Bytes/sec=2.32751G/s Elements/sec=290.939M/s
bench_scan_add<long>/100000/real_time_mean 0.042 ms 0.042 ms 5 Bandwidth=70.9254G/s Bytes/sec=19.169G/s Elements/sec=2.39613G/s
bench_scan_add<long>/500000/real_time_mean 0.055 ms 0.056 ms 5 Bandwidth=267.007G/s Bytes/sec=72.1641G/s Elements/sec=9.02051G/s
bench_scan_add<long>/1000000/real_time_mean 0.081 ms 0.081 ms 5 Bandwidth=366.014G/s Bytes/sec=98.9226G/s Elements/sec=12.3653G/s
bench_scan_add<long>/5000000/real_time_mean 0.274 ms 0.274 ms 5 Bandwidth=540.137G/s Bytes/sec=145.983G/s Elements/sec=18.2479G/s
bench_scan_add<long>/10000000/real_time_mean 0.943 ms 0.943 ms 5 Bandwidth=313.869G/s Bytes/sec=84.8295G/s Elements/sec=10.6037G/s
bench_scan_add<long>/50000000/real_time_mean 7.53 ms 7.34 ms 5 Bandwidth=196.768G/s Bytes/sec=53.1806G/s Elements/sec=6.64758G/s
bench_scan_add<long>/100000000/real_time_mean 14.5 ms 14.4 ms 5 Bandwidth=204.6G/s Bytes/sec=55.2972G/s Elements/sec=6.91215G/s
Extended logs
------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------
bench_map<long>/100000000/real_time_mean 11.5 ms 11.4 ms 5 Bandwidth=189.002G/s Bytes/sec=70.0006G/s Elements/sec=8.75007G/s
bench_map<long>/100000000/real_time_median 11.7 ms 11.7 ms 5 Bandwidth=184.669G/s Bytes/sec=68.3959G/s Elements/sec=8.54948G/s
bench_map<long>/100000000/real_time_stddev 0.541 ms 0.544 ms 5 Bandwidth=9.53313G/s Bytes/sec=3.53079G/s Elements/sec=441.348M/s
bench_map<long>/100000000/real_time_cv 4.72 % 4.76 % 5 Bandwidth=5.04% Bytes/sec=5.04% Elements/sec=5.04%
bench_tabulate<long>/100000000/real_time_mean 9.93 ms 9.70 ms 5 Bandwidth=137.004G/s Bytes/sec=80.5907G/s Elements/sec=10.0738G/s
bench_tabulate<long>/100000000/real_time_median 9.93 ms 9.89 ms 5 Bandwidth=136.904G/s Bytes/sec=80.5319G/s Elements/sec=10.0665G/s
bench_tabulate<long>/100000000/real_time_stddev 0.046 ms 0.488 ms 5 Bandwidth=639.364M/s Bytes/sec=376.096M/s Elements/sec=47.012M/s
bench_tabulate<long>/100000000/real_time_cv 0.47 % 5.03 % 5 Bandwidth=0.47% Bytes/sec=0.47% Elements/sec=0.47%
bench_reduce_add<long>/100000000/real_time_mean 3.26 ms 3.26 ms 5 Bandwidth=245.301G/s Bytes/sec=245.301G/s Elements/sec=30.6627G/s
bench_reduce_add<long>/100000000/real_time_median 3.26 ms 3.26 ms 5 Bandwidth=245.431G/s Bytes/sec=245.431G/s Elements/sec=30.6789G/s
bench_reduce_add<long>/100000000/real_time_stddev 0.008 ms 0.011 ms 5 Bandwidth=606.313M/s Bytes/sec=606.313M/s Elements/sec=75.7891M/s
bench_reduce_add<long>/100000000/real_time_cv 0.25 % 0.35 % 5 Bandwidth=0.25% Bytes/sec=0.25% Elements/sec=0.25%
bench_scan_add<long>/10000/real_time_mean 0.034 ms 0.035 ms 5 Bandwidth=8.61179G/s Bytes/sec=2.32751G/s Elements/sec=290.939M/s
bench_scan_add<long>/10000/real_time_median 0.034 ms 0.035 ms 5 Bandwidth=8.61056G/s Bytes/sec=2.32718G/s Elements/sec=290.897M/s
bench_scan_add<long>/10000/real_time_stddev 0.000 ms 0.000 ms 5 Bandwidth=8.68314M/s Bytes/sec=2.34679M/s Elements/sec=293.349k/s
bench_scan_add<long>/10000/real_time_cv 0.10 % 0.19 % 5 Bandwidth=0.10% Bytes/sec=0.10% Elements/sec=0.10%
bench_scan_add<long>/100000/real_time_mean 0.042 ms 0.042 ms 5 Bandwidth=70.9254G/s Bytes/sec=19.169G/s Elements/sec=2.39613G/s
bench_scan_add<long>/100000/real_time_median 0.042 ms 0.042 ms 5 Bandwidth=70.955G/s Bytes/sec=19.177G/s Elements/sec=2.39713G/s
bench_scan_add<long>/100000/real_time_stddev 0.000 ms 0.000 ms 5 Bandwidth=167.211M/s Bytes/sec=45.1922M/s Elements/sec=5.64903M/s
bench_scan_add<long>/100000/real_time_cv 0.24 % 0.16 % 5 Bandwidth=0.24% Bytes/sec=0.24% Elements/sec=0.24%
bench_scan_add<long>/500000/real_time_mean 0.055 ms 0.056 ms 5 Bandwidth=267.007G/s Bytes/sec=72.1641G/s Elements/sec=9.02051G/s
bench_scan_add<long>/500000/real_time_median 0.055 ms 0.056 ms 5 Bandwidth=267.19G/s Bytes/sec=72.2136G/s Elements/sec=9.0267G/s
bench_scan_add<long>/500000/real_time_stddev 0.000 ms 0.000 ms 5 Bandwidth=542.294M/s Bytes/sec=146.566M/s Elements/sec=18.3208M/s
bench_scan_add<long>/500000/real_time_cv 0.20 % 0.15 % 5 Bandwidth=0.20% Bytes/sec=0.20% Elements/sec=0.20%
bench_scan_add<long>/1000000/real_time_mean 0.081 ms 0.081 ms 5 Bandwidth=366.014G/s Bytes/sec=98.9226G/s Elements/sec=12.3653G/s
bench_scan_add<long>/1000000/real_time_median 0.081 ms 0.081 ms 5 Bandwidth=366.994G/s Bytes/sec=99.1875G/s Elements/sec=12.3984G/s
bench_scan_add<long>/1000000/real_time_stddev 0.001 ms 0.001 ms 5 Bandwidth=2.82184G/s Bytes/sec=762.659M/s Elements/sec=95.3324M/s
bench_scan_add<long>/1000000/real_time_cv 0.78 % 0.75 % 5 Bandwidth=0.77% Bytes/sec=0.77% Elements/sec=0.77%
bench_scan_add<long>/5000000/real_time_mean 0.274 ms 0.274 ms 5 Bandwidth=540.137G/s Bytes/sec=145.983G/s Elements/sec=18.2479G/s
bench_scan_add<long>/5000000/real_time_median 0.274 ms 0.274 ms 5 Bandwidth=540.408G/s Bytes/sec=146.056G/s Elements/sec=18.257G/s
bench_scan_add<long>/5000000/real_time_stddev 0.001 ms 0.001 ms 5 Bandwidth=1.64528G/s Bytes/sec=444.671M/s Elements/sec=55.5839M/s
bench_scan_add<long>/5000000/real_time_cv 0.31 % 0.25 % 5 Bandwidth=0.30% Bytes/sec=0.30% Elements/sec=0.30%
bench_scan_add<long>/10000000/real_time_mean 0.943 ms 0.943 ms 5 Bandwidth=313.869G/s Bytes/sec=84.8295G/s Elements/sec=10.6037G/s
bench_scan_add<long>/10000000/real_time_median 0.944 ms 0.944 ms 5 Bandwidth=313.723G/s Bytes/sec=84.7899G/s Elements/sec=10.5987G/s
bench_scan_add<long>/10000000/real_time_stddev 0.003 ms 0.002 ms 5 Bandwidth=861.402M/s Bytes/sec=232.811M/s Elements/sec=29.1014M/s
bench_scan_add<long>/10000000/real_time_cv 0.27 % 0.25 % 5 Bandwidth=0.27% Bytes/sec=0.27% Elements/sec=0.27%
bench_scan_add<long>/50000000/real_time_mean 7.53 ms 7.34 ms 5 Bandwidth=196.768G/s Bytes/sec=53.1806G/s Elements/sec=6.64758G/s
bench_scan_add<long>/50000000/real_time_median 7.59 ms 7.32 ms 5 Bandwidth=194.903G/s Bytes/sec=52.6765G/s Elements/sec=6.58456G/s
bench_scan_add<long>/50000000/real_time_stddev 0.203 ms 0.252 ms 5 Bandwidth=5.48949G/s Bytes/sec=1.48364G/s Elements/sec=185.456M/s
bench_scan_add<long>/50000000/real_time_cv 2.69 % 3.44 % 5 Bandwidth=2.79% Bytes/sec=2.79% Elements/sec=2.79%
bench_scan_add<long>/100000000/real_time_mean 14.5 ms 14.4 ms 5 Bandwidth=204.6G/s Bytes/sec=55.2972G/s Elements/sec=6.91215G/s
bench_scan_add<long>/100000000/real_time_median 14.5 ms 14.4 ms 5 Bandwidth=204.573G/s Bytes/sec=55.29G/s Elements/sec=6.91126G/s
bench_scan_add<long>/100000000/real_time_stddev 0.012 ms 0.081 ms 5 Bandwidth=168.984M/s Bytes/sec=45.6714M/s Elements/sec=5.70892M/s
bench_scan_add<long>/100000000/real_time_cv 0.08 % 0.56 % 5 Bandwidth=0.08% Bytes/sec=0.08% Elements/sec=0.08%
bench_sort<__int128>/100000000/real_time_mean 121 ms 121 ms 5 Bandwidth=0/s Bytes/sec=13.1943G/s Elements/sec=824.644M/s
bench_sort<__int128>/100000000/real_time_median 120 ms 120 ms 5 Bandwidth=0/s Bytes/sec=13.3126G/s Elements/sec=832.036M/s
bench_sort<__int128>/100000000/real_time_stddev 2.46 ms 2.50 ms 5 Bandwidth=0/s Bytes/sec=260.569M/s Elements/sec=16.2855M/s
bench_sort<__int128>/100000000/real_time_cv 2.03 % 2.07 % 5 Bandwidth=-nan% Bytes/sec=1.97% Elements/sec=1.97%
bench_sort_inplace<__int128>/100000000/real_time_mean 126 ms 126 ms 5 Bandwidth=0/s Bytes/sec=12.6869G/s Elements/sec=792.931M/s
bench_sort_inplace<__int128>/100000000/real_time_median 126 ms 126 ms 5 Bandwidth=0/s Bytes/sec=12.6959G/s Elements/sec=793.492M/s
bench_sort_inplace<__int128>/100000000/real_time_stddev 0.538 ms 0.498 ms 5 Bandwidth=0/s Bytes/sec=53.9039M/s Elements/sec=3.36899M/s
bench_sort_inplace<__int128>/100000000/real_time_cv 0.43 % 0.39 % 5 Bandwidth=-nan% Bytes/sec=0.42% Elements/sec=0.42%
bench_integer_sort<__int128>/100000000/real_time_mean 163 ms 163 ms 5 Bandwidth=0/s Bytes/sec=10.2283G/s Elements/sec=639.267M/s
bench_integer_sort<__int128>/100000000/real_time_median 144 ms 144 ms 5 Bandwidth=0/s Bytes/sec=11.0875G/s Elements/sec=692.966M/s
bench_integer_sort<__int128>/100000000/real_time_stddev 41.3 ms 41.1 ms 5 Bandwidth=0/s Bytes/sec=1.93574G/s Elements/sec=120.984M/s
bench_integer_sort<__int128>/100000000/real_time_cv 25.37 % 25.28 % 5 Bandwidth=-nan% Bytes/sec=18.93% Elements/sec=18.93%
First of all, thank you so much for open-sourcing the Parlay toolkit and for all the hard work you’ve put into making it so accessible and easy to work with.
I have a concrete question regarding the
scanprimitive and the definition ofbandwidthin your benchmark suite. In particular, forscan, the bandwidth (BW) for input typeTand lengthnis defined assee macro REPORT_STATS and bench_scan_add.
I would expect the bandwidth to be defined as
BW := n * (sizeof(T) + sizeof(T)). Do you also measure the intermediate up/down sweep scan phases in your bandwidth calculation?The concern with the above bandwidth calculation is that it could exceed the peak memory bandwidth of the device, see below for an
AMD EPYC 9654 96-Core Processorprocessor example:bench_scan_add<long>/5000000achieves540.137 GB/sand peak bandwidth is460.8 GB/s)Steps to reproduce
Patch the file
benchmark/bench_standard.cppwithTo compile,
To run (jemalloc is really important here),
LD_PRELOAD=/lib/x86_64-linux-gnu/libjemalloc.so numactl -i all ./benchmark/bench_standard --benchmark_repetitions=5 --benchmark_report_aggregates_only=true | tee parlay-bench-results.logEnvironment used
Output of
lscpuMemory bandwidth of AMD EPYC 9654 96-Core Processor is
460.8 GB/s: https://www.techpowerup.com/cpu-specs/epyc-9654.c2933Benchmark Logs
Then, type
cat parlay-bench-results.log |grep scan |grep meanto getExtended logs