Skip to content

Question About Bandwidth Calculation in Benchmarks #91

@zouzias

Description

@zouzias

First of all, thank you so much for open-sourcing the Parlay toolkit and for all the hard work you’ve put into making it so accessible and easy to work with.

I have a concrete question regarding the scan primitive and the definition of bandwidth in your benchmark suite. In particular, for scan, the bandwidth (BW) for input type T and length n is defined as

BW := n * (3 * sizeof(T) + 0.7 * sizeof(T))

see macro REPORT_STATS and bench_scan_add.

I would expect the bandwidth to be defined as BW := n * (sizeof(T) + sizeof(T)). Do you also measure the intermediate up/down sweep scan phases in your bandwidth calculation?

The concern with the above bandwidth calculation is that it could exceed the peak memory bandwidth of the device, see below for an AMD EPYC 9654 96-Core Processor processor example: bench_scan_add<long>/5000000 achieves 540.137 GB/s and peak bandwidth is 460.8 GB/s)

Steps to reproduce

Patch the file benchmark/bench_standard.cpp with

+BENCH(scan_add, long, 10000/PSIZE_FACTOR);
+BENCH(scan_add, long, 100000/PSIZE_FACTOR);
+BENCH(scan_add, long, 500000/PSIZE_FACTOR);
+BENCH(scan_add, long, 1000000/PSIZE_FACTOR);
+BENCH(scan_add, long, 5000000/PSIZE_FACTOR);
+BENCH(scan_add, long, 10000000/PSIZE_FACTOR);
+BENCH(scan_add, long, 50000000/PSIZE_FACTOR);

To compile,

mkdir -p build/Release && cd build/Release
cmake -DCMAKE_BUILD_TYPE=Release -DPARLAY_BENCHMARK=On ../..

To run (jemalloc is really important here),

LD_PRELOAD=/lib/x86_64-linux-gnu/libjemalloc.so numactl -i all ./benchmark/bench_standard   --benchmark_repetitions=5  --benchmark_report_aggregates_only=true | tee parlay-bench-results.log

Environment used

Ubuntu 22.04.5 LTS
gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04.2) 

Output of lscpu

Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           52 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  192
On-line CPU(s) list:                     0-191
Vendor ID:                               AuthenticAMD
Model name:                              AMD EPYC 9654 96-Core Processor
CPU family:                              25
Model:                                   17
Thread(s) per core:                      2
Core(s) per socket:                      96
Socket(s):                               1
Stepping:                                1
...

Memory bandwidth of AMD EPYC 9654 96-Core Processor is 460.8 GB/s : https://www.techpowerup.com/cpu-specs/epyc-9654.c2933

Benchmark Logs

Then, type cat parlay-bench-results.log |grep scan |grep mean to get

bench_scan_add<long>/10000/real_time_mean                    0.034 ms        0.035 ms            5        Bandwidth=8.61179G/s        Bytes/sec=2.32751G/s     Elements/sec=290.939M/s
bench_scan_add<long>/100000/real_time_mean                   0.042 ms        0.042 ms            5        Bandwidth=70.9254G/s        Bytes/sec=19.169G/s     Elements/sec=2.39613G/s
bench_scan_add<long>/500000/real_time_mean                   0.055 ms        0.056 ms            5        Bandwidth=267.007G/s        Bytes/sec=72.1641G/s     Elements/sec=9.02051G/s
bench_scan_add<long>/1000000/real_time_mean                  0.081 ms        0.081 ms            5        Bandwidth=366.014G/s        Bytes/sec=98.9226G/s     Elements/sec=12.3653G/s
bench_scan_add<long>/5000000/real_time_mean                  0.274 ms        0.274 ms            5        Bandwidth=540.137G/s        Bytes/sec=145.983G/s     Elements/sec=18.2479G/s
bench_scan_add<long>/10000000/real_time_mean                 0.943 ms        0.943 ms            5        Bandwidth=313.869G/s        Bytes/sec=84.8295G/s     Elements/sec=10.6037G/s
bench_scan_add<long>/50000000/real_time_mean                  7.53 ms         7.34 ms            5        Bandwidth=196.768G/s        Bytes/sec=53.1806G/s     Elements/sec=6.64758G/s
bench_scan_add<long>/100000000/real_time_mean                 14.5 ms         14.4 ms            5        Bandwidth=204.6G/s        Bytes/sec=55.2972G/s     Elements/sec=6.91215G/s

Extended logs

------------------------------------------------------------------------------------------------------------------
Benchmark                                                        Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------
bench_map<long>/100000000/real_time_mean                      11.5 ms         11.4 ms            5        Bandwidth=189.002G/s        Bytes/sec=70.0006G/s     Elements/sec=8.75007G/s
bench_map<long>/100000000/real_time_median                    11.7 ms         11.7 ms            5        Bandwidth=184.669G/s        Bytes/sec=68.3959G/s     Elements/sec=8.54948G/s
bench_map<long>/100000000/real_time_stddev                   0.541 ms        0.544 ms            5        Bandwidth=9.53313G/s        Bytes/sec=3.53079G/s     Elements/sec=441.348M/s
bench_map<long>/100000000/real_time_cv                        4.72 %          4.76 %             5        Bandwidth=5.04%        Bytes/sec=5.04%     Elements/sec=5.04%
bench_tabulate<long>/100000000/real_time_mean                 9.93 ms         9.70 ms            5        Bandwidth=137.004G/s        Bytes/sec=80.5907G/s     Elements/sec=10.0738G/s
bench_tabulate<long>/100000000/real_time_median               9.93 ms         9.89 ms            5        Bandwidth=136.904G/s        Bytes/sec=80.5319G/s     Elements/sec=10.0665G/s
bench_tabulate<long>/100000000/real_time_stddev              0.046 ms        0.488 ms            5        Bandwidth=639.364M/s        Bytes/sec=376.096M/s     Elements/sec=47.012M/s
bench_tabulate<long>/100000000/real_time_cv                   0.47 %          5.03 %             5        Bandwidth=0.47%        Bytes/sec=0.47%     Elements/sec=0.47%
bench_reduce_add<long>/100000000/real_time_mean               3.26 ms         3.26 ms            5        Bandwidth=245.301G/s        Bytes/sec=245.301G/s     Elements/sec=30.6627G/s
bench_reduce_add<long>/100000000/real_time_median             3.26 ms         3.26 ms            5        Bandwidth=245.431G/s        Bytes/sec=245.431G/s     Elements/sec=30.6789G/s
bench_reduce_add<long>/100000000/real_time_stddev            0.008 ms        0.011 ms            5        Bandwidth=606.313M/s        Bytes/sec=606.313M/s     Elements/sec=75.7891M/s
bench_reduce_add<long>/100000000/real_time_cv                 0.25 %          0.35 %             5        Bandwidth=0.25%        Bytes/sec=0.25%     Elements/sec=0.25%
bench_scan_add<long>/10000/real_time_mean                    0.034 ms        0.035 ms            5        Bandwidth=8.61179G/s        Bytes/sec=2.32751G/s     Elements/sec=290.939M/s
bench_scan_add<long>/10000/real_time_median                  0.034 ms        0.035 ms            5        Bandwidth=8.61056G/s        Bytes/sec=2.32718G/s     Elements/sec=290.897M/s
bench_scan_add<long>/10000/real_time_stddev                  0.000 ms        0.000 ms            5        Bandwidth=8.68314M/s        Bytes/sec=2.34679M/s     Elements/sec=293.349k/s
bench_scan_add<long>/10000/real_time_cv                       0.10 %          0.19 %             5        Bandwidth=0.10%        Bytes/sec=0.10%     Elements/sec=0.10%
bench_scan_add<long>/100000/real_time_mean                   0.042 ms        0.042 ms            5        Bandwidth=70.9254G/s        Bytes/sec=19.169G/s     Elements/sec=2.39613G/s
bench_scan_add<long>/100000/real_time_median                 0.042 ms        0.042 ms            5        Bandwidth=70.955G/s        Bytes/sec=19.177G/s     Elements/sec=2.39713G/s
bench_scan_add<long>/100000/real_time_stddev                 0.000 ms        0.000 ms            5        Bandwidth=167.211M/s        Bytes/sec=45.1922M/s     Elements/sec=5.64903M/s
bench_scan_add<long>/100000/real_time_cv                      0.24 %          0.16 %             5        Bandwidth=0.24%        Bytes/sec=0.24%     Elements/sec=0.24%
bench_scan_add<long>/500000/real_time_mean                   0.055 ms        0.056 ms            5        Bandwidth=267.007G/s        Bytes/sec=72.1641G/s     Elements/sec=9.02051G/s
bench_scan_add<long>/500000/real_time_median                 0.055 ms        0.056 ms            5        Bandwidth=267.19G/s        Bytes/sec=72.2136G/s     Elements/sec=9.0267G/s
bench_scan_add<long>/500000/real_time_stddev                 0.000 ms        0.000 ms            5        Bandwidth=542.294M/s        Bytes/sec=146.566M/s     Elements/sec=18.3208M/s
bench_scan_add<long>/500000/real_time_cv                      0.20 %          0.15 %             5        Bandwidth=0.20%        Bytes/sec=0.20%     Elements/sec=0.20%
bench_scan_add<long>/1000000/real_time_mean                  0.081 ms        0.081 ms            5        Bandwidth=366.014G/s        Bytes/sec=98.9226G/s     Elements/sec=12.3653G/s
bench_scan_add<long>/1000000/real_time_median                0.081 ms        0.081 ms            5        Bandwidth=366.994G/s        Bytes/sec=99.1875G/s     Elements/sec=12.3984G/s
bench_scan_add<long>/1000000/real_time_stddev                0.001 ms        0.001 ms            5        Bandwidth=2.82184G/s        Bytes/sec=762.659M/s     Elements/sec=95.3324M/s
bench_scan_add<long>/1000000/real_time_cv                     0.78 %          0.75 %             5        Bandwidth=0.77%        Bytes/sec=0.77%     Elements/sec=0.77%
bench_scan_add<long>/5000000/real_time_mean                  0.274 ms        0.274 ms            5        Bandwidth=540.137G/s        Bytes/sec=145.983G/s     Elements/sec=18.2479G/s
bench_scan_add<long>/5000000/real_time_median                0.274 ms        0.274 ms            5        Bandwidth=540.408G/s        Bytes/sec=146.056G/s     Elements/sec=18.257G/s
bench_scan_add<long>/5000000/real_time_stddev                0.001 ms        0.001 ms            5        Bandwidth=1.64528G/s        Bytes/sec=444.671M/s     Elements/sec=55.5839M/s
bench_scan_add<long>/5000000/real_time_cv                     0.31 %          0.25 %             5        Bandwidth=0.30%        Bytes/sec=0.30%     Elements/sec=0.30%
bench_scan_add<long>/10000000/real_time_mean                 0.943 ms        0.943 ms            5        Bandwidth=313.869G/s        Bytes/sec=84.8295G/s     Elements/sec=10.6037G/s
bench_scan_add<long>/10000000/real_time_median               0.944 ms        0.944 ms            5        Bandwidth=313.723G/s        Bytes/sec=84.7899G/s     Elements/sec=10.5987G/s
bench_scan_add<long>/10000000/real_time_stddev               0.003 ms        0.002 ms            5        Bandwidth=861.402M/s        Bytes/sec=232.811M/s     Elements/sec=29.1014M/s
bench_scan_add<long>/10000000/real_time_cv                    0.27 %          0.25 %             5        Bandwidth=0.27%        Bytes/sec=0.27%     Elements/sec=0.27%
bench_scan_add<long>/50000000/real_time_mean                  7.53 ms         7.34 ms            5        Bandwidth=196.768G/s        Bytes/sec=53.1806G/s     Elements/sec=6.64758G/s
bench_scan_add<long>/50000000/real_time_median                7.59 ms         7.32 ms            5        Bandwidth=194.903G/s        Bytes/sec=52.6765G/s     Elements/sec=6.58456G/s
bench_scan_add<long>/50000000/real_time_stddev               0.203 ms        0.252 ms            5        Bandwidth=5.48949G/s        Bytes/sec=1.48364G/s     Elements/sec=185.456M/s
bench_scan_add<long>/50000000/real_time_cv                    2.69 %          3.44 %             5        Bandwidth=2.79%        Bytes/sec=2.79%     Elements/sec=2.79%
bench_scan_add<long>/100000000/real_time_mean                 14.5 ms         14.4 ms            5        Bandwidth=204.6G/s        Bytes/sec=55.2972G/s     Elements/sec=6.91215G/s
bench_scan_add<long>/100000000/real_time_median               14.5 ms         14.4 ms            5        Bandwidth=204.573G/s        Bytes/sec=55.29G/s     Elements/sec=6.91126G/s
bench_scan_add<long>/100000000/real_time_stddev              0.012 ms        0.081 ms            5        Bandwidth=168.984M/s        Bytes/sec=45.6714M/s     Elements/sec=5.70892M/s
bench_scan_add<long>/100000000/real_time_cv                   0.08 %          0.56 %             5        Bandwidth=0.08%        Bytes/sec=0.08%     Elements/sec=0.08%
bench_sort<__int128>/100000000/real_time_mean                  121 ms          121 ms            5        Bandwidth=0/s        Bytes/sec=13.1943G/s     Elements/sec=824.644M/s
bench_sort<__int128>/100000000/real_time_median                120 ms          120 ms            5        Bandwidth=0/s        Bytes/sec=13.3126G/s     Elements/sec=832.036M/s
bench_sort<__int128>/100000000/real_time_stddev               2.46 ms         2.50 ms            5        Bandwidth=0/s        Bytes/sec=260.569M/s     Elements/sec=16.2855M/s
bench_sort<__int128>/100000000/real_time_cv                   2.03 %          2.07 %             5        Bandwidth=-nan%        Bytes/sec=1.97%     Elements/sec=1.97%
bench_sort_inplace<__int128>/100000000/real_time_mean          126 ms          126 ms            5        Bandwidth=0/s        Bytes/sec=12.6869G/s     Elements/sec=792.931M/s
bench_sort_inplace<__int128>/100000000/real_time_median        126 ms          126 ms            5        Bandwidth=0/s        Bytes/sec=12.6959G/s     Elements/sec=793.492M/s
bench_sort_inplace<__int128>/100000000/real_time_stddev      0.538 ms        0.498 ms            5        Bandwidth=0/s        Bytes/sec=53.9039M/s     Elements/sec=3.36899M/s
bench_sort_inplace<__int128>/100000000/real_time_cv           0.43 %          0.39 %             5        Bandwidth=-nan%        Bytes/sec=0.42%     Elements/sec=0.42%
bench_integer_sort<__int128>/100000000/real_time_mean          163 ms          163 ms            5        Bandwidth=0/s        Bytes/sec=10.2283G/s     Elements/sec=639.267M/s
bench_integer_sort<__int128>/100000000/real_time_median        144 ms          144 ms            5        Bandwidth=0/s        Bytes/sec=11.0875G/s     Elements/sec=692.966M/s
bench_integer_sort<__int128>/100000000/real_time_stddev       41.3 ms         41.1 ms            5        Bandwidth=0/s        Bytes/sec=1.93574G/s     Elements/sec=120.984M/s
bench_integer_sort<__int128>/100000000/real_time_cv          25.37 %         25.28 %             5        Bandwidth=-nan%        Bytes/sec=18.93%     Elements/sec=18.93%

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions