AcceleratedKernels.sum is twice as slow as CUDA.jl's mapreduce implementation #69

@VarLad

Description

julia> using CUDA, AcceleratedKernels, BenchmarkTools

julia> X_cpu = rand(Float32, 10000, 10000);

julia> X_cuda = CuArray(X_cpu);

julia> @benchmark CUDA.synchronize(sum($X_cuda, dims=1))

BenchmarkTools.Trial: 2173 samples with 1 evaluation per sample.
 Range (min … max):  2.115 ms …  2.532 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.277 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.283 ms ± 68.805 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                ▂▃▆▃▇▅▇▅█▇▅▅▄▄▃▃▂▂▃▅▃▅▂▄▃▄▂▁▄▁                
  ▁▁▂▁▂▂▃▃▃▄▄▅▇███████████████████████████████▇▇▅▇▇▄▆▅▄▃▃▂▂▂ ▅
  2.12 ms        Histogram: frequency by time        2.44 ms <

 Memory estimate: 4.81 KiB, allocs estimate: 286.

julia> @benchmark CUDA.synchronize(AcceleratedKernels.sum($X_cuda, dims=1))

BenchmarkTools.Trial: 953 samples with 1 evaluation per sample.
 Range (min … max):  5.084 ms …  5.461 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.220 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.229 ms ± 64.660 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                ▁▂▂▄▄▆▃█▅▃▆▅▅▂  ▃ ▁       ▁ ▁                 
  ▂▃▃▃▂▃▃▄▃▃▄▆▅▅█████████████████▇█▇▇█▄▆▆████▇▇▅▆▅▄▁▃▂▂▂▂▂▂▂ ▄
  5.08 ms        Histogram: frequency by time        5.39 ms <

 Memory estimate: 5.50 KiB, allocs estimate: 306.
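For scale, the array holds 10^8 Float32 values (400 MB), so CUDA.jl's ~2.28 ms median works out to roughly 175 GB/s of effective read bandwidth, while AcceleratedKernels' ~5.22 ms median is roughly 77 GB/s, a ~2.3x gap.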

For a fair comparison, I'm benchmarking the shared-memory (shmem) version of CUDA.jl's reduction; the warp-shuffle version is slightly faster. The CUDA.jl implementation is here.

I set `shuffle` to `false` here so this uses the shmem-based implementation, without warp shuffles.
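For anyone reproducing this, a minimal standalone comparison might look like the sketch below (using `CUDA.@sync` to block on the GPU inside `@benchmark`, as recommended by CUDA.jl's benchmarking docs):

```julia
using CUDA, AcceleratedKernels, BenchmarkTools

# Same shape and element type as the benchmark above
X_cuda = CUDA.rand(Float32, 10_000, 10_000)

# CUDA.jl's built-in column sum, which lowers to its mapreduce kernel
@benchmark CUDA.@sync sum($X_cuda, dims=1)

# AcceleratedKernels' column sum over the same data
@benchmark CUDA.@sync AcceleratedKernels.sum($X_cuda, dims=1)
```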
