AcceleratedKernels.sum is twice as slow as CUDA.jl's mapreduce implementation #69

@VarLad

Description

julia> using CUDA, AcceleratedKernels, BenchmarkTools

julia> X_cpu = rand(Float32, 10000, 10000);

julia> X_cuda = CuArray(X_cpu);

julia> @benchmark CUDA.synchronize(sum($X_cuda, dims=1))

BenchmarkTools.Trial: 2173 samples with 1 evaluation per sample.
 Range (min … max):  2.115 ms …  2.532 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.277 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.283 ms ± 68.805 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                ▂▃▆▃▇▅▇▅█▇▅▅▄▄▃▃▂▂▃▅▃▅▂▄▃▄▂▁▄▁                
  ▁▁▂▁▂▂▃▃▃▄▄▅▇███████████████████████████████▇▇▅▇▇▄▆▅▄▃▃▂▂▂ ▅
  2.12 ms        Histogram: frequency by time        2.44 ms <

 Memory estimate: 4.81 KiB, allocs estimate: 286.

julia> @benchmark CUDA.synchronize(AcceleratedKernels.sum($X_cuda, dims=1))

BenchmarkTools.Trial: 953 samples with 1 evaluation per sample.
 Range (min … max):  5.084 ms …  5.461 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.220 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.229 ms ± 64.660 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                ▁▂▂▄▄▆▃█▅▃▆▅▅▂  ▃ ▁       ▁ ▁                 
  ▂▃▃▃▂▃▃▄▃▃▄▆▅▅█████████████████▇█▇▇█▄▆▆████▇▇▅▆▅▄▁▃▂▂▂▂▂▂▂ ▄
  5.08 ms        Histogram: frequency by time        5.39 ms <

 Memory estimate: 5.50 KiB, allocs estimate: 306.
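For scale, the array holds 10^8 Float32 values (400 MB), so CUDA.jl's ~2.28 ms median works out to roughly 175 GB/s of effective read bandwidth, while AcceleratedKernels' ~5.22 ms median is roughly 77 GB/s, a ~2.3x gap.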

For a fair comparison, I'm benchmarking the shared-memory (shmem) version of CUDA.jl's reduction; the warp-shuffle version is slightly faster. The CUDA.jl implementation is here.

I set `shuffle` to `false` here so this uses the shmem-based implementation, without warp shuffles.
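For anyone reproducing this, a minimal standalone comparison might look like the sketch below (using `CUDA.@sync` to block on the GPU inside `@benchmark`, as recommended by CUDA.jl's benchmarking docs):

```julia
using CUDA, AcceleratedKernels, BenchmarkTools

# Same shape and element type as the benchmark above
X_cuda = CUDA.rand(Float32, 10_000, 10_000)

# CUDA.jl's built-in column sum, which lowers to its mapreduce kernel
@benchmark CUDA.@sync sum($X_cuda, dims=1)

# AcceleratedKernels' column sum over the same data
@benchmark CUDA.@sync AcceleratedKernels.sum($X_cuda, dims=1)
```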
