```julia
julia> using CUDA, AcceleratedKernels, BenchmarkTools

julia> X_cpu = rand(Float32, 10000, 10000);

julia> X_cuda = CuArray(X_cpu);

julia> @benchmark CUDA.synchronize(sum($X_cuda, dims=1))
BenchmarkTools.Trial: 2173 samples with 1 evaluation per sample.
 Range (min … max):  2.115 ms … 2.532 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.277 ms             ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.283 ms ± 68.805 μs ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▂▃▆▃▇▅▇▅█▇▅▅▄▄▃▃▂▂▃▅▃▅▂▄▃▄▂▁▄▁
  ▁▁▂▁▂▂▃▃▃▄▄▅▇███████████████████████████████▇▇▅▇▇▄▆▅▄▃▃▂▂▂ ▅
  2.12 ms        Histogram: frequency by time        2.44 ms <

 Memory estimate: 4.81 KiB, allocs estimate: 286.

julia> @benchmark CUDA.synchronize(AcceleratedKernels.sum($X_cuda, dims=1))
BenchmarkTools.Trial: 953 samples with 1 evaluation per sample.
 Range (min … max):  5.084 ms … 5.461 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.220 ms             ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.229 ms ± 64.660 μs ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▁▂▂▄▄▆▃█▅▃▆▅▅▂ ▃ ▁ ▁  ▁
  ▂▃▃▃▂▃▃▄▃▃▄▆▅▅█████████████████▇█▇▇█▄▆▆████▇▇▅▆▅▄▁▃▂▂▂▂▂▂▂ ▄
  5.08 ms        Histogram: frequency by time        5.39 ms <

 Memory estimate: 5.50 KiB, allocs estimate: 306.
```
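A back-of-envelope check (my own estimate, not from the issue): a `dims=1` sum is memory-bound, so each timing implies an effective read bandwidth of roughly input bytes divided by runtime. For this 10000×10000 Float32 matrix (400 MB of input):

```julia
# Rough effective-bandwidth estimate: input bytes / runtime.
# Assumes the reduction is memory-bound and reads the input once.
bytes = 10_000 * 10_000 * sizeof(Float32)   # 4.0e8 bytes of input

gbps(ms) = bytes / (ms / 1e3) / 1e9         # GB/s for a runtime given in ms

gbps(2.115)   # CUDA.jl sum, min time        → ≈ 189 GB/s
gbps(5.084)   # AcceleratedKernels.sum, min  → ≈ 79 GB/s
```

So the AcceleratedKernels kernel is reaching well under half the effective bandwidth of CUDA.jl's reduction on this run.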
For a fair comparison, I'm benchmarking the shared-memory (shmem) version of CUDA.jl's reduction; the warp-shuffle version is slightly faster still.

The CUDA.jl implementation is here.

I set shuffle to false here, so this uses the shmem-based implementation, without warp shuffles.
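For readers unfamiliar with the two strategies being compared, here is a plain-CPU sketch of the difference (illustrative only; this is not CUDA.jl's code, and the function names are made up). In the shmem variant, lanes cooperate through a shared buffer; in the shuffle variant, each lane keeps its partial sum in a register and pulls a value from a lane `offset` positions away, avoiding shared-memory traffic:

```julia
# Hypothetical CPU simulation of two intra-block reduction strategies.
# Names (shmem_reduce, shuffle_reduce) are illustrative, not CUDA.jl's API.

# shmem-style: all lanes cooperate through one shared buffer,
# halving the active stride each step.
function shmem_reduce(vals::Vector{Float32})
    buf = copy(vals)            # stands in for the shared-memory tile
    stride = length(buf) ÷ 2    # assumes a power-of-two length
    while stride >= 1
        for lane in 1:stride
            buf[lane] += buf[lane + stride]
        end
        stride ÷= 2
    end
    return buf[1]
end

# shuffle-style: each lane holds a register value and, each step, adds the
# value from the lane `offset` positions over (simulated with an array of
# "registers"); lane 1 ends up with the full sum.
function shuffle_reduce(vals::Vector{Float32})
    regs = copy(vals)
    n = length(regs)            # assumes a power-of-two length
    offset = n ÷ 2
    while offset >= 1
        regs = Float32[regs[lane] + (lane + offset <= n ? regs[lane + offset] : 0f0)
                       for lane in 1:n]
        offset ÷= 2
    end
    return regs[1]
end

x = Float32.(1:32)              # one warp's worth of values
shmem_reduce(x), shuffle_reduce(x)   # both give 528.0f0 == sum(x)
```

On a GPU the arithmetic is identical, but the shuffle version exchanges values directly between registers (`shfl_down`-style), skipping the shared-memory loads/stores and one of the synchronization points, which is why it tends to edge out the shmem version.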