Add reduce_sum op #24

Draft
Tcc0403 wants to merge 5 commits into Kernel-Heim:main from Tcc0403:tcc/reduce-sum
Tcc0403 commented Feb 1, 2026

This PR adds a simple reduce_sum kernel, which can also serve as a starting point for newcomers.

Strategies

  • leveraging a TV (thread-value) layout for vectorized loads
  • using warp reduction intrinsics to avoid synchronization and reduce memory pressure
  • skipping shared memory when the reduction dimension is small
  • performing the reduction in higher precision for numerical stability
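
To make the warp-reduction and higher-precision points concrete, here is a minimal CPU-side sketch in plain Python. It is not the kernel from this PR: it only simulates the data flow of a butterfly (XOR-shuffle) reduction across 32 lanes and contrasts it with accumulating directly in half precision. All function names are hypothetical, and the TV layout and shared-memory path are not modeled.

```python
import struct

WARP_SIZE = 32  # lanes per warp on NVIDIA GPUs


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def warp_allreduce_sum(lane_vals):
    """Simulate a butterfly reduction: log2(32) = 5 exchange steps, no shared memory."""
    assert len(lane_vals) == WARP_SIZE
    vals = list(lane_vals)
    offset = WARP_SIZE // 2
    while offset >= 1:
        # Each lane adds the value currently held by its XOR partner.
        vals = [vals[lane] + vals[lane ^ offset] for lane in range(WARP_SIZE)]
        offset //= 2
    return vals  # after the last step every lane holds the full sum


def reduce_row_wide(row):
    """Each lane accumulates a strided slice in a wide accumulator, then the warp reduces."""
    partials = [0.0] * WARP_SIZE
    for i, x in enumerate(row):
        partials[i % WARP_SIZE] += x
    return warp_allreduce_sum(partials)[0]


def reduce_row_fp16_accum(row):
    """Naive alternative: round the running sum back to fp16 after every add."""
    acc = 0.0
    for x in row:
        acc = to_fp16(acc + x)
    return acc


row = [to_fp16(0.1)] * 4096      # fp16 inputs, as a 16-bit tensor would hold them
exact = sum(row)                 # Python floats are 64-bit, used here as the reference
wide = reduce_row_wide(row)
narrow = reduce_row_fp16_accum(row)
print(f"exact={exact} wide={wide} fp16_accum={narrow}")
```

The half-precision accumulator stops growing once the running sum is large enough that a single addend falls below half a unit in the last place, which is the failure mode the wider accumulator avoids.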

Benchmark results

I have only run a simple benchmark on my local machine so far. On a 4096 x 4096 input, the kernel's p50 latency is already roughly 1.56x lower than the torch reference (run with --compile-ref).

On RTX 3080

❯ uv run bench/benchmark_reduce_sum.py --m 4096 --n 4096 --compile-ref
copy_transpose p50: 0.0655 ms, BW: 512.13 GB/s
reference p50: 0.1024 ms, BW: 327.76 GB/s
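
The bandwidth figures above can be sanity-checked with a little arithmetic. The sketch below assumes, as an inference from the numbers rather than something stated in this PR, that the benchmark counts one read of the M x N input plus one write of the M outputs at 2 bytes per element (fp16 or bf16). The helper name is hypothetical.

```python
def effective_bandwidth_gbs(m: int, n: int, p50_ms: float, bytes_per_elem: int = 2) -> float:
    """Effective bandwidth in GB/s, assuming one input read and one output write."""
    bytes_moved = (m * n + m) * bytes_per_elem  # assumption: 16-bit input and output
    return bytes_moved / (p50_ms * 1e-3) / 1e9


M = N = 4096
kernel_bw = effective_bandwidth_gbs(M, N, 0.0655)  # first p50 line in the log above
ref_bw = effective_bandwidth_gbs(M, N, 0.1024)     # torch reference
speedup = 0.1024 / 0.0655

print(f"kernel: {kernel_bw:.2f} GB/s, reference: {ref_bw:.2f} GB/s, speedup: {speedup:.2f}x")
```

Under these assumptions the reference line reproduces the logged 327.76 GB/s, and the kernel line comes out at about 512.4 GB/s; the small gap to the logged 512.13 GB/s is consistent with the p50 time having been rounded to four decimals before printing.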

Limitations

Future work

Related issue: #20

Signed-off-by: Tcc0403 <76503978+Tcc0403@users.noreply.github.com>