Issue 20 reduce sum #27

Open
debashishc wants to merge 3 commits into Kernel-Heim:main from debashishc:issue-20-reduce-sum

@debashishc
Contributor

PR up: reduce_sum kernel plus benchmark cleanup. Better late than never; I cleaned up the implementation a bit. Next I'm working on reducing the number of for loops.

Highlights:

  • Vectorized 128‑bit loads (CopyAtom + TV layout)
  • GMEM→SMEM via cp.async, then SMEM→registers
  • Thread‑local reduction + warp shuffle reduction
  • Block reduction via shared memory
  • Alignment/divisibility constraints for 128‑bit copies
  • Removed explicit FP32 accumulation in the reference implementation to cut overhead
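
For reviewers who want to check the reduction order, the three reduction stages listed above (thread-local, warp shuffle, block via shared memory) can be modeled on the host. This is only a pure-Python sketch, not the kernel in this PR: `warp_reduce_sum`, `block_reduce_sum`, the 128-thread block, and the 4-element vector (4 x FP32 = 128 bits) are all assumptions made for illustration, and the cp.async staging through SMEM is not modeled.

```python
WARP_SIZE = 32  # lanes per warp on current NVIDIA GPUs


def warp_reduce_sum(lane_vals):
    """Model a shuffle-down tree reduction over one warp; lane 0 holds the sum."""
    vals = list(lane_vals)
    assert len(vals) == WARP_SIZE
    offset = WARP_SIZE // 2
    while offset > 0:
        snapshot = list(vals)  # all lanes read before any lane writes
        for lane in range(WARP_SIZE):
            src = lane + offset
            # Lanes whose source is out of range are left unchanged in this
            # model; only lane 0's final value is used.
            if src < WARP_SIZE:
                vals[lane] = snapshot[lane] + snapshot[src]
        offset //= 2
    return vals[0]


def block_reduce_sum(row, num_threads=128, vec=4):
    """Hypothetical host-side model of one thread block reducing one row."""
    assert num_threads % WARP_SIZE == 0
    assert num_threads // WARP_SIZE <= WARP_SIZE, "one warp must cover all warp partials"
    assert len(row) % vec == 0, "row length must be divisible by the vector width"

    # Stage 1: thread-local reduction over strided, vector-sized chunks.
    num_vecs = len(row) // vec
    partial = [0.0] * num_threads
    for tid in range(num_threads):
        for v in range(tid, num_vecs, num_threads):
            partial[tid] += sum(row[v * vec:(v + 1) * vec])

    # Stage 2: each warp reduces its 32 partials (shuffle tree).
    num_warps = num_threads // WARP_SIZE
    smem = [
        warp_reduce_sum(partial[w * WARP_SIZE:(w + 1) * WARP_SIZE])
        for w in range(num_warps)
    ]

    # Stage 3: the first warp reduces the per-warp partials held in "shared
    # memory"; lanes without a partial contribute zero.
    first_warp = smem + [0.0] * (WARP_SIZE - num_warps)
    return warp_reduce_sum(first_warp)
```

The divisibility assert mirrors the 128-bit copy constraint from the list: with 4-element vectors, a row whose length is not a multiple of 4 cannot be covered by full vector loads alone.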

Next step: remove the tail loop via a masked copy path to keep vectorized loads across the full row.
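
To make the planned change concrete, here is a rough host-side comparison of a tail loop versus a masked copy path. It is a sketch under assumptions, not the design this PR will land: `load_vec_masked`, `row_sum_masked`, and `row_sum_tail_loop` are hypothetical names, the vector width of 4 is assumed, and out-of-bounds elements are filled with zero (the identity for a sum). A real kernel may predicate whole vectors or pad the buffer rather than mask per element.

```python
def load_vec_masked(row, start, vec=4, fill=0.0):
    """Return `vec` elements from `start`; out-of-bounds positions yield `fill`."""
    return [row[i] if i < len(row) else fill for i in range(start, start + vec)]


def row_sum_masked(row, vec=4):
    """Single loop: every step is a full-width (possibly masked) vector load."""
    total = 0.0
    for start in range(0, len(row), vec):
        total += sum(load_vec_masked(row, start, vec))
    return total


def row_sum_tail_loop(row, vec=4):
    """Vectorized main loop followed by a scalar tail loop for the remainder."""
    main = len(row) - len(row) % vec
    total = 0.0
    for start in range(0, main, vec):
        total += sum(row[start:start + vec])
    for i in range(main, len(row)):  # scalar tail
        total += row[i]
    return total
```

Both functions return the same sum for a row of any length; the masked version just keeps a single vector-shaped loop across the full row.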

Refs #20

@debashishc debashishc requested a review from Tcc0403 February 4, 2026 12:22