Add missing __syncwarp() in reduce kernel + CUDA 13.0 build fix by cuzelac · Pull Request #15 · JeffreyXiang/FlexGEMM

cuzelac · 2026-03-08T10:56:57Z

Summary

Two fixes discovered while debugging Trellis2 on NVIDIA Blackwell (RTX 5090, sm_120):

1. Missing `__syncwarp()` in warp-level reduction

`reduce_code_cuda_kernel` in `neighbor_map.cu` performs a warp-level reduction over shared memory without `__syncwarp()` between iterations. Each iteration reads `buf[threadIdx.x + cur_len]`, a slot that another thread in the warp wrote as its own `buf[threadIdx.x]` in the prior iteration. Without a warp barrier, there is no guarantee that write is visible to the reading thread before the next read.

Warps often appear to execute in lockstep, but since Volta's independent thread scheduling the hardware does not guarantee it. Relying on implicit warp synchrony is undefined behavior per the CUDA programming model and may break on current as well as future architectures.
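For illustration, here is a minimal sketch of the pattern described above. It is not the FlexGEMM source: `warp_sum_kernel`, `warp_sum_host`, the sum operation, and the fixed 32-element buffer are all assumptions made for the example. Only `buf`, `cur_len`, and the placement of `__syncwarp()` mirror the description. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel (not the FlexGEMM code): one warp sums 32 ints
// through shared memory with a tree reduction.
__global__ void warp_sum_kernel(const int* in, int* out) {
    __shared__ int buf[32];
    buf[threadIdx.x] = in[threadIdx.x];
    __syncwarp();  // make the initial stores visible to the whole warp

    for (int cur_len = 16; cur_len > 0; cur_len >>= 1) {
        if (threadIdx.x < cur_len) {
            // Reads a slot that a *different* thread wrote last iteration.
            buf[threadIdx.x] += buf[threadIdx.x + cur_len];
        }
        // Barrier sits outside the branch so all 32 threads reach it.
        // It orders this iteration's writes before the next iteration's reads.
        __syncwarp();
    }

    if (threadIdx.x == 0) {
        *out = buf[0];
    }
}

// Hypothetical host helper: copies 32 ints to the device, launches a single
// warp, and returns the reduced sum.
int warp_sum_host(const int* host_in) {
    int* d_in = nullptr;
    int* d_out = nullptr;
    int result = 0;
    cudaMalloc(&d_in, 32 * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, host_in, 32 * sizeof(int), cudaMemcpyHostToDevice);
    warp_sum_kernel<<<1, 32>>>(d_in, d_out);
    // This blocking copy on the default stream waits for the kernel to finish.
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Within one iteration the reads (indices `cur_len` and above) and the writes (indices below `cur_len`) do not overlap, so in this sketch a single barrier per iteration is enough. Removing it reintroduces the race between one iteration's writes and the next iteration's reads.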

2. `-allow-unsupported-compiler` for CUDA 13.0

CUDA 13.0 with MSVC 2025 (Visual Studio 18) fails to compile without this flag, as nvcc only officially supports MSVC 2019-2022.

Testing

- Built and tested on RTX 5090 (sm_120), PyTorch 2.10.0+cu130, CUDA 13.0, Windows
- Full Trellis2 image-to-3D pipeline including sparse convolution operations
- Multiple successful runs

The warp-level reduction loop in `reduce_code_cuda_kernel` reads from shared memory at `buf[threadIdx.x + cur_len]` after a prior iteration wrote to `buf[threadIdx.x]`. Without `__syncwarp()`, there is no guarantee that the write is visible to other threads in the warp before the next iteration reads it. Warps often appear to execute in lockstep, but since Volta's independent thread scheduling the hardware does not guarantee it, so relying on it is undefined behavior per the CUDA programming model and may break on current as well as future architectures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

CUDA 13.0 with MSVC 2025 (Visual Studio 18) fails to compile without this flag, as nvcc only officially supports MSVC 2019-2022. The flag allows compilation on newer toolchains without affecting runtime behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

cuzelac · 2026-03-08T11:10:27Z

This may also address:

- cuda 13.0 support #8: the `-allow-unsupported-compiler` flag in this PR enables building on CUDA 13.0
- Silent failure in RTX 5080 #13: the `__syncwarp()` fix or the related CuMesh stream fix may resolve this

JeffreyXiang · 2026-03-11T04:45:00Z

Thanks! Merged

cuzelac · 2026-03-11T20:39:02Z

Happy to contribute - thanks for your work on this!

cuzelac and others added 2 commits March 8, 2026 01:43

JeffreyXiang merged commit 9f2f050 into JeffreyXiang:main Mar 11, 2026

This was referenced Apr 8, 2026

- Add AMD ROCm/HIP support (2-line fix) #17 (Closed)
- Add AMD ROCm/HIP support (2-line fix) #18 (Merged)

JeffreyXiang merged 2 commits into JeffreyXiang:main from cuzelac:fix/syncwarp-reduce-kernel
