
refactor(nn): modularize nn.cu into separate .inl files (#133)#137

Merged
m96-chan merged 2 commits into main from feature/issue-133-nn-modular
Dec 30, 2025

Conversation

@m96-chan
Owner

Summary

  • Split monolithic nn.cu (2673 lines) into modular .inl files matching binding structure
  • Created 9 subdirectories: activation/, norm/, rope/, linear/, attention/, tensor/, embedding/, elementwise/, cast/
  • Maintains single translation unit compilation to avoid LNK2005 duplicate symbol errors

Files Changed

| Directory | Files |
| --- | --- |
| activation/ | gelu.inl, silu.inl, sigmoid.inl, tanh.inl |
| norm/ | layernorm.inl, rmsnorm.inl |
| rope/ | rope_inplace.inl |
| linear/ | linear_bias.inl (includes softmax) |
| attention/ | sdpa_causal.inl |
| tensor/ | tensor.inl (transpose, reshape, concat, split) |
| embedding/ | embedding.inl (lookup, kv_cache ops) |
| elementwise/ | inplace.inl (add, mul, copy) |
| cast/ | cast.inl (f32<->bf16/f16) |

Key Implementation Details

  1. `nn.cu` as aggregator: includes all `.inl` files so they compile as a single translation unit
  2. `PYGPUKIT_IMPLEMENT_NN_KERNELS`: conditional compilation guard for kernel definitions
  3. Namespace handling: all `.inl` files use `using namespace nn;` for kernel access
  4. CMakeLists.txt: simplified to compile only `ops/nn/nn.cu`

Test plan

  • Build passes (SM 120a, CUDA 13.1)
  • 238 pytest tests pass
  • Key NN ops verified: GELU, SiLU, RMSNorm, LayerNorm, Transpose, Softmax
  • Pre-commit checks pass (Ruff lint, Ruff format, Mypy)

🤖 Generated with Claude Code

m96-chan and others added 2 commits December 30, 2025 16:39
Split the monolithic ops_bindings.cpp (~3000 lines) into 39 organized
binding files for better maintainability and navigation.

Directory structure:
- elementwise/: binary, inplace, compare operations
- unary/: math, trig operations
- reduction/: basic, argmax, softmax operations
- tensor/: cast, transpose, reshape, repeat operations
- embedding/: lookup, kv_cache operations
- nn/: activation, norm, attention, rope operations
- gemm/: generic, fp8, nvf4, grouped, int operations
- gemv/: generic, fp8, nvf4 operations
- sampling/: basic, topk, seed operations
- Other: quantize, paged_attention, continuous_batching, audio, cublaslt, moe

Changes:
- ops_bindings.cpp reduced from ~3000 to ~77 lines (init calls only)
- bindings_common.hpp with shared includes and forward declarations
- CMakeLists.txt updated with all new source files
- Build verified: 238 tests passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Split the monolithic nn.cu (2673 lines) into modular files matching
the binding structure from Issue #131:

- activation/: gelu.inl, silu.inl, sigmoid.inl, tanh.inl
- norm/: layernorm.inl, rmsnorm.inl
- rope/: rope_inplace.inl
- linear/: linear_bias.inl (+ softmax)
- attention/: sdpa_causal.inl
- tensor/: tensor.inl (transpose, reshape, concat, split)
- embedding/: embedding.inl (lookup, kv_cache ops)
- elementwise/: inplace.inl (add, mul, copy)
- cast/: cast.inl (f32<->bf16/f16)

Key changes:
- nn.cu now aggregates all .inl files as single translation unit
- Avoids LNK2005 duplicate symbol errors from CUDA kernels
- activation_kernels.cuh uses PYGPUKIT_IMPLEMENT_NN_KERNELS guard
- All .inl files use 'using namespace nn;' for kernel access

Build: PASS (SM 120a, CUDA 13.1)
Tests: 238 passed, 6/6 key NN ops verified

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@m96-chan m96-chan merged commit 031a9c6 into main Dec 30, 2025
13 checks passed
