refactor: modularize codebase (Issues #139-#148) #168
Merged
Conversation
Split 2087-line matmul.py into focused modules:

- generic.py: matmul, batched_matmul, transpose, linear_bias_gelu
- availability.py: all *_available() functions
- fp8.py: FP8 GEMM operations
- gemv.py: GEMV operations (M=1 optimized)
- nvf4.py: NVF4 (4-bit) operations
- grouped.py: Grouped GEMM for MoE
- w8a16.py: W8A16 GEMM operations
- __init__.py: Re-exports for backwards compatibility

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
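Most of the splits in this PR rely on the same backwards-compatibility trick: the new package's __init__.py re-imports what the old monolithic module used to expose. A minimal, hypothetical sketch for the matmul split; only the four generic.py functions are named in the commit above, and the wildcard import and submodule re-exports are assumptions made for illustration:

```python
# Hypothetical sketch of matmul/__init__.py. Only the four functions from
# generic.py are named in the commit message; the wildcard import and the
# submodule re-exports below are assumptions for illustration.
from .generic import matmul, batched_matmul, transpose, linear_bias_gelu
from .availability import *  # noqa: F401,F403 -- the *_available() feature checks
from . import fp8, gemv, nvf4, grouped, w8a16  # specialized GEMM/GEMV variants

__all__ = [
    "matmul", "batched_matmul", "transpose", "linear_bias_gelu",
    "fp8", "gemv", "nvf4", "grouped", "w8a16",
]
```

Because the package re-exports the same symbols the monolithic matmul.py defined, existing imports against the old module path keep resolving.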
Split 1828-line audio.py into focused modules:

- buffer.py: AudioBuffer, AudioRingBuffer, AudioStream, from_pcm
- vad.py: VAD, SpeechSegment
- preprocessing.py: preemphasis, deemphasis, remove_dc, noise_gate, etc.
- spectral.py: STFT, mel-spectrogram, MFCC, delta
- phase.py: ISTFT, Griffin-Lim
- pitch.py: YIN pitch detection, autocorrelation
- features.py: spectral centroid, bandwidth, rolloff, flatness, contrast
- cqt.py: Constant-Q Transform, chromagram
- hpss.py: Harmonic-Percussive Source Separation
- effects.py: time_stretch, pitch_shift
- __init__.py: Re-exports for backwards compatibility

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Create llm/models/ directory for model implementations
- Move CausalTransformerModel to llm/models/causal.py
- Update llm/model.py as re-export module for backwards compatibility
- Maintain all existing public API exports (GPT2Model, LlamaModel, etc.)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
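Unlike the package splits, this move keeps llm/model.py as a thin shim module that only forwards names. A hypothetical sketch; the re-export location of GPT2Model and LlamaModel is an assumption:

```python
# Hypothetical sketch of the thinned llm/model.py shim. The implementation now
# lives in llm/models/causal.py; this module only forwards names so that old
# imports of llm.model keep working. Where GPT2Model/LlamaModel are re-exported
# from is an assumption.
from .models.causal import CausalTransformerModel
from .models import GPT2Model, LlamaModel  # assumed to be re-exported from llm/models/

__all__ = ["CausalTransformerModel", "GPT2Model", "LlamaModel"]
```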
- Create layers/ directory with modular submodules:
  - linear.py: LinearBF16, LinearFP8
  - norm.py: Norm (RMSNorm/LayerNorm)
  - rope.py: RoPE utilities
  - attention.py: Attention layer
  - mlp.py: MLP layer
  - moe.py: MoELayer
  - block.py: TransformerBlock
  - utils.py: repack utilities
- Remove monolithic layers.py (1492 lines -> 9 focused modules)
- Maintain backwards compatibility via __init__.py re-exports

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…#143)

- Extract Dtype, TensorInfo, SafeTensorsFile, ShardedSafeTensorsFile, load_safetensors to safetensors.py
- Extract Tokenizer class to tokenizer.py
- Reduce __init__.py from ~700 lines to ~197 lines (re-exports only)
- Maintain full backwards compatibility via re-exports

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…odules (#144)

- Extract FP8QuantConfig, QATQuantConfig, PruningConfig, SparsityConfig, ModelOptimizationInfo, and FP8 utilities to quant.py
- Extract repack_model_weights to repack.py
- Reduce loader.py from 1244 lines to 614 lines
- Maintain full backwards compatibility via re-exports

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Split monolithic nn.py (1000+ lines) into native structure-matching submodules:

- activation.py: gelu, silu, sigmoid, tanh
- norm.py: layernorm, rmsnorm
- attention.py: sdpa_causal, sdpa_causal_fixed_cache, sdpa_causal_fixed_cache_ptr
- rope.py: rope_inplace, rope_inplace_f32table
- linear.py: bias_add_inplace, split_qkv_batch, slice_rows_range_ptr
- recurrent.py: lstm_forward, lstm_bidirectional

Backwards-compatible via __init__.py re-exports.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extract memory management utilities into dedicated module:

- get_memory_info: Query GPU memory
- copy_to_device/copy_to_device_async: H2D transfers
- copy_device_to_device_async/offset: D2D transfers
- synchronize: Device synchronization

Mirrors native/core/memory.hpp structure. GPUArray class remains in array.py (well-organized as-is).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
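For orientation, a hypothetical usage sketch of the new memory module. The function names come from the commit above, but the import path, signatures, and return shapes are all assumptions:

```python
# Hypothetical usage of the extracted memory module. Function names come from
# the commit message; the import path, signatures, and return shapes are
# assumptions made for illustration only.
import numpy as np

from pygpukit.core import memory  # assumed import path

free, total = memory.get_memory_info()  # assumed to return (free_bytes, total_bytes)
print(f"GPU memory free: {free / 1e9:.2f} GB of {total / 1e9:.2f} GB")

host = np.ones(1 << 20, dtype=np.float32)
dev = memory.copy_to_device(host)  # H2D copy; assumed to return a device array
memory.synchronize()               # block until outstanding async work completes
```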
Extract specialized operations from monolithic matmul.cu:

- fused.cu: Fused linear+bias+GELU with CUTLASS epilogue fusion
- batched.cu: Batched strided GEMM placeholder

matmul.cu now focuses on core GEMM dispatch logic.

Build verified: SM 120a, CUDA 13.1.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
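For reference, the math the fused kernel implements, written out unfused in NumPy. This is only an illustration of what epilogue fusion folds into one launch, not the project's API:

```python
# Unfused NumPy reference for what fused.cu computes in a single kernel:
# out = GELU(x @ W + b). Purely illustrative; not the project's API.
import numpy as np

def gelu(x: np.ndarray) -> np.ndarray:
    # tanh approximation of GELU, a common choice in fused epilogues
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def linear_bias_gelu_reference(x: np.ndarray, w: np.ndarray, b: np.ndarray) -> np.ndarray:
    # The unfused version makes three passes over memory (GEMM, bias add,
    # activation); the fused kernel folds the bias add and GELU into the
    # GEMM epilogue via CUTLASS.
    return gelu(x @ w + b)
```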
Reorganize examples into logical directories:

- benchmarks/: Performance benchmarks (matmul, CUDA Graph)
- chat/: Chat CLI applications (standard, MoE, thinking, Triton)
- demos/archived/: Version-specific demos (v01-v026) for reference

Keep current demos at top level: demo_gpu.py, demo_cuda_graph.py, demo_llm_e2e.py, etc.

Update README.md with new structure and usage instructions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
This PR refactors the PyGPUkit codebase to improve modularity and maintainability:
Python Modules
- Split matmul.py into modular package (matmul/, gemm/, gemv/)
- Split audio.py into modular package (audio/transforms/, audio/analysis/)
- Move CausalTransformerModel to llm/models/ package
- Split layers.py into layers/ package by layer type (attention, ffn, norm, embedding, recurrent)
- Split nn.py into modular subpackage matching native structure
- Add memory.py module for memory utilities

Native (C++/CUDA)

- Refactor matmul.cu dispatcher - extract fused ops and batched GEMM to separate files

Examples

- Reorganize examples into logical directories (benchmarks/, chat/, demos/archived/)

Changes
Backwards Compatibility
All public APIs are preserved via re-exports in __init__.py files. Existing code will continue to work.
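One way to picture the guarantee is a hypothetical sanity check like the one below; the package paths are assumptions made for illustration:

```python
# Hypothetical sanity check of the re-export guarantee: an old-style import
# against the former monolithic module should resolve to the same object as an
# import from the new submodule. Package paths are assumptions for illustration.
from pygpukit.ops.matmul import matmul                        # old path, now a package
from pygpukit.ops.matmul.generic import matmul as matmul_new  # new module layout

assert matmul is matmul_new  # the __init__.py re-export forwards the same function
```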
Test plan

🤖 Generated with Claude Code