
V0.2.18 refactor #173

Merged

m96-chan merged 11 commits into main from v0.2.18-refactor on Dec 30, 2025

Conversation

@m96-chan
Owner

No description provided.

m96-chan and others added 11 commits on December 30, 2025 at 23:38
Split 2087-line matmul.py into focused modules:
- generic.py: matmul, batched_matmul, transpose, linear_bias_gelu
- availability.py: all *_available() functions
- fp8.py: FP8 GEMM operations
- gemv.py: GEMV operations (M=1 optimized)
- nvf4.py: NVF4 (4-bit) operations
- grouped.py: Grouped GEMM for MoE
- w8a16.py: W8A16 GEMM operations
- __init__.py: Re-exports for backwards compatibility

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
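
A minimal sketch of the re-export pattern this split relies on; the submodule names come from the commit above, but apart from the four `generic.py` functions the symbol names shown are illustrative, not the repo's actual API:

```python
# ops/matmul/__init__.py -- illustrative sketch, not the repository's file.
# Old call sites such as `from ops.matmul import matmul` keep working
# because the package re-exports every public name from its new submodules.
from .generic import matmul, batched_matmul, transpose, linear_bias_gelu
from .availability import fp8_available, nvf4_available  # illustrative names
from .fp8 import fp8_gemm                                # illustrative name
from .gemv import gemv                                   # illustrative name

__all__ = [
    "matmul", "batched_matmul", "transpose", "linear_bias_gelu",
    "fp8_available", "nvf4_available", "fp8_gemm", "gemv",
]
```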
Split 1828-line audio.py into focused modules:
- buffer.py: AudioBuffer, AudioRingBuffer, AudioStream, from_pcm
- vad.py: VAD, SpeechSegment
- preprocessing.py: preemphasis, deemphasis, remove_dc, noise_gate, etc.
- spectral.py: STFT, mel-spectrogram, MFCC, delta
- phase.py: ISTFT, Griffin-Lim
- pitch.py: YIN pitch detection, autocorrelation
- features.py: spectral centroid, bandwidth, rolloff, flatness, contrast
- cqt.py: Constant-Q Transform, chromagram
- hpss.py: Harmonic-Percussive Source Separation
- effects.py: time_stretch, pitch_shift
- __init__.py: Re-exports for backwards compatibility

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
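
For reference, the pre-emphasis named in preprocessing.py above is the standard one-pole high-pass y[n] = x[n] - a*x[n-1]; a minimal NumPy sketch, with the coefficient and signatures assumed rather than taken from the repo:

```python
import numpy as np

def preemphasis(x: np.ndarray, coeff: float = 0.97) -> np.ndarray:
    """y[n] = x[n] - coeff * x[n-1]; boosts high frequencies before analysis."""
    y = np.empty_like(x)
    y[0] = x[0]
    y[1:] = x[1:] - coeff * x[:-1]
    return y

def deemphasis(y: np.ndarray, coeff: float = 0.97) -> np.ndarray:
    """Inverse IIR filter: x[n] = y[n] + coeff * x[n-1]."""
    x = np.empty_like(y)
    x[0] = y[0]
    for n in range(1, len(y)):
        x[n] = y[n] + coeff * x[n - 1]
    return x
```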
- Create llm/models/ directory for model implementations
- Move CausalTransformerModel to llm/models/causal.py
- Update llm/model.py as re-export module for backwards compatibility
- Maintain all existing public API exports (GPT2Model, LlamaModel, etc.)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
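
The compatibility shim described here likely reduces to a one-line re-export; a sketch:

```python
# llm/model.py -- backwards-compatibility shim (sketch).
# `from llm.model import CausalTransformerModel` keeps resolving even though
# the implementation now lives in llm/models/causal.py.
from .models.causal import CausalTransformerModel  # noqa: F401
```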
- Create layers/ directory with modular submodules:
  - linear.py: LinearBF16, LinearFP8
  - norm.py: Norm (RMSNorm/LayerNorm)
  - rope.py: RoPE utilities
  - attention.py: Attention layer
  - mlp.py: MLP layer
  - moe.py: MoELayer
  - block.py: TransformerBlock
  - utils.py: repack utilities
- Remove monolithic layers.py (1492 lines -> 9 focused modules)
- Maintain backwards compatibility via __init__.py re-exports

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
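
As background on the unified Norm layer: RMSNorm drops LayerNorm's mean subtraction and bias, scaling by the root mean square alone. A NumPy sketch of both standard formulas (the repo's actual signatures may differ):

```python
import numpy as np

def rmsnorm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # y = x / sqrt(mean(x^2) + eps) * weight -- no centering, no bias.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def layernorm(x: np.ndarray, weight: np.ndarray, bias: np.ndarray,
              eps: float = 1e-5) -> np.ndarray:
    # Classic LayerNorm: center, scale by variance, then affine transform.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * weight + bias
```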
…#143)

- Extract Dtype, TensorInfo, SafeTensorsFile, ShardedSafeTensorsFile,
  load_safetensors to safetensors.py
- Extract Tokenizer class to tokenizer.py
- Reduce __init__.py from ~700 lines to ~197 lines (re-exports only)
- Maintain full backwards compatibility via re-exports

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
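
For context on what SafeTensorsFile has to parse: the safetensors format is an 8-byte little-endian header length followed by a JSON header mapping tensor names to dtype, shape, and byte offsets into the data region. A minimal sketch of reading the header, independent of the repo's actual implementation:

```python
import json
import struct

def read_safetensors_header(path: str) -> dict:
    """Return the JSON header of a .safetensors file.

    Layout: [u64 little-endian header size][JSON header][raw tensor data].
    Each entry maps a tensor name to {"dtype", "shape", "data_offsets"}.
    """
    with open(path, "rb") as f:
        (header_size,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_size))
```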
…odules (#144)

- Extract FP8QuantConfig, QATQuantConfig, PruningConfig, SparsityConfig,
  ModelOptimizationInfo, and FP8 utilities to quant.py
- Extract repack_model_weights to repack.py
- Reduce loader.py from 1244 lines to 614 lines
- Maintain full backwards compatibility via re-exports

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
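
As a sketch of the arithmetic an FP8QuantConfig typically drives: per-tensor FP8 (E4M3) quantization picks a scale so the tensor's absolute maximum maps to the format's largest finite value, 448. Illustrative only; the repo's utilities may differ:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_e4m3_scale(w: np.ndarray) -> float:
    """Per-tensor scale such that abs(w).max() / scale == E4M3_MAX."""
    amax = float(np.abs(w).max())
    return amax / E4M3_MAX if amax > 0 else 1.0

def fake_quant_e4m3(w: np.ndarray) -> np.ndarray:
    """Simulate quantize->dequantize by clipping to the E4M3 dynamic range.

    (A real kernel would also round to the E4M3 grid; this only shows
    how the scale bounds the representable range.)
    """
    s = fp8_e4m3_scale(w)
    return np.clip(w / s, -E4M3_MAX, E4M3_MAX) * s
```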
Split monolithic nn.py (1000+ lines) into submodules that mirror the native source structure:
- activation.py: gelu, silu, sigmoid, tanh
- norm.py: layernorm, rmsnorm
- attention.py: sdpa_causal, sdpa_causal_fixed_cache, sdpa_causal_fixed_cache_ptr
- rope.py: rope_inplace, rope_inplace_f32table
- linear.py: bias_add_inplace, split_qkv_batch, slice_rows_range_ptr
- recurrent.py: lstm_forward, lstm_bidirectional

Backwards-compatible via __init__.py re-exports.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
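
For reference, the standard formulas behind the activation.py entries: SiLU is x * sigmoid(x), and GELU is commonly computed via the tanh approximation. A NumPy sketch (which exact variant the native kernels use is not specified here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    # SiLU / swish: x * sigmoid(x)
    return x * sigmoid(x)

def gelu(x):
    # tanh approximation: 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))
```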
Extract memory management utilities into dedicated module:
- get_memory_info: Query GPU memory
- copy_to_device/copy_to_device_async: H2D transfers
- copy_device_to_device_async/offset: D2D transfers
- synchronize: Device synchronization

Mirrors native/core/memory.hpp structure.
GPUArray class remains in array.py (well-organized as-is).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
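
A hypothetical usage sketch of the extracted utilities; the function names come from the commit above, but the module path and the signatures shown (return tuple, argument order) are assumptions:

```python
# Usage sketch only: names from the commit, path and signatures assumed.
import numpy as np
from core.memory import copy_to_device_async, get_memory_info, synchronize

free_bytes, total_bytes = get_memory_info()   # assumed to return (free, total)
print(f"GPU memory free: {free_bytes / 2**30:.1f} of {total_bytes / 2**30:.1f} GiB")

host_buf = np.zeros(1 << 20, dtype=np.float32)
dev_buf = copy_to_device_async(host_buf)      # async H2D copy, assumed signature
synchronize()                                 # block until the transfer completes
```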
Extract specialized operations from monolithic matmul.cu:
- fused.cu: Fused linear+bias+GELU with CUTLASS epilogue fusion
- batched.cu: Batched strided GEMM placeholder

matmul.cu now focuses on core GEMM dispatch logic.
Build verified: SM 120a, CUDA 13.1.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
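
The reference semantics the batched.cu placeholder will eventually implement: one GEMM per batch index over strided operands. A NumPy correctness oracle, not the CUDA code itself:

```python
import numpy as np

def batched_matmul_ref(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Reference for batched strided GEMM: C[i] = A[i] @ B[i] per batch i.

    a: (batch, m, k), b: (batch, k, n) -> (batch, m, n).
    """
    return np.einsum("bmk,bkn->bmn", a, b)

# Quick self-check against a per-batch loop:
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8, 16))
b = rng.standard_normal((4, 16, 32))
expected = np.stack([x @ y for x, y in zip(a, b)])
assert np.allclose(batched_matmul_ref(a, b), expected)
```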
Reorganize examples into logical directories:
- benchmarks/: Performance benchmarks (matmul, CUDA Graph)
- chat/: Chat CLI applications (standard, MoE, thinking, Triton)
- demos/archived/: Version-specific demos (v01-v026) for reference

Keep current demos at top level:
- demo_gpu.py, demo_cuda_graph.py, demo_llm_e2e.py, etc.

Update README.md with new structure and usage instructions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Updated directory structure to reflect modular organization:
- Python: ops/, llm/, core/ now show subpackages
- Native: matmul/ shows fused.cu, batched.cu, nn/ substructure
- Examples: organized into benchmarks/, chat/, demos/

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
m96-chan merged commit 591fb7a into main on Dec 30, 2025
13 checks passed