feat(v0.2.9): Unified LLM Interface with ModelSpec Abstraction#80
Merged
feat(v0.2.9): Unified LLM Interface with ModelSpec Abstraction#80
Conversation
- Create scripts/ directory for development tools - Move benchmark*.py to scripts/ - Move build_cuda*.bat, compile_dump.bat to scripts/ - Move dump_*.cu debug tools to scripts/ - Move demo_scheduler_log.py to examples/ - Delete redundant TechStack.md (info in CLAUDE.md) - Update README.md Project Structure section Root directory now contains only essential project files. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
## New Features - **Softmax GPU kernel**: Row-wise softmax with numerical stability - **CausalSelfAttention**: Multi-head causal self-attention for GPT-2 - **Full TransformerBlock**: ln_1 -> attention -> residual -> ln_2 -> mlp -> residual ## Changes - `softmax()` operation added to ops API (native GPU + CPU fallback) - `CausalSelfAttention` class with QKV projection and causal masking - `TransformerBlock` updated to support attention (backward compatible) - `load_gpt2_from_safetensors()` now loads attention weights by default ## API - `gpk.softmax(input)` - Row-wise softmax - `gpk.llm.CausalSelfAttention` - Attention module - `load_gpt2_from_safetensors(path, load_attention=True)` - Full model loading ## Architecture Support - GPT-2 E2E inference now possible - GPT-2/GPT-Neo/LLaMA-style architectures supported 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Unify GPT-2 and LLaMA into common Transformer abstraction
- TransformerConfig: vocab_size, hidden_size, num_layers, num_heads,
num_kv_heads, norm_type, activation, use_rope, causal
- CausalTransformerModel with generate() method
- Attention, MLP, Norm, TransformerBlock classes
- Legacy aliases preserved for backward compatibility
- Hybrid Attention execution
- GPU SDPA for prefill (seq_len > 1)
- CPU numpy for decode (seq_len = 1) to minimize kernel overhead
- New GPU tensor ops (CUDA kernels)
- concat_axis0, repeat_interleave_axis1, transpose_3d_021, reshape_copy
- Required for GQA KV head expansion
- Add E2E demo (examples/demo_llm_e2e.py)
Benchmark (RTX 3090 Ti):
GPT-2 (124M): 11.2 tok/s decode, 89.6 ms/token
TinyLlama (1.1B): 5.3 tok/s decode, 188.2 ms/token
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- ShardedSafeTensorsFile: lazy-load sharded safetensors models - load_safetensors() auto-detects .index.json - Opens shards on-demand, not all at once - QK Norm support in Attention class - For Qwen3 style models with Q/K normalization - Reshape 3D->2D for norm, then back to 3D - Qwen3Config and load_qwen3_from_safetensors() - head_dim=128, rope_theta=1e6 - Auto-detect config from tensor shapes Note: Qwen3-8B requires ~32GB VRAM at FP32, needs FP16 support 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add dtype parameter to load_qwen3_from_safetensors (float32/float16) - Fix RoPE to handle FP16: convert to FP32 for computation, back to FP16 - Fix SDPA dtype preservation for KV cache - Fix _forward_cpu QK Norm dtype preservation - Fix o_proj output dtype in _forward_cpu - Add FP16 fallback for reshape_copy (native only supports FP32) - Add FP16 fallback for transpose_3d_021 (native only supports FP32) - Fix unused variable kv_len in sdpa_causal Tested: Qwen3-8B (16.4GB) FP16 inference fits in 24GB VRAM - Forward pass: 2297ms for 1 token - Generation: 16 tokens in 73.9s (0.2 tok/s with CPU fallbacks) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ape_copy - Add transpose_021_f16_kernel and transpose_021_bf16_kernel - Add copy_f16_kernel and copy_bf16_kernel for reshape - Update nn.cu dispatch to use new FP16/BF16 kernels - Update Python ops/basic.py to route FP16/BF16 to native kernels Previously these operations fell back to CPU for FP16, causing slow inference. Now they run natively on GPU. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Introduces ModelSpec data structure to unify model differences: - ModelSpec: frozen dataclass with weight patterns and arch flags - GPT2_SPEC: LayerNorm, GELU, combined QKV, position embeddings - LLAMA_SPEC: RMSNorm, SiLU, RoPE, GQA - QWEN3_SPEC: RMSNorm, SiLU, RoPE, GQA, QK Norm New generic loader: - load_model_from_safetensors(): auto-detects model type - detect_model_spec(): detects from tensor names - MODEL_SPECS registry for model type lookup Existing loaders preserved unchanged for backward compatibility: - load_gpt2_from_safetensors() - load_llama_from_safetensors() - load_qwen3_from_safetensors() This is a structural refactor only - no behavior changes. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Convert load_gpt2_from_safetensors, load_llama_from_safetensors, and load_qwen3_from_safetensors to thin wrappers that delegate to the generic load_model_from_safetensors function with appropriate ModelSpec. This completes the ModelSpec abstraction refactor: - All three loaders now use load_model_from_safetensors internally - Backward compatibility preserved via legacy model class wrappers - Config parameters are now ignored (auto-detected from tensor shapes) - ~150 lines of duplicated code removed 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Final ModelSpec cleanup for v0.2.9: - CausalTransformerModel is now the ONLY runtime model - Add `spec` attribute to store ModelSpec used for loading - GPT2Model, LlamaModel are now simple type aliases - RMSNorm, LayerNorm, etc. are simple aliases to Norm - Remove all legacy wrapper classes with constructors - Simplify loaders to direct load_model_from_safetensors calls - Remove redundant config parameters from loaders Code reduction: 188 deletions, 48 insertions (-140 net lines) All model-specific behavior is now controlled via: - model.spec.use_rope - model.spec.use_qk_norm - model.spec.activation - model.spec.norm_type 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add comprehensive tests for the ModelSpec refactor: - Type aliases (GPT2Model = LlamaModel = CausalTransformerModel) - Component aliases (RMSNorm = Norm, CausalSelfAttention = Attention) - ModelSpec instances and registry - Automatic model detection from tensor names - Simplified loader signatures (no config parameter) - CausalTransformerModel spec attribute - All expected exports from pygpukit.llm 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use load_model_from_safetensors with detect_model_spec - Add --dtype argument for FP16/FP32 selection - Show ModelSpec info in detection output - Display model.spec attribute after loading 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add tokenizer policy section: PyGPUkit delegates tokenization to HuggingFace - Mark pygpukit.llm.Tokenizer as EXPERIMENTAL with detailed docstring - Update README LLM section with unified interface examples - Add v0.2.9 to roadmap (unified LLM interface, ModelSpec abstraction) - Update API stability table with CausalTransformerModel 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Unified LLM Interface (CausalTransformerModel + ModelSpec) - Multi-architecture support: GPT-2, LLaMA 2/3, Qwen3 - Hybrid attention execution (CPU decode / GPU prefill) - New LLM operations: sdpa_causal, rope_inplace, silu, rmsnorm - Sharded model support for large models - Updated documentation table 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add assertions for num_kv_heads (set in __post_init__) - Add type annotations for _cos/_sin as Optional[ndarray] - Use hidden_np for numpy array, hidden for GPUArray - Fix return type annotations for __call__ method - Add assertions for _cos/_sin before indexing 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This was referenced Dec 16, 2025
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
CausalTransformerModelsupports GPT-2, LLaMA 2/3, and Qwen3detect_model_spec()identifies model type from tensor namesmodel.generate()New APIs
load_model_from_safetensors(path, dtype, spec)detect_model_spec(tensor_names)model.generate(ids, max_new_tokens, ...)gpk.sdpa_causal(q, k, v)gpk.rope_inplace(x, freqs)gpk.silu(x)gpk.rmsnorm(x, w, eps)Tested Models
Breaking Changes
None. Legacy aliases (
GPT2Model,LlamaModel, etc.) still work.Test plan
pytest tests/test_llm_unified.py -v(9 tests pass)🤖 Generated with Claude Code