feat(v0.2.9): Unified LLM Interface with ModelSpec Abstraction by m96-chan · Pull Request #80 · m96-chan/PyGPUkit

m96-chan · 2025-12-16T04:34:37Z

Summary

Unified LLM Interface: Single CausalTransformerModel supports GPT-2, LLaMA 2/3, and Qwen3
ModelSpec Abstraction: Architecture differences (norm type, activation, RoPE, QK-norm) handled declaratively
Auto-detection: detect_model_spec() identifies model type from tensor names
Hybrid Attention: CPU decode (seq_len=1) + GPU prefill for optimal performance
KV-Cache Generation: Efficient autoregressive text generation with model.generate()
Sharded Model Support: Load large models split across multiple safetensors files
FP16/FP32 Selection: Choose dtype at load time

New APIs

API	Description
`load_model_from_safetensors(path, dtype, spec)`	Unified model loader
`detect_model_spec(tensor_names)`	Auto-detect GPT2/LLaMA/Qwen3
`model.generate(ids, max_new_tokens, ...)`	KV-cache generation
`gpk.sdpa_causal(q, k, v)`	Scaled Dot-Product Attention
`gpk.rope_inplace(x, freqs)`	Rotary Position Embedding
`gpk.silu(x)`	SiLU activation
`gpk.rmsnorm(x, w, eps)`	RMS LayerNorm

Tested Models

Model	Size	Status
GPT-2	124M	✅ 8.7 tok/s
TinyLlama-1.1B	1.1B	✅ 1.8 tok/s (FP16)
Qwen3-8B	8B	✅ 0.2 tok/s (FP16)

Breaking Changes

None. Legacy aliases (GPT2Model, LlamaModel, etc.) still work.

Test plan

pytest tests/test_llm_unified.py -v (9 tests pass)
GPT-2 E2E demo with generation
TinyLlama-1.1B E2E demo with FP16
Qwen3-8B E2E demo with HuggingFace tokenizers

🤖 Generated with Claude Code

- Create scripts/ directory for development tools - Move benchmark*.py to scripts/ - Move build_cuda*.bat, compile_dump.bat to scripts/ - Move dump_*.cu debug tools to scripts/ - Move demo_scheduler_log.py to examples/ - Delete redundant TechStack.md (info in CLAUDE.md) - Update README.md Project Structure section Root directory now contains only essential project files. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

## New Features - **Softmax GPU kernel**: Row-wise softmax with numerical stability - **CausalSelfAttention**: Multi-head causal self-attention for GPT-2 - **Full TransformerBlock**: ln_1 -> attention -> residual -> ln_2 -> mlp -> residual ## Changes - `softmax()` operation added to ops API (native GPU + CPU fallback) - `CausalSelfAttention` class with QKV projection and causal masking - `TransformerBlock` updated to support attention (backward compatible) - `load_gpt2_from_safetensors()` now loads attention weights by default ## API - `gpk.softmax(input)` - Row-wise softmax - `gpk.llm.CausalSelfAttention` - Attention module - `load_gpt2_from_safetensors(path, load_attention=True)` - Full model loading ## Architecture Support - GPT-2 E2E inference now possible - GPT-2/GPT-Neo/LLaMA-style architectures supported 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Unify GPT-2 and LLaMA into common Transformer abstraction - TransformerConfig: vocab_size, hidden_size, num_layers, num_heads, num_kv_heads, norm_type, activation, use_rope, causal - CausalTransformerModel with generate() method - Attention, MLP, Norm, TransformerBlock classes - Legacy aliases preserved for backward compatibility - Hybrid Attention execution - GPU SDPA for prefill (seq_len > 1) - CPU numpy for decode (seq_len = 1) to minimize kernel overhead - New GPU tensor ops (CUDA kernels) - concat_axis0, repeat_interleave_axis1, transpose_3d_021, reshape_copy - Required for GQA KV head expansion - Add E2E demo (examples/demo_llm_e2e.py) Benchmark (RTX 3090 Ti): GPT-2 (124M): 11.2 tok/s decode, 89.6 ms/token TinyLlama (1.1B): 5.3 tok/s decode, 188.2 ms/token 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- ShardedSafeTensorsFile: lazy-load sharded safetensors models - load_safetensors() auto-detects .index.json - Opens shards on-demand, not all at once - QK Norm support in Attention class - For Qwen3 style models with Q/K normalization - Reshape 3D->2D for norm, then back to 3D - Qwen3Config and load_qwen3_from_safetensors() - head_dim=128, rope_theta=1e6 - Auto-detect config from tensor shapes Note: Qwen3-8B requires ~32GB VRAM at FP32, needs FP16 support 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add dtype parameter to load_qwen3_from_safetensors (float32/float16) - Fix RoPE to handle FP16: convert to FP32 for computation, back to FP16 - Fix SDPA dtype preservation for KV cache - Fix _forward_cpu QK Norm dtype preservation - Fix o_proj output dtype in _forward_cpu - Add FP16 fallback for reshape_copy (native only supports FP32) - Add FP16 fallback for transpose_3d_021 (native only supports FP32) - Fix unused variable kv_len in sdpa_causal Tested: Qwen3-8B (16.4GB) FP16 inference fits in 24GB VRAM - Forward pass: 2297ms for 1 token - Generation: 16 tokens in 73.9s (0.2 tok/s with CPU fallbacks) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…ape_copy - Add transpose_021_f16_kernel and transpose_021_bf16_kernel - Add copy_f16_kernel and copy_bf16_kernel for reshape - Update nn.cu dispatch to use new FP16/BF16 kernels - Update Python ops/basic.py to route FP16/BF16 to native kernels Previously these operations fell back to CPU for FP16, causing slow inference. Now they run natively on GPU. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Introduces ModelSpec data structure to unify model differences: - ModelSpec: frozen dataclass with weight patterns and arch flags - GPT2_SPEC: LayerNorm, GELU, combined QKV, position embeddings - LLAMA_SPEC: RMSNorm, SiLU, RoPE, GQA - QWEN3_SPEC: RMSNorm, SiLU, RoPE, GQA, QK Norm New generic loader: - load_model_from_safetensors(): auto-detects model type - detect_model_spec(): detects from tensor names - MODEL_SPECS registry for model type lookup Existing loaders preserved unchanged for backward compatibility: - load_gpt2_from_safetensors() - load_llama_from_safetensors() - load_qwen3_from_safetensors() This is a structural refactor only - no behavior changes. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Convert load_gpt2_from_safetensors, load_llama_from_safetensors, and load_qwen3_from_safetensors to thin wrappers that delegate to the generic load_model_from_safetensors function with appropriate ModelSpec. This completes the ModelSpec abstraction refactor: - All three loaders now use load_model_from_safetensors internally - Backward compatibility preserved via legacy model class wrappers - Config parameters are now ignored (auto-detected from tensor shapes) - ~150 lines of duplicated code removed 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Final ModelSpec cleanup for v0.2.9: - CausalTransformerModel is now the ONLY runtime model - Add `spec` attribute to store ModelSpec used for loading - GPT2Model, LlamaModel are now simple type aliases - RMSNorm, LayerNorm, etc. are simple aliases to Norm - Remove all legacy wrapper classes with constructors - Simplify loaders to direct load_model_from_safetensors calls - Remove redundant config parameters from loaders Code reduction: 188 deletions, 48 insertions (-140 net lines) All model-specific behavior is now controlled via: - model.spec.use_rope - model.spec.use_qk_norm - model.spec.activation - model.spec.norm_type 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add comprehensive tests for the ModelSpec refactor: - Type aliases (GPT2Model = LlamaModel = CausalTransformerModel) - Component aliases (RMSNorm = Norm, CausalSelfAttention = Attention) - ModelSpec instances and registry - Automatic model detection from tensor names - Simplified loader signatures (no config parameter) - CausalTransformerModel spec attribute - All expected exports from pygpukit.llm 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Use load_model_from_safetensors with detect_model_spec - Add --dtype argument for FP16/FP32 selection - Show ModelSpec info in detection output - Display model.spec attribute after loading 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add tokenizer policy section: PyGPUkit delegates tokenization to HuggingFace - Mark pygpukit.llm.Tokenizer as EXPERIMENTAL with detailed docstring - Update README LLM section with unified interface examples - Add v0.2.9 to roadmap (unified LLM interface, ModelSpec abstraction) - Update API stability table with CausalTransformerModel 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Unified LLM Interface (CausalTransformerModel + ModelSpec) - Multi-architecture support: GPT-2, LLaMA 2/3, Qwen3 - Hybrid attention execution (CPU decode / GPU prefill) - New LLM operations: sdpa_causal, rope_inplace, silu, rmsnorm - Sharded model support for large models - Updated documentation table 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add assertions for num_kv_heads (set in __post_init__) - Add type annotations for _cos/_sin as Optional[ndarray] - Use hidden_np for numpy array, hidden for GPUArray - Fix return type annotations for __call__ method - Add assertions for _cos/_sin before indexing 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

m96-chan and others added 14 commits December 16, 2025 08:40

m96-chan merged commit 10fc369 into main Dec 16, 2025
13 checks passed

This was referenced Dec 16, 2025

[v0.2.9] feat: General LLM Execution - Attention layer and E2E inference #78

Closed

[v0.2.9] milestone: LLM MVP Support #34

Closed

m96-chan deleted the feature/v0.2.9 branch December 26, 2025 09:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(v0.2.9): Unified LLM Interface with ModelSpec Abstraction#80

feat(v0.2.9): Unified LLM Interface with ModelSpec Abstraction#80
m96-chan merged 14 commits intomainfrom
feature/v0.2.9

m96-chan commented Dec 16, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

m96-chan commented Dec 16, 2025

Summary

New APIs

Tested Models

Breaking Changes

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant