Skip to content

feat(v0.2.9): Unified LLM Interface with ModelSpec Abstraction#80

Merged
m96-chan merged 14 commits intomainfrom
feature/v0.2.9
Dec 16, 2025
Merged

feat(v0.2.9): Unified LLM Interface with ModelSpec Abstraction#80
m96-chan merged 14 commits intomainfrom
feature/v0.2.9

Conversation

@m96-chan
Copy link
Copy Markdown
Owner

Summary

  • Unified LLM Interface: Single CausalTransformerModel supports GPT-2, LLaMA 2/3, and Qwen3
  • ModelSpec Abstraction: Architecture differences (norm type, activation, RoPE, QK-norm) handled declaratively
  • Auto-detection: detect_model_spec() identifies model type from tensor names
  • Hybrid Attention: CPU decode (seq_len=1) + GPU prefill for optimal performance
  • KV-Cache Generation: Efficient autoregressive text generation with model.generate()
  • Sharded Model Support: Load large models split across multiple safetensors files
  • FP16/FP32 Selection: Choose dtype at load time

New APIs

API Description
load_model_from_safetensors(path, dtype, spec) Unified model loader
detect_model_spec(tensor_names) Auto-detect GPT2/LLaMA/Qwen3
model.generate(ids, max_new_tokens, ...) KV-cache generation
gpk.sdpa_causal(q, k, v) Scaled Dot-Product Attention
gpk.rope_inplace(x, freqs) Rotary Position Embedding
gpk.silu(x) SiLU activation
gpk.rmsnorm(x, w, eps) RMS LayerNorm

Tested Models

Model Size Status
GPT-2 124M ✅ 8.7 tok/s
TinyLlama-1.1B 1.1B ✅ 1.8 tok/s (FP16)
Qwen3-8B 8B ✅ 0.2 tok/s (FP16)

Breaking Changes

None. Legacy aliases (GPT2Model, LlamaModel, etc.) still work.

Test plan

  • pytest tests/test_llm_unified.py -v (9 tests pass)
  • GPT-2 E2E demo with generation
  • TinyLlama-1.1B E2E demo with FP16
  • Qwen3-8B E2E demo with HuggingFace tokenizers

🤖 Generated with Claude Code

m96-chan and others added 14 commits December 16, 2025 08:40
- Create scripts/ directory for development tools
- Move benchmark*.py to scripts/
- Move build_cuda*.bat, compile_dump.bat to scripts/
- Move dump_*.cu debug tools to scripts/
- Move demo_scheduler_log.py to examples/
- Delete redundant TechStack.md (info in CLAUDE.md)
- Update README.md Project Structure section

Root directory now contains only essential project files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
## New Features
- **Softmax GPU kernel**: Row-wise softmax with numerical stability
- **CausalSelfAttention**: Multi-head causal self-attention for GPT-2
- **Full TransformerBlock**: ln_1 -> attention -> residual -> ln_2 -> mlp -> residual

## Changes
- `softmax()` operation added to ops API (native GPU + CPU fallback)
- `CausalSelfAttention` class with QKV projection and causal masking
- `TransformerBlock` updated to support attention (backward compatible)
- `load_gpt2_from_safetensors()` now loads attention weights by default

## API
- `gpk.softmax(input)` - Row-wise softmax
- `gpk.llm.CausalSelfAttention` - Attention module
- `load_gpt2_from_safetensors(path, load_attention=True)` - Full model loading

## Architecture Support
- GPT-2 E2E inference now possible
- GPT-2/GPT-Neo/LLaMA-style architectures supported

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Unify GPT-2 and LLaMA into common Transformer abstraction
  - TransformerConfig: vocab_size, hidden_size, num_layers, num_heads,
    num_kv_heads, norm_type, activation, use_rope, causal
  - CausalTransformerModel with generate() method
  - Attention, MLP, Norm, TransformerBlock classes
  - Legacy aliases preserved for backward compatibility

- Hybrid Attention execution
  - GPU SDPA for prefill (seq_len > 1)
  - CPU numpy for decode (seq_len = 1) to minimize kernel overhead

- New GPU tensor ops (CUDA kernels)
  - concat_axis0, repeat_interleave_axis1, transpose_3d_021, reshape_copy
  - Required for GQA KV head expansion

- Add E2E demo (examples/demo_llm_e2e.py)

Benchmark (RTX 3090 Ti):
  GPT-2 (124M): 11.2 tok/s decode, 89.6 ms/token
  TinyLlama (1.1B): 5.3 tok/s decode, 188.2 ms/token

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- ShardedSafeTensorsFile: lazy-load sharded safetensors models
  - load_safetensors() auto-detects .index.json
  - Opens shards on-demand, not all at once

- QK Norm support in Attention class
  - For Qwen3 style models with Q/K normalization
  - Reshape 3D->2D for norm, then back to 3D

- Qwen3Config and load_qwen3_from_safetensors()
  - head_dim=128, rope_theta=1e6
  - Auto-detect config from tensor shapes

Note: Qwen3-8B requires ~32GB VRAM at FP32, needs FP16 support

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add dtype parameter to load_qwen3_from_safetensors (float32/float16)
- Fix RoPE to handle FP16: convert to FP32 for computation, back to FP16
- Fix SDPA dtype preservation for KV cache
- Fix _forward_cpu QK Norm dtype preservation
- Fix o_proj output dtype in _forward_cpu
- Add FP16 fallback for reshape_copy (native only supports FP32)
- Add FP16 fallback for transpose_3d_021 (native only supports FP32)
- Fix unused variable kv_len in sdpa_causal

Tested: Qwen3-8B (16.4GB) FP16 inference fits in 24GB VRAM
- Forward pass: 2297ms for 1 token
- Generation: 16 tokens in 73.9s (0.2 tok/s with CPU fallbacks)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ape_copy

- Add transpose_021_f16_kernel and transpose_021_bf16_kernel
- Add copy_f16_kernel and copy_bf16_kernel for reshape
- Update nn.cu dispatch to use new FP16/BF16 kernels
- Update Python ops/basic.py to route FP16/BF16 to native kernels

Previously these operations fell back to CPU for FP16, causing slow
inference. Now they run natively on GPU.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Introduces ModelSpec data structure to unify model differences:
- ModelSpec: frozen dataclass with weight patterns and arch flags
- GPT2_SPEC: LayerNorm, GELU, combined QKV, position embeddings
- LLAMA_SPEC: RMSNorm, SiLU, RoPE, GQA
- QWEN3_SPEC: RMSNorm, SiLU, RoPE, GQA, QK Norm

New generic loader:
- load_model_from_safetensors(): auto-detects model type
- detect_model_spec(): detects from tensor names
- MODEL_SPECS registry for model type lookup

Existing loaders preserved unchanged for backward compatibility:
- load_gpt2_from_safetensors()
- load_llama_from_safetensors()
- load_qwen3_from_safetensors()

This is a structural refactor only - no behavior changes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Convert load_gpt2_from_safetensors, load_llama_from_safetensors, and
load_qwen3_from_safetensors to thin wrappers that delegate to the
generic load_model_from_safetensors function with appropriate ModelSpec.

This completes the ModelSpec abstraction refactor:
- All three loaders now use load_model_from_safetensors internally
- Backward compatibility preserved via legacy model class wrappers
- Config parameters are now ignored (auto-detected from tensor shapes)
- ~150 lines of duplicated code removed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Final ModelSpec cleanup for v0.2.9:

- CausalTransformerModel is now the ONLY runtime model
- Add `spec` attribute to store ModelSpec used for loading
- GPT2Model, LlamaModel are now simple type aliases
- RMSNorm, LayerNorm, etc. are simple aliases to Norm
- Remove all legacy wrapper classes with constructors
- Simplify loaders to direct load_model_from_safetensors calls
- Remove redundant config parameters from loaders

Code reduction: 188 deletions, 48 insertions (-140 net lines)

All model-specific behavior is now controlled via:
- model.spec.use_rope
- model.spec.use_qk_norm
- model.spec.activation
- model.spec.norm_type

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add comprehensive tests for the ModelSpec refactor:
- Type aliases (GPT2Model = LlamaModel = CausalTransformerModel)
- Component aliases (RMSNorm = Norm, CausalSelfAttention = Attention)
- ModelSpec instances and registry
- Automatic model detection from tensor names
- Simplified loader signatures (no config parameter)
- CausalTransformerModel spec attribute
- All expected exports from pygpukit.llm

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use load_model_from_safetensors with detect_model_spec
- Add --dtype argument for FP16/FP32 selection
- Show ModelSpec info in detection output
- Display model.spec attribute after loading

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add tokenizer policy section: PyGPUkit delegates tokenization to HuggingFace
- Mark pygpukit.llm.Tokenizer as EXPERIMENTAL with detailed docstring
- Update README LLM section with unified interface examples
- Add v0.2.9 to roadmap (unified LLM interface, ModelSpec abstraction)
- Update API stability table with CausalTransformerModel

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Unified LLM Interface (CausalTransformerModel + ModelSpec)
- Multi-architecture support: GPT-2, LLaMA 2/3, Qwen3
- Hybrid attention execution (CPU decode / GPU prefill)
- New LLM operations: sdpa_causal, rope_inplace, silu, rmsnorm
- Sharded model support for large models
- Updated documentation table

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add assertions for num_kv_heads (set in __post_init__)
- Add type annotations for _cos/_sin as Optional[ndarray]
- Use hidden_np for numpy array, hidden for GPUArray
- Fix return type annotations for __call__ method
- Add assertions for _cos/_sin before indexing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@m96-chan m96-chan merged commit 10fc369 into main Dec 16, 2025
13 checks passed
@m96-chan m96-chan deleted the feature/v0.2.9 branch December 26, 2025 09:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant