
[v0.2.9] milestone: LLM MVP Support #34

@m96-chan

Description


🚀 PyGPUkit v0.2.6 Milestone — “LLM MVP Support”

🎯 Milestone Summary

PyGPUkit v0.2.6 introduces the first minimal LLM inference capability:

  • Load safetensors model weights
  • Load tokenizer.json
  • Execute a single forward pass (1 token) using PyGPUkit GPU kernels
  • Integrate with Scheduler + MemoryPool
  • Provide a minimal generate() API

This version does not aim to support full transformer execution yet;
it covers only the core pieces required to demonstrate that
"PyGPUkit can run an LLM block end-to-end on GPU".


🔧 Scope of v0.2.6 (Exactly What Will Be Implemented)

1) safetensors Loader (Rust Core)

Features

  • Read *.safetensors files via the Rust safetensors crate
  • Mmap or buffered load
  • Return tensor slices without copying
  • Expose metadata (dtype, shape)

Deliverable

pygpukit_core::llm::tensor_loader::load_safetensors(path) -> TensorMap
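
As a reference for the intended semantics, the sketch below uses the HuggingFace safetensors Python package; the Rust core would hand back zero-copy slices, whereas this sketch materializes NumPy arrays for clarity:

from safetensors import safe_open

def load_safetensors(path: str):
    # Reference behavior: expose dtype/shape metadata alongside tensor data.
    tensors, meta = {}, {}
    with safe_open(path, framework="np") as f:   # mmap-backed reader
        for name in f.keys():
            sl = f.get_slice(name)               # lazy view, nothing copied yet
            meta[name] = {"dtype": sl.get_dtype(), "shape": sl.get_shape()}
            tensors[name] = f.get_tensor(name)   # materialize as a NumPy array
    return tensors, meta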


2) GPU Tensor Allocation Path

Features

  • Allocate GPU buffers for model weights via MemoryPool
  • Async H2D transfers
  • Persistent device pointers

Deliverable

TensorDevice::from_safetensor(cpu_tensor)
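
An illustrative sketch of this path follows; MemoryPool.alloc and AsyncEngine.h2d_async are assumed names for the PyGPUkit internals, not a confirmed API:

class TensorDevice:
    def __init__(self, ptr, dtype, shape):
        # Persistent device pointer plus metadata, released back to the pool later.
        self.ptr, self.dtype, self.shape = ptr, dtype, shape

    @classmethod
    def from_safetensor(cls, cpu_tensor, pool, engine):
        buf = pool.alloc(cpu_tensor.nbytes)   # GPU buffer from the MemoryPool
        engine.h2d_async(buf, cpu_tensor)     # asynchronous host-to-device copy
        return cls(buf, cpu_tensor.dtype, cpu_tensor.shape)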


3) tokenizer.json Reader (Python MVP → optional Rust port)

Features

  • Read HuggingFace tokenizer.json
  • BPE/SentencePiece vocabulary support
  • Expose:
    tokenizer.encode(text) -> ids
    tokenizer.decode(ids) -> text

Deliverable

pygpukit.llm.Tokenizer
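
One plausible MVP implementation wraps the HuggingFace tokenizers package, which already parses tokenizer.json; the class below sketches that approach:

from tokenizers import Tokenizer as HFTokenizer

class Tokenizer:
    def __init__(self, path: str):
        self._tok = HFTokenizer.from_file(path)  # parses tokenizer.json

    def encode(self, text: str) -> list[int]:
        return self._tok.encode(text).ids

    def decode(self, ids: list[int]) -> str:
        return self._tok.decode(ids)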


4) Linear Layer (Rust + CUDA kernel)

Features

  • GPU matmul (64×64 tiled kernel)
  • Add bias
  • Optional activation (GELU)

Deliverable

Linear.forward(x, weight, bias)
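
A NumPy reference of the forward semantics the CUDA kernel must reproduce (the (out_features, in_features) weight layout is an assumption):

import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def linear_forward(x, weight, bias=None, activation=None):
    y = x @ weight.T    # the GPU kernel computes this matmul with 64x64 tiles
    if bias is not None:
        y = y + bias
    return gelu(y) if activation == "gelu" else y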


5) LayerNorm Kernel

Features

  • Mean/variance reduce
  • Warp-level reduction
  • Single-pass normalization
  • Fused add+norm optional

Deliverable

LayerNorm.forward(x, gamma, beta)
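
A NumPy reference for the kernel's output; the warp-level reduction is a GPU implementation detail that does not change the math, and eps=1e-5 is an assumed default:

import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5, residual=None):
    if residual is not None:              # optional fused add+norm
        x = x + residual
    mu = x.mean(axis=-1, keepdims=True)   # reduce mean over the last axis
    var = x.var(axis=-1, keepdims=True)   # reduce variance over the last axis
    return gamma * (x - mu) / np.sqrt(var + eps) + beta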


6) Forward Pass for One Transformer Block

Supported operations:

x → LayerNorm → Linear → GELU → Linear → residual

Deliverable

TransformerBlock.forward(x, weights)

※ Attention is best deferred to v0.2.7
(reason: rotary embeddings / kv-cache / softmax are heavyweight)
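
Composing the two kernels gives a reference for the attention-free block; the weight key names are illustrative, and linear_forward / layernorm_forward come from the sketches above:

def transformer_block_forward(x, w):
    # x -> LayerNorm -> Linear -> GELU -> Linear -> residual
    h = layernorm_forward(x, w["ln_gamma"], w["ln_beta"])
    h = linear_forward(h, w["fc1_weight"], w["fc1_bias"], activation="gelu")
    h = linear_forward(h, w["fc2_weight"], w["fc2_bias"])
    return x + h    # residual connection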


7) Minimal LLM Runtime API (Python)

Goal

After loading a model, generate exactly one token.

Deliverable

from pygpukit.llm import LLM

llm = LLM.from_pretrained("model/")
ids = llm.generate("Hello", max_new_tokens=1)

Internal flow:

  • tokenizer.encode
  • embedding → block.forward → lm_head
  • top-k or argmax sampler
  • tokenizer.decode
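
A sketch of how those four steps could compose for one token; attribute names such as llm.embedding and llm.lm_head are illustrative, with argmax standing in for the sampler:

import numpy as np

def generate_one_token(llm, prompt: str) -> str:
    ids = llm.tokenizer.encode(prompt)
    x = llm.embedding[ids]                   # token embedding lookup
    x = llm.block.forward(x, llm.weights)    # single transformer block
    logits = x[-1] @ llm.lm_head.T           # project last position onto the vocab
    next_id = int(np.argmax(logits))         # greedy (argmax) sampling
    return llm.tokenizer.decode(ids + [next_id])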

8) Scheduler Integration

  • Each kernel dispatch uses Rust KernelDispatchController
  • MemoryPool controls all tensors
  • Async H2D/D2H transfers via AsyncEngine
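
The wiring below is illustrative only; KernelDispatchController.dispatch and AsyncEngine.synchronize are assumed interfaces, not a confirmed API:

def dispatch_linear(ctrl, pool, engine, x_dev, w_dev, b_dev):
    # Every launch goes through the dispatch controller, and every buffer
    # comes from the MemoryPool.
    out = pool.alloc(x_dev.shape[0] * w_dev.shape[0] * 4)   # fp32 output buffer
    ctrl.dispatch("linear_forward", inputs=[x_dev, w_dev, b_dev], output=out)
    engine.synchronize()    # drain async H2D/D2H work before reading back
    return out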

📦 Out of Scope (Next Milestone = v0.2.7)

  • Multi-head Attention
  • Rotary Embedding
  • KV-Cache
  • Softmax kernel
  • Full autoregressive generation loop
  • Batch inference
  • TensorCore (TF32 / FP16) optimization
  • Quantization (4bit/8bit)

📅 Recommended Sub-Milestones Inside v0.2.6

v0.2.6-a: safetensors loader + GPU tensor storage
v0.2.6-b: tokenizer.json MVP
v0.2.6-c: Linear + LayerNorm kernels
v0.2.6-d: TransformerBlock (no attention)
v0.2.6-e: LLM.generate() MVP
v0.2.6-f: Scheduler integration + benchmarks
v0.2.6-final: Documentation + examples
