m96-chan · Dec 30, 2025
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -35,34 +35,75 @@ The core scheduling, memory management, GPU coordination, and performance-critic
 ```
 PyGPUkit/
 ├── src/pygpukit/           # Python API (NumPy-compatible)
-│   ├── core/               # GPUArray, backend abstraction
-│   ├── ops/                # GPU operations (matmul, nn, audio, etc.)
-│   ├── llm/                # LLM inference (Qwen, LLaMA)
+│   ├── core/               # Core abstractions
+│   │   ├── array.py        # GPUArray implementation
+│   │   ├── backend.py      # Backend detection/initialization
+│   │   ├── memory.py       # Memory utilities (copy, sync)
+│   │   └── stream.py       # CUDA Stream wrapper
+│   ├── ops/                # GPU operations (modular packages)
+│   │   ├── matmul/         # Matrix multiplication
+│   │   │   ├── gemm/       # GEMM operations (M > 1)
+│   │   │   └── gemv/       # GEMV operations (M = 1)
+│   │   ├── nn/             # Neural network ops
+│   │   │   ├── activation.py   # GELU, SiLU, etc.
+│   │   │   ├── attention.py    # SDPA, paged attention
+│   │   │   ├── norm.py         # RMSNorm, LayerNorm
+│   │   │   └── rope.py         # Rotary position embedding
+│   │   └── audio/          # Audio processing
+│   │       ├── transforms/ # FFT, Mel spectrogram
+│   │       └── analysis/   # Pitch, onset detection
+│   ├── llm/                # LLM inference (modular)
 │   │   ├── models/         # Model implementations
-│   │   └── sampling/       # Token sampling strategies
-│   └── asr/                # Speech recognition (Whisper)
-│       ├── preprocessing.py    # Audio preprocessing (mel, normalize)
-│       └── whisper/            # Whisper model implementation
-│           ├── config.py       # WhisperConfig
-│           ├── loader.py       # SafeTensors loader
-│           ├── encoder.py      # Whisper encoder
-│           ├── decoder.py      # Whisper decoder
-│           └── model.py        # WhisperModel high-level API
+│   │   │   └── causal_transformer.py
+│   │   ├── layers/         # Layer types
+│   │   │   ├── attention.py    # Multi-head attention
+│   │   │   ├── ffn.py          # Feed-forward networks
+│   │   │   ├── norm.py         # Normalization layers
+│   │   │   ├── embedding.py    # Token/position embeddings
+│   │   │   └── recurrent.py    # LSTM, Mamba
+│   │   ├── decode/         # Decoding strategies
+│   │   ├── loader/         # Model loading
+│   │   │   ├── safetensors.py  # SafeTensors loader
+│   │   │   └── tokenizer.py    # Tokenizer wrapper
+│   │   └── quantization/   # Quantization utilities
+│   │       ├── config.py       # Quant configs
+│   │       └── repack.py       # Weight repacking
+│   ├── asr/                # Speech recognition (Whisper)
+│   │   └── whisper/        # Whisper model implementation
+│   └── tts/                # Text-to-speech (Kokoro)
+│       └── kokoro/         # Kokoro TTS model
 ├── native/
 │   ├── core/               # C++ (CUDA Runtime/Driver API)
 │   ├── jit/                # C++ (NVRTC)
 │   ├── ops/                # C++ (CUDA kernels)
-│   │   └── matmul/         # MatMul kernels (see below)
-│   └── bindings/           # pybind11
+│   │   ├── matmul/         # MatMul kernels (see below)
+│   │   │   ├── matmul.cu       # Main dispatcher
+│   │   │   ├── fused.cu        # Fused ops (linear+bias+GELU)
+│   │   │   └── batched.cu      # Batched GEMM
+│   │   ├── nn/             # Neural network ops
+│   │   │   ├── activation/ # Activation functions
+│   │   │   ├── attention/  # Attention kernels
+│   │   │   ├── norm/       # Normalization kernels
+│   │   │   ├── rope/       # RoPE kernels
+│   │   │   └── recurrent/  # LSTM/Mamba kernels
+│   │   └── audio/          # Audio processing kernels
+│   └── bindings/           # pybind11 (modular)
+│       ├── gemm/           # GEMM bindings by dtype
+│       ├── gemv/           # GEMV bindings by dtype
+│       └── nn/             # NN operation bindings
 ├── rust/
 │   ├── pygpukit-core/      # Pure Rust GPU runtime
 │   │   └── src/
 │   │       ├── memory/     # MemoryPool, LRU, size-class allocator
 │   │       ├── scheduler/  # Task state machine, QoS policies
 │   │       └── device.rs   # DeviceCapabilities, KernelType
 │   └── pygpukit-python/    # PyO3 bindings
-├── examples/
-├── benchmarks/             # Performance benchmarks
+├── examples/               # Example scripts (organized)
+│   ├── benchmarks/         # Performance benchmarks
+│   ├── chat/               # Chat CLI applications
+│   ├── demos/              # Feature demos
+│   │   └── archived/       # Version-specific demos (historical)
+│   └── demo_*.py           # Current feature demos
 └── tests/
 ```
 

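Review note: the new `ops/matmul/` layout splits kernels into `gemm/` (M > 1) and `gemv/` (M = 1). A minimal sketch of how a caller might pick between the two from the operand shapes is shown below. The names `select_matmul_path` and `plan_matmul` are hypothetical and are not taken from PyGPUkit; the real dispatcher in `matmul.cu` may use additional criteria (dtype, architecture) that this diff does not show.

```python
def select_matmul_path(m: int) -> str:
    """Pick the subpackage for an (M, K) @ (K, N) product.

    Hypothetical helper: mirrors only the M = 1 / M > 1 split
    described in the directory tree comments.
    """
    if m < 1:
        raise ValueError(f"M must be >= 1, got {m}")
    return "gemv" if m == 1 else "gemm"


def plan_matmul(a_shape: tuple, b_shape: tuple) -> tuple:
    """Validate 2-D operand shapes and return (path, output_shape)."""
    m, k = a_shape
    k_b, n = b_shape
    if k != k_b:
        raise ValueError(f"inner dimensions differ: {k} vs {k_b}")
    return select_matmul_path(m), (m, n)


if __name__ == "__main__":
    # Single-token decode step: one row of activations times a weight matrix.
    print(plan_matmul((1, 4096), (4096, 4096)))
    # Prefill over 32 tokens: a full matrix-matrix product.
    print(plan_matmul((32, 4096), (4096, 4096)))
```

In LLM inference the M = 1 case typically corresponds to single-token decoding, which is presumably why it gets its own package.
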
diff --git a/examples/README.md b/examples/README.md
@@ -1,43 +1,87 @@
 # PyGPUkit Examples
 
+## Directory Structure
+
+```
+examples/
+├── benchmarks/           # Performance benchmarks
+├── chat/                 # Chat CLI applications
+├── demos/archived/       # Version-specific demos (historical)
+├── demo_*.py             # Current feature demos
+├── tts.py                # Text-to-speech example
+└── whisper_realtime_stt.py  # Speech-to-text example
+```
+
 ## Requirements
 
-- NVIDIA GPU with CUDA support
-- CUDA Toolkit 12.x
+- NVIDIA GPU with SM >= 80 (Ampere or newer)
+- CUDA Toolkit 12.x or 13.x
 - Built native module (`_pygpukit_native`)
 
-## Examples
+## Quick Start
 
-### demo_gpu.py
-Basic GPU operations demo using the native C++ backend directly.
+### Chat CLI
 
 ```bash
+# Standard chat (Qwen)
+python examples/chat/chat_cli.py
+
+# With Triton backend
+python examples/chat/chat_cli_triton.py
+
+# MoE models (Qwen3)
+python examples/chat/chat_cli_moe.py
+
+# Thinking mode (Qwen3-8B-Thinking)
+python examples/chat/chat_cli_thinking.py
+```
+
+### Demos
+
+```bash
+# Basic GPU operations
 python examples/demo_gpu.py
+
+# CUDA Graph for LLM inference
+python examples/demo_cuda_graph.py
+
+# End-to-end LLM demo
+python examples/demo_llm_e2e.py
+
+# Qwen3 model demo
+python examples/demo_qwen3.py
 ```
 
-### demo_optimized.py
-Performance comparison showing zero-copy optimizations.
+### Benchmarks
 
 ```bash
-python examples/demo_optimized.py
+# Matrix multiplication benchmark
+python examples/benchmarks/benchmark_matmul.py
+
+# CUDA Graph LLM benchmark
+python examples/benchmarks/bench_cuda_graph_llm.py
+
+# Compare with cuBLAS
+python examples/benchmarks/benchmark_compare.py
 ```
 
-### demo_v01.py
-Simple v0.1 feature demonstration (CPU simulation fallback).
+### Speech/Audio
 
 ```bash
-python examples/demo_v01.py
+# Text-to-speech (Kokoro)
+python examples/tts.py
+
+# Real-time speech-to-text (Whisper)
+python examples/whisper_realtime_stt.py
 ```
 
 ## Building Native Module
 
 ```bash
-cd native
-mkdir build && cd build
-cmake .. -DCMAKE_BUILD_TYPE=Release
-cmake --build . --config Release
-```
+# From project root using build script
+./build.sh 86      # RTX 3090 Ti
+./build.sh 120a    # RTX 5090
 
-Copy the built module to `src/pygpukit/`:
-- Linux: `_pygpukit_native.cpython-3xx-x86_64-linux-gnu.so`
-- Windows: `_pygpukit_native.cp3xx-win_amd64.pyd`
+# Or manually with pip
+pip install -e . -v
+```
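
Review note: the updated README requires SM >= 80 and passes architecture strings such as `86` and `120a` to `build.sh`. A small sketch of how such a string could be checked against that minimum before building follows. `parse_sm` and `is_supported` are hypothetical helpers, and the assumption that an architecture string is digits plus at most one lowercase suffix letter is inferred from the two examples above, not from `build.sh` itself.

```python
import re

# Minimum from the Requirements section: SM >= 80 (Ampere or newer).
MIN_SM = 80


def parse_sm(arch: str) -> int:
    """Parse an architecture string such as "86" or "120a" into its SM number.

    Assumption: digits followed by at most one lowercase suffix letter.
    """
    match = re.fullmatch(r"(\d+)[a-z]?", arch.strip())
    if match is None:
        raise ValueError(f"unrecognized SM architecture: {arch!r}")
    return int(match.group(1))


def is_supported(arch: str) -> bool:
    """Return True if the architecture meets the documented minimum."""
    return parse_sm(arch) >= MIN_SM


if __name__ == "__main__":
    for arch in ("86", "120a", "75"):
        status = "ok" if is_supported(arch) else "below SM 80"
        print(f"{arch}: {status}")
```
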
diff --git a/examples/bench_cuda_graph_llm.py b/examples/benchmarks/bench_cuda_graph_llm.py
diff --git a/examples/benchmark_compare.py b/examples/benchmarks/benchmark_compare.py
diff --git a/examples/benchmark_large.py b/examples/benchmarks/benchmark_large.py
diff --git a/examples/benchmark_matmul.py b/examples/benchmarks/benchmark_matmul.py
diff --git a/examples/benchmark_tiled_matmul.py b/...ples/benchmarks/benchmark_tiled_matmul.py
diff --git a/examples/chat_cli.py b/examples/chat/chat_cli.py
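
Review note: scripts or docs that still reference the old flat `examples/` paths will need updating after these renames. The sketch below maps an old path to its new location. The function name `new_example_path` is hypothetical, and the prefix rules (`bench_` / `benchmark_` to `benchmarks/`, `chat_cli` to `chat/`) are inferred only from the renames listed in this diff; files moved to `demos/archived/` are not covered because their renames are not shown here.

```python
from pathlib import PurePosixPath


def new_example_path(old: str) -> str:
    """Map a pre-reorganization examples/ path to its new location.

    Rules are inferred from the renames in this PR and may be incomplete.
    Paths that match no rule are returned unchanged.
    """
    path = PurePosixPath(old)
    name = path.name
    if name.startswith(("bench_", "benchmark_")):
        return str(path.parent / "benchmarks" / name)
    if name.startswith("chat_cli"):
        return str(path.parent / "chat" / name)
    return old


if __name__ == "__main__":
    for old in (
        "examples/bench_cuda_graph_llm.py",
        "examples/benchmark_matmul.py",
        "examples/chat_cli.py",
        "examples/demo_gpu.py",
    ):
        print(f"{old} -> {new_example_path(old)}")
```
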