⚠️ Work in Progress — This book is actively being written. Chapters may change.
Learn How LLM Inference Works at Scale — From the Inside Out.
You send a prompt to ChatGPT, Claude, or a local model. Tokens stream back one by one. Behind the scenes, an inference engine is managing memory, batching requests, and deciding what to compute next.
Most developers never look inside that machine. This book opens it up — and optionally guides you through building one from scratch. These are the same core ideas behind vLLM, SGLang, TGI, and every production inference engine. Read to learn, or build to internalize — either way, you'll understand every layer.
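At its simplest, that machine is a loop: run the model, pick the next token, append it, repeat. Below is a minimal sketch of that naive loop, with no KV cache and no batching; it is exactly the baseline the rest of the book improves on. It assumes the Hugging Face transformers library and GPT-2, which are this sketch's tooling choices, not requirements of the book.

```python
# A minimal sketch of naive greedy decoding (assumes: pip install torch transformers).
# One forward pass per generated token, recomputing the entire prefix each time.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tokenizer.encode("What is AI?", return_tensors="pt")  # [[2061, 318, 9552, 30]]
for _ in range(20):  # generate 20 tokens
    with torch.no_grad():
        logits = model(ids).logits                 # shape: (1, seq_len, vocab_size)
    next_id = logits[0, -1].argmax().item()        # greedy: most likely next token
    ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)
    print(tokenizer.decode([next_id]), end="", flush=True)
print()
```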
You can use this book two ways:

- Read only. Start at Chapter 1. Every concept is explained with diagrams, traces, and full output — no setup required to follow along.
- Build as you read. Start at Chapter 0 to set up your environment. Each chapter has a spec and validation tests. You build, you run, you see it work. Specs are language-agnostic — bring Rust, Python, Go, or whatever you prefer.
| Ch | Title | What You Build |
|---|---|---|
| -- | Preface | Why this book exists, the roadmap, running example |
| 00 | Setup (Optional) | Prerequisites, workflow, validation harness |
| 01 | The LLM Inference Problem | Why inference is hard — latency, memory, throughput |
| 02 | vLLM Architecture Overview | Mental model of the full system |
| 03 | Setting Up the Project | Project structure, dependencies, smoke test |
| 04 | GPT-2 from Scratch | Load and run a real model end-to-end |
| 05 | The Building Blocks | Layers, attention, MLP — the Transformer stack |
| 06 | Where the Model Learns to Look Back | KV cache for autoregressive generation |
| 07 | The Skeleton Speaks | First working generation — prompt in, text out |
| 08 | Fit and Finish | Tokenizer integration, clean output, greedy decode |
| 09 | The Memory Problem | Why naive KV caching breaks at scale |
| 10 | Paged Attention | Block-based KV cache — virtual memory for attention |
| 11 | Continuous Batching | Serve multiple requests without wasting compute |
| 12 | The Scheduler | Priority, preemption, fairness across requests |
| 13 | The Engine Loop | Orchestrating scheduler → model → output |
| 14 | Sampling Strategies | Temperature, top-k, top-p, repetition penalty |
| 15 | Building the API Server | OpenAI-compatible HTTP + streaming |
| 16 | Prefix Caching | Reuse KV blocks across prompts with shared prefixes |
| 17 | Speculative Decoding | Draft model + verification for faster generation |
| 18 | Structured Output | Constrained decoding — JSON, grammar, schema |
| 19 | Parallelism | Tensor and pipeline parallelism across devices |
| 20 | Where to Go from Here | Research landscape, open problems, next steps |
Throughout the book, we trace a single prompt through the system:
"What is AI?" → token IDs:
[2061, 318, 9552, 30]
These four tokens flow through tokenizers, embedding tables, attention heads, KV caches, block tables, and schedulers. By the end, you'll know every step of their journey.
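You can reproduce those IDs yourself. One way, assuming the Hugging Face transformers library (a tooling choice for this sketch, not something the book requires):

```python
# Verify the running example's token IDs with the GPT-2 BPE tokenizer.
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
print(tok.encode("What is AI?"))           # [2061, 318, 9552, 30]
print(tok.decode([2061, 318, 9552, 30]))   # What is AI?
```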
- Read the chapter — understand the concept and why it matters.
- Read the spec — each chapter has a spec in `spec/chNN/` with interface contracts and expected behavior.
- Build it — implement the spec in your language. No code to copy. The spec is all you need.
- Validate — run the validation tests in `spec/chNN/validation/` to confirm your implementation works (see the example run below).
- Move on — each chapter builds on the last. Your inference engine grows incrementally.
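For example, validating a Chapter 4 build could look like this (a hypothetical invocation; each chapter's spec gives the exact command):

```
pytest spec/ch04/validation/test_ch04.py -v
```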
Using Claude Code? Install The Builder's Handbook (TBH) plugin for a guided build-along experience — specs, hints, validation, and progress tracking right inside your terminal:

```
/plugin marketplace add tbhbooks/tbh-skill
/plugin install tbh@the-builders-handbook
/tbh:setup
```

Each chapter's spec lives in `spec/chNN/`:

```
spec/chNN/
├── prompt-template.md      What to implement (language-agnostic)
├── interface-spec.md       API contracts and types
├── expected-output.txt     What the program should produce
├── component-diagram.md    Architecture diagram
├── sequence-diagram.md     Data flow diagram
└── validation/
    └── test_chNN.py        Automated tests your code must pass
```
- A programming language you're comfortable with
- Python 3.10+ with pytest (for validation tests)
- For Chapter 4+: a machine that can run GPT-2 (CPU works, GPU faster)
- Curiosity about how LLM inference actually works
Copyright (c) 2026 Rushit Patel. All rights reserved. See LICENSE.
"tbh, the spec is all you need."