tinygpt is a compact training and evaluation stack for small GPT-style models.
The repository is organized around four opinionated workflows:
- Full training from scratch: tokenizer training, pretraining, SFT, and evaluation.
- Pretraining with the `karpathy/nanochat-d32` tokenizer.
- Distillation from `karpathy/nanochat-d32` into a student trained with the same tokenizer.
- A smoke test for CPU or a small GPU.
| Workflow | Script | What it does |
|---|---|---|
| From scratch | `runs/from_scratch.sh` | Trains a tokenizer, pretrains a base model, runs SFT, then evaluates the result. |
| Nanochat tokenizer pretrain | `runs/pretrain_with_nanochat_d32.sh` | Reuses the `karpathy/nanochat-d32` tokenizer and runs pretraining only. |
| Distillation | `runs/distill_from_nanochat_d32.sh` | Distills from `karpathy/nanochat-d32` into a student checkpoint produced by `pretrain_with_nanochat_d32.sh`. |
| Smoke test | `runs/smoke.sh` | Runs a minimal end-to-end validation path on CPU or a small GPU. |
Run commands from the tinygpt root:

```bash
bash runs/from_scratch.sh
bash runs/pretrain_with_nanochat_d32.sh
bash runs/distill_from_nanochat_d32.sh
bash runs/smoke.sh
```

All generated artifacts and support files are stored under `data/`.
Typical outputs:
- `data/tokenizer_from_scratch`
- `data/tokenizer_nanochat_d32`
- `data/teacher_nanochat_d32`
- `data/tokenizer_smoke`
- `data/pretrain_checkpoints/from_scratch`
- `data/pretrain_checkpoints/pretrain_with_nanochat_d32`
- `data/distill_checkpoints/distill_from_nanochat_d32`
- `data/sft_checkpoints/from_scratch`
- `data/sft_checkpoints/smoke`
- `data/identity_conversations.jsonl`
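After a run finishes, it can be useful to confirm the expected artifacts exist before starting a downstream step. The following is a minimal sketch (not a helper shipped with the repository) that checks a few of the output paths above, assuming it is run from the tinygpt root:

```python
from pathlib import Path

# A few expected artifact locations under data/ (taken from the list above);
# extend with whichever paths your workflow produces.
EXPECTED = [
    "data/tokenizer_from_scratch",
    "data/pretrain_checkpoints/from_scratch",
    "data/sft_checkpoints/from_scratch",
]


def missing_artifacts(root: str = ".") -> list[str]:
    """Return the expected paths that do not yet exist under `root`."""
    return [p for p in EXPECTED if not (Path(root) / p).exists()]


if __name__ == "__main__":
    for path in missing_artifacts():
        print(f"missing: {path}")
```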
The run scripts are intentionally simple. Only a small number of environment overrides are supported:
- `WANDB_RUN`: Weights & Biases run name. If unset, scripts default to `dummy`.
- `NPROC_PER_NODE`: Number of `torchrun` processes per node for GPU workflows.
- `DEVICE_TYPE`: Runtime override for `runs/smoke.sh`, typically `cpu`, `cuda`, or `mps`.
- `TEACHER_DEVICE`: Teacher placement override for `runs/distill_from_nanochat_d32.sh`.
Examples:

```bash
WANDB_RUN=from_scratch_exp bash runs/from_scratch.sh
WANDB_RUN=student_d32 bash runs/pretrain_with_nanochat_d32.sh
WANDB_RUN=distill_d32 TEACHER_DEVICE=cpu bash runs/distill_from_nanochat_d32.sh
DEVICE_TYPE=cpu bash runs/smoke.sh
```

Online KL distillation in this codebase requires tokenizer compatibility between the teacher and the student. In practice, the distillation workflow assumes:
- the teacher is `karpathy/nanochat-d32`
- the student was pretrained with `runs/pretrain_with_nanochat_d32.sh`
If the student uses a different tokenizer or token ID mapping, distillation will fail by design.
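The requirement above can be illustrated with a toy check. This is a sketch only (the repository does not necessarily expose such a helper, and real tokenizers carry more state than a vocabulary dict), but it shows the property online KL relies on: teacher and student logits are compared position by position, so both models must agree on the token-to-ID mapping.

```python
def tokenizers_compatible(teacher_vocab: dict[str, int],
                          student_vocab: dict[str, int]) -> bool:
    """True only if both vocabularies map every token to the same ID."""
    return teacher_vocab == student_vocab


# Example: a single swapped ID is enough to make the KL between the two
# logit distributions meaningless, which is why mismatches fail by design.
teacher = {"<bos>": 0, "hello": 1, "world": 2}
student_same = {"<bos>": 0, "hello": 1, "world": 2}
student_swapped = {"<bos>": 0, "world": 1, "hello": 2}

assert tokenizers_compatible(teacher, student_same)
assert not tokenizers_compatible(teacher, student_swapped)
```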
Primary modules:
- `python -m scripts.train_tokenizer`
- `python -m scripts.pretrain`
- `python -m scripts.finetune`
- `python -m scripts.distill`
- `python -m scripts.evaluate_tokenizer`
- `python -m scripts.evaluate_model`
- `python -m scripts.chat`
Defaults are aligned with the `data/` directory layout used by the run scripts.
Expected baseline:
- Python 3.12+
- `uv` for environment setup
- PyTorch-compatible CPU, CUDA, or MPS runtime

The run scripts create or reuse `.venv` and install dependencies via `uv sync`.