A minimalist yet performant GPT implementation built with PyTorch. Train or fine‑tune Transformer language models on a single GPU in just a few hundred lines of code.
| Capability | Details |
|---|---|
| Tiny Code Footprint | Core GPT model fits in ≈ 250 LOC. A clear, instructive implementation that maps one‑to‑one onto the underlying equations (a minimal sketch follows this table). |
| One‑Command Training | Opinionated scripts (fineweb.py, hellaswag.py, train_gpt2.py) abstract away boilerplate—just point to a text file or Hugging Face dataset and go. |
| Single‑GPU Friendly | Out‑of‑the‑box configs run on a consumer‑grade GPU (≥8 GB VRAM) thanks to gradient checkpointing & mixed precision. |
| Pluggable Tokenizers | Works with byte‑level BPE, SentencePiece, or any transformers tokenizer. |
| Research Ready | Explicit hooks for custom loss functions, curriculum schedules, or architectural tweaks. |
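To give a feel for what those ~250 lines contain, here is a minimal sketch of a pre‑norm Transformer block in PyTorch. It is illustrative only: the class name, the use of nn.MultiheadAttention, and the default hyperparameters are assumptions for this sketch, not excerpts from train_gpt2.py.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One pre-norm Transformer block: LayerNorm -> causal self-attention, LayerNorm -> MLP."""
    def __init__(self, dim=768, n_head=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        T = x.size(1)
        # Causal mask: True entries mark positions a token is NOT allowed to attend to (the future).
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out                 # residual connection around attention
        x = x + self.mlp(self.ln2(x))    # residual connection around the MLP
        return x

x = torch.randn(2, 16, 768)              # (batch, sequence, embedding)
print(Block()(x).shape)                   # torch.Size([2, 16, 768])
```

A full GPT stacks n_layer such blocks between token/position embeddings and a final language‑model head; the rest of the line count is training plumbing.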
```
.
├── fineweb.py      # Train from scratch on the FineWeb corpus
├── hellaswag.py    # Few-shot / zero-shot tune on HellaSwag (commonsense QA)
├── train_gpt2.py   # Re-implement the original GPT-2 small config
├── play.ipynb      # Notebook playground for rapid experiments
├── input.txt       # Tiny demo corpus (Shakespeare)
├── README.md       # You are here
└── ...
```
Prereqs: Python ≥ 3.10 · PyTorch ≥ 2.1 (CUDA 11+) · (optional) transformers, datasets
```bash
# 1. Clone & install
$ git clone https://github.com/<your-org>/Nano_GPT.git && cd Nano_GPT
$ pip install -r requirements.txt   # lightweight: torch, tqdm, sentencepiece, transformers

# 2. Sanity check on the toy corpus
$ python train_gpt2.py --dataset input.txt --max-iter 500 --eval-interval 50

# 3. Generate text
$ python train_gpt2.py --generate "To be, or not to be" --checkpoint ckpt/latest.pt
```

| Script | Dataset | Prep Utility |
|---|---|---|
| fineweb.py | FineWeb / Common Crawl | python tools/prepare_fineweb.py |
| hellaswag.py | HellaSwag | Auto-downloads via datasets API |
| Custom text | Any UTF-8 .txt | python tools/txt2bin.py --vocab-size 50_000 |
All preprocessing pipelines yield a binary .bin file of token IDs for maximal I/O throughput.
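For illustration, here is roughly how such a token‑ID .bin file can be memory‑mapped and sliced into input/target batches. The uint16 dtype and the train.bin path are assumptions about what the prep utilities above write, not guarantees:

```python
import numpy as np
import torch

def get_batch(path="data/fineweb/train.bin", batch_size=4, block_size=1024):
    # Memory-map the token IDs so the full corpus never has to fit in RAM.
    data = np.memmap(path, dtype=np.uint16, mode="r")
    # Pick random offsets, then slice contiguous windows of block_size tokens.
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    # Targets are the same windows shifted one position right (next-token prediction).
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y   # each of shape (batch_size, block_size)
```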
```bash
python fineweb.py \
  --model_dim 768 \
  --n_layer 12 \
  --n_head 12 \
  --batch_size 4 \
  --block_size 1024 \
  --epochs 3

python hellaswag.py --lr 5e-5 --epochs 5

python train_gpt2.py --config configs/gpt2_small.yaml
```

```bash
# Perplexity on validation split
$ python eval.py --checkpoint ckpt/latest.pt --val data/fineweb/val.bin
# Interactive generation
$ python generate.py --checkpoint ckpt/best.pt --prompt "The meaning of life is"
```

| Model | Corpus | PPL ↓ | Training Time* |
|---|---|---|---|
| GPT‑2‑small (ours) | FineWeb 1B | 23.7 | 7 h RTX 4090 |
| GPT‑2‑small (ref) | OpenAI WebText | 29.5 | — |
\* Mixed precision (bf16). Your mileage may vary.
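The interactive generation command above boils down to an autoregressive sampling loop. Here is a minimal sketch of such a loop, assuming the model returns (batch, time, vocab) logits; the temperature and top‑k values are illustrative, not the actual defaults or flags of generate.py:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample(model, idx, max_new_tokens=50, temperature=0.8, top_k=50, block_size=1024):
    """idx: (1, T) tensor of prompt token IDs; returns the prompt plus generated IDs."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                      # crop to the model's context window
        logits = model(idx_cond)[:, -1, :] / temperature     # logits for the last position
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float("inf")      # keep only the top-k candidates
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)    # sample one token
        idx = torch.cat([idx, next_id], dim=1)
    return idx
```

Decoding the returned IDs back to text goes through whatever tokenizer the checkpoint was trained with.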
| Symptom | Likely Cause | Fix |
|---|---|---|
| CUDA OOM | Batch size too large | Lower batch_size or enable --grad-ckpt |
| Diverging loss | Learning rate too high | Try a cosine scheduler with warmup steps (sketched below), or lower --lr |
| Slow dataloading | HDD bottleneck | Store .bin on SSD or ramdisk |
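For the diverging‑loss case, the usual remedy is a linear warmup into a cosine decay. A minimal sketch using PyTorch's LambdaLR; the warmup and total step counts here are placeholders, not values shipped in this repo's configs:

```python
import math
import torch

def cosine_with_warmup(optimizer, warmup_steps=200, total_steps=10_000, min_ratio=0.1):
    """Linear warmup to the base LR, then cosine decay down to min_ratio * base LR."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
        return min_ratio + (1 - min_ratio) * 0.5 * (1 + math.cos(math.pi * progress))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Usage: create it once, then call sched.step() after every optimizer.step().
model = torch.nn.Linear(8, 8)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
sched = cosine_with_warmup(opt)
```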
PRs welcome — whether it’s bug fixes, new training configs, or dataset loaders. Please run the test suite (pytest) and conform to black formatting.
MIT © 2025 Duy Phúc Lê Nguyễn