
Nano_GPT

A minimalist yet performant GPT implementation built with PyTorch. Train or fine‑tune Transformer language models on a single GPU in just a few hundred lines of code.
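
The model itself is a standard decoder-only Transformer. As a point of reference, the sketch below shows a minimal causal self-attention layer in PyTorch; it illustrates the kind of code the repository contains, but the module names and details here are illustrative, not copied from the source.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """One multi-head causal self-attention layer (illustrative sketch)."""
    def __init__(self, n_embd=768, n_head=12, block_size=1024):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # fused query/key/value projection
        self.proj = nn.Linear(n_embd, n_embd)      # output projection
        # lower-triangular mask: position t may only attend to positions <= t
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("mask", mask.view(1, 1, block_size, block_size))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # reshape each to (batch, heads, time, head_dim)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))        # scaled dot-product
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)       # re-assemble heads
        return self.proj(y)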


Key Features

Capability           | Details
---------------------|--------
Tiny Code Footprint  | Core GPT model fits in ≈ 250 LOC. Clear, instructive implementation that maps 1-to-1 with the equations.
One-Command Training | Opinionated scripts (fineweb.py, hellaswag.py, train_gpt2.py) abstract away boilerplate; just point to a text file or Hugging Face dataset and go.
Single-GPU Friendly  | Out-of-the-box configs run on a consumer-grade GPU (≥ 8 GB VRAM) thanks to gradient checkpointing and mixed precision (see the sketch after this table).
Pluggable Tokenizers | Works with byte-level BPE, SentencePiece, or any transformers tokenizer.
Research Ready       | Explicit hooks for custom loss functions, curriculum schedules, or architectural tweaks.
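
The memory tricks behind the "Single-GPU Friendly" row are bf16 autocast (mixed precision) and gradient checkpointing. Below is a minimal sketch of a training step using both; the model and batch names are hypothetical, not the repository's exact training loop.

import torch
from torch.utils.checkpoint import checkpoint

def train_step(model, x, y, optimizer):
    """One optimizer step with bf16 mixed precision (sketch, not the repo's exact loop)."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits, loss = model(x, targets=y)        # assumes the model returns (logits, loss)
    loss.backward()                               # bf16 needs no GradScaler, unlike fp16
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    return loss.item()

# Gradient checkpointing: inside the model's forward pass, wrap each Transformer
# block so its activations are recomputed during backward instead of stored:
#     for block in self.blocks:
#         x = checkpoint(block, x, use_reentrant=False)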

Project Structure

.
├── fineweb.py        # Train from scratch on the FineWeb corpus
├── hellaswag.py      # Few‑shot / zero‑shot tune on HellaSwag (commonsense QA)
├── train_gpt2.py     # Re‑implement the original GPT‑2 small config
├── play.ipynb        # Notebook playground for rapid experiments
├── input.txt         # Tiny demo corpus (Shakespeare)
├── README.md         # You are here
└── ...

Quick Start

Prereqs: Python ≥ 3.10 · PyTorch ≥ 2.1 (CUDA 11+) · (optional) transformers, datasets

# 1) Clone & install
$ git clone https://github.com/<your-org>/Nano_GPT.git && cd Nano_GPT
$ pip install -r requirements.txt  # lightweight: torch, tqdm, sentencepiece, transformers

# 2) Sanity check on the toy corpus
$ python train_gpt2.py --dataset input.txt --max-iter 500 --eval-interval 50

# 3) Generate text
$ python train_gpt2.py --generate "To be, or not to be" --checkpoint ckpt/latest.pt

Datasets

Script       | Dataset                | Prep Utility
-------------|------------------------|-------------
fineweb.py   | FineWeb / Common Crawl | python tools/prepare_fineweb.py
hellaswag.py | HellaSwag              | Auto-downloads via the datasets API
Custom text  | Any UTF-8 .txt         | python tools/txt2bin.py --vocab-size 50_000

All preprocessing pipelines yield a binary .bin file of token IDs for maximal I/O throughput.
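
Because the tokens are packed into a flat binary file, training batches can be sliced straight out of a memory map with no Python-level tokenization at train time. A minimal sketch, assuming the .bin file is a flat array of uint16 token IDs (the actual dtype and layout depend on the preprocessing script):

import numpy as np
import torch

def get_batch(bin_path, batch_size=4, block_size=1024, device="cuda"):
    """Sample a random batch of (input, target) windows from a token .bin file."""
    data = np.memmap(bin_path, dtype=np.uint16, mode="r")
    ix = np.random.randint(0, len(data) - block_size - 1, size=batch_size)
    x = torch.stack([torch.from_numpy(data[i : i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1 : i + 1 + block_size].astype(np.int64)) for i in ix])
    return x.to(device), y.to(device)   # targets are the inputs shifted by one token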


Training Recipes

FineWeb (1B tokens) — 125 M param model

python fineweb.py \
  --model_dim 768 \
  --n_layer 12 \
  --n_head 12 \
  --batch_size 4 \
  --block_size 1024 \
  --epochs 3

HellaSwag Few‑Shot

python hellaswag.py --lr 5e-5 --epochs 5

From‑Scratch GPT‑2 Small Reproduction

python train_gpt2.py --config configs/gpt2_small.yaml
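
For reference, the GPT-2 "small" architecture that configs/gpt2_small.yaml targets boils down to a handful of hyperparameters. The values below are the publicly documented GPT-2 small settings; the repository's own config keys may be named differently.

from dataclasses import dataclass

@dataclass
class GPT2SmallConfig:
    n_layer: int = 12        # Transformer blocks
    n_head: int = 12         # attention heads per block
    n_embd: int = 768        # embedding / model dimension
    block_size: int = 1024   # maximum context length
    vocab_size: int = 50257  # GPT-2 byte-level BPE vocabulary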

Evaluation & Inference

# Perplexity on validation split
$ python eval.py --checkpoint ckpt/latest.pt --val data/fineweb/val.bin

# Interactive generation
$ python generate.py --checkpoint ckpt/best.pt --prompt "The meaning of life is"
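
Perplexity here is the standard exp(mean cross-entropy) over the validation tokens. A minimal sketch of that computation, assuming the model returns a mean per-token cross-entropy loss in nats (eval.py may differ in its details):

import math
import torch

@torch.no_grad()
def perplexity(model, val_batches):
    """exp of the token-weighted mean cross-entropy over the validation set."""
    total_loss, total_tokens = 0.0, 0
    for x, y in val_batches:
        _, loss = model(x, targets=y)             # mean cross-entropy for this batch
        total_loss += loss.item() * y.numel()     # weight by number of target tokens
        total_tokens += y.numel()
    return math.exp(total_loss / total_tokens)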

Results

Model              | Corpus         | PPL ↓ | Training Time*
-------------------|----------------|-------|---------------
GPT-2-small (ours) | FineWeb 1B     | 23.7  | 7 h on an RTX 4090
GPT-2-small (ref)  | OpenAI WebText | 29.5  |

* Mixed‑precision, bf16. Your mileage may vary.


Troubleshooting

Symptom          | Likely Cause   | Fix
-----------------|----------------|----
CUDA OOM         | Batch too big  | Lower batch_size or enable --grad-ckpt
Diverging loss   | LR too high    | Use a cosine scheduler with warmup (see the sketch after this table) or lower --lr
Slow dataloading | HDD bottleneck | Store the .bin files on an SSD or ramdisk
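
The "cosine scheduler with warmup" fix amounts to ramping the learning rate up linearly for a few hundred steps, then decaying it along a cosine curve to a floor. A minimal sketch (hypothetical helper; not a flag the scripts necessarily expose):

import math

def get_lr(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=200, max_steps=10_000):
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps        # linear warmup
    if step >= max_steps:
        return min_lr                                     # floor after the decay window
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes 1 -> 0
    return min_lr + cosine * (max_lr - min_lr)

Apply it once per step, e.g. for g in optimizer.param_groups: g["lr"] = get_lr(step).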

Contributing

PRs welcome — whether it’s bug fixes, new training configs, or dataset loaders. Please run the test suite (pytest) and conform to black formatting.


License

MIT © 2025 Duy Phúc Lê Nguyễn
