Shakespeare Transformer — Decoder-Only Transformer from Scratch (just like ChatGPT)

This project implements the Attention Is All You Need architecture from scratch in PyTorch: no high-level abstractions, no Hugging Face Transformers, just tensors.

I wrote:

  • full decoder-only transformer (GPT-style)
  • learned token embeddings + sinusoidal positional encodings
  • multi-head self-attention with causal masking
  • LayerNorm, dropout, residual connections, and a feed-forward network (FFN)
  • custom training loop with batching + MPS acceleration
  • character-level training on Shakespeare's works (no subword tokenizer, for simplicity)
  • greedy sampling (argmax decoding; beam search / top-k support planned; a minimal sketch follows this list)
  • generation that actually speaks English (1500s English, to be fair)
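
Here is a minimal sketch of the greedy (argmax) decoding loop, assuming a model whose forward pass maps a (batch, seq_len) tensor of token ids to (batch, seq_len, vocab_size) logits. The name greedy_generate and its signature are illustrative, not the repo's exact API:

```python
import torch

@torch.no_grad()
def greedy_generate(model, context: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    """Extend `context` one token at a time by taking the argmax of the last-step logits."""
    model.eval()
    for _ in range(max_new_tokens):
        logits = model(context)                                     # (batch, seq_len, vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick, (batch, 1)
        context = torch.cat([context, next_token], dim=1)           # feed it back in
    return context
```

Beam search or top-k sampling would replace the argmax line with, respectively, a scored-hypothesis expansion or a sample drawn from the k highest-probability tokens.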

Output included structured, Shakespearean dialogue with named characters (result after ~2 minutes of training).

Please check out notebook/demo_transformer.ipynb for the end-to-end model training.


Architecture

  • TokenEmbedding: nn.Embedding scaled by √d_model
  • PositionalEncoding: fixed sinusoidal
  • MultiHeadAttention: learnable Q/K/V projections per head (see the sketch after this list)
  • Decoder: stacked decoder blocks with:
    • multi-head self-attention
    • LayerNorm
    • feed-forward network
    • residual connections
  • no encoder and no cross-attention; generation is fully auto-regressive.
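
As a sketch of the attention core, here is a single head with learnable Q/K/V projections and a causal mask; multi-head attention runs several such heads in parallel and concatenates their outputs. Class and parameter names are illustrative rather than the repo's exact code:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """One attention head: learnable Q/K/V projections plus a causal mask."""
    def __init__(self, d_model: int, head_dim: int, max_len: int = 1024):
        super().__init__()
        self.q_proj = nn.Linear(d_model, head_dim)
        self.k_proj = nn.Linear(d_model, head_dim)
        self.v_proj = nn.Linear(d_model, head_dim)
        # Lower-triangular mask: position t may only attend to positions <= t.
        self.register_buffer(
            "mask", torch.tril(torch.ones(max_len, max_len, dtype=torch.bool))
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, d_model)
        T = x.size(1)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))    # (batch, T, T)
        scores = scores.masked_fill(~self.mask[:T, :T], float("-inf"))
        return F.softmax(scores, dim=-1) @ v                        # (batch, T, head_dim)
```

Registering the mask as a buffer (rather than a plain attribute) means it moves with the module when the model is sent to another device, e.g., MPS.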

Training Setup

  • Dataset: tiny-shakespeare
  • Level: character
  • Model size: d_model=512, 6 layers, 8 heads
  • Loss: cross-entropy (with label smoothing support)
  • Optimizer: Adam (training step sketched below)
  • Device: Apple M4 with MPS acceleration
  • Epoch time: ~1 min
  • Final loss: ~0.13
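
A minimal sketch of one training step under this setup, assuming the character ids live in a 1-D long tensor data and that model returns (batch, seq_len, vocab_size) logits; get_batch, the learning rate, and the step count here are illustrative assumptions, not the repo's exact values:

```python
import torch
import torch.nn.functional as F

# Prefer Apple's MPS backend when available (the repo trains on an M4).
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

def get_batch(data: torch.Tensor, block_size: int, batch_size: int):
    """Sample random character windows (x) and their next-character targets (y)."""
    starts = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[s : s + block_size] for s in starts])
    y = torch.stack([data[s + 1 : s + block_size + 1] for s in starts])
    return x.to(device), y.to(device)

model = model.to(device)                      # `model` is the decoder-only transformer above
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)   # lr is an illustrative choice

for step in range(1000):                      # step count is illustrative
    x, y = get_batch(data, block_size=256, batch_size=64)
    logits = model(x)                                         # (batch, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)),  # flatten for per-token CE
                           y.view(-1), label_smoothing=0.1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```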

