This project implements a Vision Transformer (ViT) architecture from the ground up in PyTorch, inspired by the research paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by Dosovitskiy et al. (arXiv:2010.11929).
The model is trained and evaluated on the CIFAR-10 dataset — chosen for its compact size and suitability for experimentation.
This project was a hands-on exploration of the core ViT architecture, aimed at understanding transformer-based vision models without relying on pre-built modules or pre-trained weights.
## Features

- ViT architecture implemented entirely from scratch (a minimal sketch of the architecture follows this list)
- Patch embedding via `Conv2d` with flattening and projection
- Learnable positional encodings with a `[CLS]` token
- Transformer encoder blocks with:
  - Multi-head self-attention
  - MLP layers + GELU activation
- Classification head operating on the `[CLS]` token
- Custom training and evaluation loops
- Grid-based visualization of predictions for qualitative insight
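As a reference for how these pieces fit together, here is a minimal, self-contained sketch of the architecture described above. It is an illustration rather than the exact code in this repository; the class names, hyperparameters (patch size 4, embedding dim 192, depth 6, etc.), the pre-norm block layout, and the use of `nn.MultiheadAttention` are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split the image into patches with a strided Conv2d, then flatten and project."""
    def __init__(self, img_size=32, patch_size=4, in_chans=3, embed_dim=192):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Conv2d with kernel == stride == patch_size acts as a per-patch linear projection
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, 32, 32)
        x = self.proj(x)                       # (B, embed_dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, embed_dim)

class EncoderBlock(nn.Module):
    """Pre-norm transformer encoder block: multi-head self-attention + MLP with GELU."""
    def __init__(self, embed_dim=192, num_heads=3, mlp_ratio=4.0, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        hidden = int(embed_dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.GELU(),
            nn.Dropout(dropout), nn.Linear(hidden, embed_dim), nn.Dropout(dropout),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual around attention
        return x + self.mlp(self.norm2(x))                   # residual around MLP

class ViT(nn.Module):
    def __init__(self, num_classes=10, embed_dim=192, depth=6, num_heads=3):
        super().__init__()
        self.patch_embed = PatchEmbedding(embed_dim=embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learnable positional encodings for the [CLS] token plus all patch tokens
        self.pos_embed = nn.Parameter(torch.zeros(1, 1 + self.patch_embed.num_patches, embed_dim))
        self.blocks = nn.Sequential(*[EncoderBlock(embed_dim, num_heads) for _ in range(depth)])
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.norm(self.blocks(x))
        return self.head(x[:, 0])              # classify from the [CLS] token

# Example: logits = ViT()(torch.randn(8, 3, 32, 32))  -> shape (8, 10)
```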
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/HrishikeshUchake/vit-from-scratch.git
  cd vit-from-scratch
  ```

- Install dependencies:

  ```bash
  pip install torch torchvision matplotlib
  ```

## Dataset

- Uses CIFAR-10 from `torchvision.datasets`
- Automatically downloads and normalizes the dataset (see the data-pipeline sketch below)
- 10 classes, 32×32 color images, ideal for quick transformer training experiments
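To make the data pipeline concrete, here is one way the CIFAR-10 loading and a bare-bones training/evaluation loop could look. The normalization statistics, batch sizes, optimizer settings, and the reuse of the `ViT` class from the sketch above are assumptions, not necessarily what this repository does.

```python
import torch
import torchvision
import torchvision.transforms as T

# Commonly used CIFAR-10 normalization stats (assumed; the repo's values may differ)
transform = T.Compose([
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10("./data", train=False, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=256)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ViT().to(device)        # ViT is the class from the architecture sketch above
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(10):
    # Training pass
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

    # Evaluation pass: top-1 accuracy on the test split
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            correct += (model(images).argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    print(f"epoch {epoch}: test accuracy {correct / total:.3f}")
```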
## Visualization

After training, the model produces color-coded grid plots of predictions vs. ground truth (sketched below), useful for:
- Identifying common failure modes
- Visual confirmation of model confidence
- Quick debugging and qualitative evaluation
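A rough sketch of how such a color-coded grid might be produced with matplotlib is shown below. The 4×4 grid size, the CIFAR-10 class names, and the approximate un-normalization step are illustrative assumptions rather than the repository's exact plotting code.

```python
import matplotlib.pyplot as plt
import torch

CLASSES = ("plane", "car", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck")

def plot_prediction_grid(model, loader, device, n=16):
    """Show an n-image grid; titles are green when the prediction matches the label, red otherwise."""
    model.eval()
    images, labels = next(iter(loader))
    images, labels = images[:n], labels[:n]
    with torch.no_grad():
        preds = model(images.to(device)).argmax(dim=1).cpu()

    fig, axes = plt.subplots(4, 4, figsize=(8, 8))
    for ax, img, label, pred in zip(axes.flat, images, labels, preds):
        # Roughly undo normalization for display (assumed stats), then clamp to [0, 1]
        img = (img * 0.25 + 0.47).clamp(0, 1)
        ax.imshow(img.permute(1, 2, 0))       # CHW -> HWC for matplotlib
        ax.set_title(f"{CLASSES[pred]} / {CLASSES[label]}",
                     color="green" if pred == label else "red", fontsize=8)
        ax.axis("off")
    plt.tight_layout()
    plt.show()
```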
## License

MIT License — feel free to fork, modify, and build upon the code for personal or academic use.

Developed by Hrishikesh Uchake
## Future Work

- Support for larger datasets (e.g. CIFAR-100, TinyImageNet)
- Accuracy/loss logging with `TensorBoard` (a possible starting point is sketched below)
- CLI training and evaluation wrapper
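For the logging item above, the standard `torch.utils.tensorboard.SummaryWriter` API could be wired into the existing training loop roughly as follows; the log directory, tag names, and call sites are assumptions since this is not yet implemented.

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/vit-cifar10")   # hypothetical log directory

def log_epoch(epoch, train_loss, test_accuracy):
    """Record per-epoch scalars so they appear in TensorBoard's scalar dashboard."""
    writer.add_scalar("train/loss", train_loss, epoch)
    writer.add_scalar("test/accuracy", test_accuracy, epoch)

# e.g. call log_epoch(epoch, loss.item(), correct / total) at the end of each epoch,
# then run `tensorboard --logdir runs` to view the curves.
```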