This is an extension of the scratch_former project for Indian Knowledge Systems.
A GPT-like transformer model built from scratch for generating and understanding text from Indian Knowledge Systems (IKS), including Sanskrit verses, definitions, and historical texts. This project implements a complete pipeline from tokenization to model training and inference.
- Custom Transformer Architecture: From-scratch implementation of multi-head attention, feed-forward networks, and transformer blocks
- Tokenization: BPE-based efficient tokenization for IKS text with support for special tokens and multiple scripts
- Training Pipelines: Multiple training approaches including standard PyTorch, fine-tuning, and PyTorch Lightning implementations
- Data Processing: Utilities for handling verses, words, and text chunks with noise augmentation
- Inference: Text generation with sampling strategies (top-p filtering, repetition penalty)
- basic.py: Core transformer architecture implementation
  - SelfAttn: Single-head self-attention mechanism
  - MultiHeadAttn: Multi-head attention with multiple parallel attention heads
  - TransfBlock: Transformer block combining attention and feed-forward layers
  - TasnsfModel: Complete transformer model with embeddings and output projection
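The single-head attention building block can be sketched as follows. This is an illustrative minimal version, not the project's exact SelfAttn code; the class name SelfAttnSketch and the fused QKV projection are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttnSketch(nn.Module):
    """Minimal single-head causal self-attention (illustrative sketch only)."""
    def __init__(self, emb_dim: int):
        super().__init__()
        self.qkv = nn.Linear(emb_dim, 3 * emb_dim)  # fused Q, K, V projection
        self.scale = emb_dim ** -0.5

    def forward(self, x):
        # x: (batch, seq_len, emb_dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        att = (q @ k.transpose(-2, -1)) * self.scale        # (B, T, T) scores
        # Causal mask: each position may only attend to itself and the past.
        causal = torch.triu(torch.ones(att.shape[-2:], dtype=torch.bool), diagonal=1)
        att = att.masked_fill(causal, float("-inf"))
        return F.softmax(att, dim=-1) @ v

x = torch.randn(2, 8, 32)
y = SelfAttnSketch(32)(x)   # output keeps shape (2, 8, 32)
```

MultiHeadAttn would run several such heads in parallel on split embedding slices and concatenate the results.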
- model_config.py: Hyperparameter configuration for a 300M-parameter transformer model
  - Vocabulary size: 25,000
  - Sequence length: 1,024
  - Embedding dimension: 1,024
  - Number of attention heads: 16
  - Number of transformer layers: 15
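As a back-of-envelope sanity check, the parameter count implied by these hyperparameters can be estimated for a standard GPT-style layout. The 4x feed-forward expansion and untied input/output embeddings are assumptions, and biases and layer norms are ignored, so the project's actual total will differ:

```python
def gpt_param_count(vocab, seq_len, d, n_layers, ffn_mult=4, tied_embeddings=False):
    """Rough GPT-style parameter count (ignores biases and layer norms)."""
    emb = vocab * d + seq_len * d          # token + positional embeddings
    attn = 4 * d * d                       # Q, K, V and output projections
    ffn = 2 * ffn_mult * d * d             # up- and down-projection
    head = 0 if tied_embeddings else vocab * d
    return emb + n_layers * (attn + ffn) + head

total = gpt_param_count(25_000, 1024, 1024, 15)
print(f"{total / 1e6:.0f}M")  # → 241M
```

Under these assumptions the count lands near 240M; the 300M headline figure presumably reflects implementation details (e.g. a wider feed-forward layer) not captured by this sketch.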
- pl_train.py: PyTorch Lightning training implementation
  - Distributed training support with PyTorch Lightning
  - Checkpoint management and logging using Lightning callbacks
  - Logs saved in the pl_models/lightning_logs/ directory
- ft_train.py: Fine-tuning training script
  - Loads the pre-trained model from models/
  - Trains on question-answer data
  - Saves fine-tuned models in the ft_models/ directory
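For fine-tuning, each question-answer pair presumably has to be serialized with the tokenizer's special tokens, matching the prompt format used in the generation example. A minimal sketch of such a formatter (the function name and exact layout are assumptions):

```python
def format_qa(question: str, answer: str) -> str:
    """Wrap a QA pair with the chat special tokens used at inference time.
    Hypothetical helper; layout inferred from the generation example."""
    return f"<user>{question}<system>{answer}<eos>"

sample = format_qa('meaning of "dharma"?', "a righteous duty or office.")
# '<user>meaning of "dharma"?<system>a righteous duty or office.<eos>'
```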
Scripts for loading checkpoints for inference and testing, with options for sampling parameters such as temperature, top-p (nucleus sampling), and repetition penalty.
- pl_test.py: Text generation script for the model trained with PyTorch Lightning
- ft_test.ipynb: Jupyter notebook for testing answer generation from questions using the fine-tuned models
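The three sampling knobs can be illustrated with a toy decoding step. This is a sketch of the standard techniques, not the project's actual code; the function name and default values are assumptions:

```python
import numpy as np

def sample_next(logits, generated, temperature=0.8, top_p=0.9,
                rep_penalty=1.2, rng=None):
    """One decoding step: repetition penalty, temperature, then top-p filtering."""
    rng = rng if rng is not None else np.random.default_rng()
    logits = logits.astype(np.float64).copy()
    # Repetition penalty: push down logits of tokens already generated.
    for t in set(generated):
        logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
    # Temperature: <1 sharpens the distribution, >1 flattens it.
    logits /= temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-p: keep the smallest set of tokens whose cumulative mass exceeds top_p.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))
```

A generation loop would call this repeatedly, appending each sampled token to `generated` until `<eos>` appears or the sequence length limit is hit.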
- tokenizer/bpe.py: BPE (Byte Pair Encoding) tokenizer training
  - Builds vocabulary from verse, word, and chunk data
  - Learns merge operations for frequent byte-pair combinations
  - Outputs vocabulary with frequency information
- tokenizer/merges.json, tokenizer/merges_spl.json: Learned BPE merge vocabularies
- tokenizer/tokenizer.py: BPE tokenizer implementation
  - Uses pre-learned merges from the JSON files
  - Encodes text to token IDs and decodes token IDs back to text
  - Supports special tokens (<pad>, <eos>, <user>, <system>)
  - Usage:

    ```python
    from tokenizer.tokenizer import encode, decode

    # Tokenize text
    tokens = encode("sita the daughter of")

    # Decode tokens back to text
    text = decode(tokens)
    ```
- utils.py, ft_utils.py, val_utils.py: Utility functions for training, validation, and fine-tuning the model, with functions such as aks_translit(), shuff_drop(), and add_noise()
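A noise-augmentation helper like shuff_drop() might look roughly like the following. This is a hypothetical sketch: the signature, defaults, and the shuffle-window idea are all assumptions, and the project's real function may differ:

```python
import random

def shuff_drop(tokens, drop_p=0.1, shuffle_window=3, rng=None):
    """Hypothetical noise augmentation: randomly drop tokens, then locally
    shuffle the survivors within small windows to perturb word order."""
    rng = rng if rng is not None else random.Random()
    kept = [t for t in tokens if rng.random() > drop_p]
    out = []
    for i in range(0, len(kept), shuffle_window):
        window = kept[i:i + shuffle_window]
        rng.shuffle(window)  # shuffle in place within the window
        out.extend(window)
    return out
```

Training on such perturbed copies of the verses can make the model more robust to word-order and spelling variation in the source texts.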
- old_scripts: Previous versions of the training and testing scripts
- Verses: Sanskrit verses from Indian Knowledge Systems
- Words: Dictionary words with definitions
- Chunks: Larger text passages and chapters
```shell
# PyTorch Lightning training
python pl_train.py

# Generate text from seed
python pl_test.py
```

```python
question = '''meaning of "dharma"?'''
prompt = "<user>" + question + "<system>"
print(generate(prompt))

# Example output:
# <user>meaning of "dharma"?<system>
# the term "dharma" refers to a specific duty or office that is considered righteous. it can refer to the duties of a king, a king, and also to the duties of an ascetic (stayed in pious practices).<eos>
```