Fine-tuned TinyLlama 1.1B on 22,000+ Indian Supreme Court Judgements using QLoRA for domain-specific legal language understanding.
- Overview
- Demo
- Problem Statement
- Solution
- Key Results
- Project Structure
- Tech Stack
- Core Concepts
- Dataset
- Model Architecture
- Training Pipeline
- Evaluation
- Installation
- Usage
- Deployment
- Resume Highlights
- References
LexRA is an end-to-end LLM finetuning project that adapts TinyLlama 1.1B to Indian legal language using QLoRA, the industry-standard technique for efficient finetuning on limited hardware.
| Property | Value |
|---|---|
| Base Model | TinyLlama 1.1B Chat |
| Dataset | Indian Supreme Court Judgements |
| Training Samples | 22,146 |
| Finetuning Method | QLoRA (4-bit NF4 + LoRA) |
| Trainable Parameters | 6.3M / 1.1B (0.57%) |
| Training Hardware | Google Colab T4 GPU (16GB) |
| Base Perplexity | 6.51 |
| Finetuned Perplexity | 3.07 |
| Improvement | 52.84% |
Live Demo: huggingface.co/spaces/aapnakaamkar/LexRA
Model Weights: huggingface.co/aapnakaamkar/LexRA-TinyLlama-Legal
General-purpose language models perform poorly on domain-specific legal text because:
- Legal language has highly specialized vocabulary (appellant, petitioner, writ, cognizable)
- Indian legal documents follow specific structural patterns different from general English
- Full finetuning of 1.1B+ parameter models requires 14GB+ GPU memory, which is inaccessible on free hardware
- Load TinyLlama in 4-bit NF4 quantization, reducing memory from 4.4GB to 550MB (sketched below)
- Inject LoRA adapters (rank 8), so only 0.57% of parameters are trained
- Train on 22,146 Indian Supreme Court judgements formatted as instruction-response pairs
- Evaluate with perplexity: 52.84% improvement over the base model
- Deploy via Gradio on HuggingFace Spaces with a side-by-side base-vs-finetuned comparison
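As a minimal sketch of the first step, assuming the standard Transformers + BitsAndBytes API (the exact flags used in train_collab.ipynb may differ):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization config (illustrative values)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# load TinyLlama with frozen 4-bit weights (~0.55 GB instead of ~4.4 GB)
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization_config=bnb_config,
    device_map="auto",
)
```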
| Metric | Base TinyLlama | LexRA Finetuned | Improvement |
|---|---|---|---|
| Perplexity | 6.51 | 3.07 | 52.84% ↓ |
| Cross Entropy Loss | 1.874 | 1.121 | 40.18% ↓ |
| Trainable Parameters | 1.1B | 6.3M | 99.43% ↓ |
| Model Memory (GPU) | 4.4 GB | 0.55 GB | 87.5% ↓ |
| Adapter File Size | N/A | ~50 MB | N/A |
```
LexRA/
├── data/
│   ├── raw/
│   └── processed/
│       ├── train.jsonl            # 22,146 samples
│       └── val.jsonl              # 2,461 samples
├── scripts/
│   ├── prepare_data.py
│   └── evaluate.py
├── app/
│   ├── app.py
│   └── inference.py
├── notebooks/
│   └── train_collab.ipynb
├── docs/
│   ├── perplexity_results.json
│   └── comparison_screenshot.png
├── requirements.txt
├── .gitignore
└── README.md
```
| Component | Technology | Purpose |
|---|---|---|
| Language | Python 3.10 | Core development |
| Deep Learning | PyTorch | Tensor operations, GPU compute |
| LLM Framework | HuggingFace Transformers | Model loading, tokenization, training |
| Efficient Finetuning | PEFT | LoRA adapter injection and management |
| Quantization | BitsAndBytes | 4-bit NF4 quantization |
| Training Loop | HuggingFace Trainer | Automated training with checkpointing |
| Data Processing | HuggingFace Datasets | Dataset loading and preprocessing |
| Mixed Precision | Accelerate | fp16 training on GPU |
| Frontend | Gradio | Web interface for inference |
| Model Hosting | HuggingFace Hub | Adapter weights storage |
| Deployment | HuggingFace Spaces | Public inference API with auto Docker CI/CD |
| Training Hardware | Google Colab T4 | Free 16GB GPU |
QLoRA combines quantization and LoRA to make finetuning large models on limited hardware possible.
Q = Quantization (4-bit NF4)

```
float32:   1.1B × 4 bytes  = 4.4 GB
NF4 4-bit: 1.1B × 0.5 byte = 0.55 GB (87.5% reduction)
```
LoRA = Low-Rank Adaptation

```
Instead of updating W (4096×4096 = 16.7M params):
learn A (4096×8) + B (8×4096) = 65K params

Output = W·x + B·A·x
```

W stays frozen. Only A and B are trained.
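A quick worked check of those parameter counts:

```python
d, r = 4096, 8
full_update = d * d               # updating W directly: 16,777,216 params
lora_update = d * r + r * d       # A (4096×8) + B (8×4096): 65,536 params
print(lora_update / full_update)  # ≈ 0.0039, i.e. ~0.4% of one weight matrix
```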
Neural network weights follow a normal distribution: most values cluster near zero. NF4 places its 16 quantization buckets based on this distribution, concentrating precision where most weights exist.

```
Regular 4-bit: equal bucket spacing → more rounding error
NF4: normal-distribution spacing → less rounding error for LLM weights
```
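To see why distribution-aware bucket placement helps, here is a toy comparison. This is not the real NF4 codebook, just the idea of quantile-based vs. uniform levels on normal-ish weights:

```python
import torch

torch.manual_seed(0)
w = torch.randn(100_000)  # LLM-like weights: roughly normal, clustered near zero

# 16 uniformly spaced levels vs. 16 quantile-based levels (NF4-like idea)
uniform = torch.linspace(w.min().item(), w.max().item(), 16)
quantile = torch.quantile(w, torch.linspace(0.01, 0.99, 16))

def quantize(weights, levels):
    # snap each weight to its nearest quantization level
    idx = (weights[:, None] - levels[None, :]).abs().argmin(dim=1)
    return levels[idx]

print((w - quantize(w, uniform)).abs().mean())   # larger average rounding error
print((w - quantize(w, quantile)).abs().mean())  # smaller: levels sit where weights are
```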
Measures how "confused" the model is when predicting each next token. Lower = better.

```
Perplexity = e^(cross_entropy_loss)

Base:      e^1.874 = 6.51  (confused between ~6 words per step)
Finetuned: e^1.121 = 3.07  (confused between ~3 words per step)
```
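The reported numbers can be reproduced in a few lines (rounding of the losses explains the small drift from 52.84%):

```python
import math

base_ppl = math.exp(1.874)       # ≈ 6.51
finetuned_ppl = math.exp(1.121)  # ≈ 3.07
print(f"improvement: {1 - finetuned_ppl / base_ppl:.2%}")  # ≈ 52.9%
```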
Linear transformations inside attention that map input to different representation spaces:

```
q_proj → Query: what am I looking for?
k_proj → Key:   what do I contain?
v_proj → Value: what do I return if selected?
```

LoRA targets q_proj and v_proj, the most impactful layers for domain adaptation.
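One way to see these layers by name is a quick inspection pass (this downloads the base model):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# print the attention projections LoRA can target, e.g. model.layers.0.self_attn.q_proj
for name, module in model.named_modules():
    if name.endswith(("q_proj", "k_proj", "v_proj")):
        print(name, tuple(module.weight.shape))
```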
Simulates larger batch sizes on limited GPU memory:

```
batch_size=2 × gradient_accumulation=8 → effective batch = 16

Accumulate gradients over 16 samples before one weight update
Same result as batch_size=16 at 8× less memory
```
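A self-contained toy version of what the Trainer does internally (a stand-in linear model, not the LLM):

```python
import torch
from torch import nn

model = nn.Linear(10, 1)                                 # stand-in for the LLM
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [(torch.randn(2, 10), torch.randn(2, 1)) for _ in range(16)]  # batch_size=2

accum_steps = 8                                          # 2 × 8 = effective batch of 16
optimizer.zero_grad()
for step, (x, y) in enumerate(batches):
    loss = nn.functional.mse_loss(model(x), y) / accum_steps  # scale so grads average
    loss.backward()                                      # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                                 # one weight update per 16 samples
        optimizer.zero_grad()
```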
```
A → random Gaussian initialization
B → zero initialization

At start: B·A = 0, model behaves exactly like the base model
As training progresses: B learns, corrections are gradually applied
```

B is initialized to zero (not A) so gradients flow properly to both matrices from step 1.
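A tiny numerical check of that initialization:

```python
import torch

d, r = 16, 4                   # tiny dimensions for illustration
W = torch.randn(d, d)          # frozen pretrained weight
A = torch.randn(r, d) * 0.01   # A: random Gaussian init
B = torch.zeros(d, r)          # B: zero init

x = torch.randn(d)
# B @ (A @ x) is exactly zero at the start, so the output equals the base model's
assert torch.allclose(W @ x + B @ (A @ x), W @ x)
```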
Source: viber1/indian-law-dataset
| Property | Value |
|---|---|
| Total Samples | 24,607 |
| Train Split | 22,146 (90%) |
| Validation Split | 2,461 (10%) |
| Columns | Instruction, Response |
| Domain | Indian Supreme Court Judgements |
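The split can be reproduced along these lines; the split name and seed here are assumptions, and prepare_data.py may do this differently:

```python
from datasets import load_dataset

ds = load_dataset("viber1/indian-law-dataset")  # columns: Instruction, Response
splits = ds["train"].train_test_split(test_size=0.1, seed=42)  # seed assumed
print(len(splits["train"]), len(splits["test"]))  # ≈ 22,146 / 2,461
```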
Prompt Format:

```
### Instruction:
What is the meaning of anticipatory bail?

### Response:
Anticipatory bail is a direction to release a person on bail,
issued even before the person is arrested...
```
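A small helper (hypothetical name, not necessarily what prepare_data.py uses) produces that format:

```python
def format_example(instruction: str, response: str) -> str:
    # hypothetical helper mirroring the template above
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"

print(format_example("What is the meaning of anticipatory bail?",
                     "Anticipatory bail is a direction to release a person on bail..."))
```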
```
Input → Tokenizer → Embeddings
               │
    [Transformer Block × 22]
               │
       ┌───────┴───────┐
       │               │
W·x (frozen NF4)  B·A·x (LoRA, trainable)
       │               │
       └───────┬───────┘
           Addition
               │
         [Next Layer]
               │
    Language Model Head
               │
Token Probabilities → Text
```
LoRA Config:

```python
LoraConfig(r=8, lora_alpha=16,
           target_modules=["q_proj", "v_proj"],
           lora_dropout=0.05, bias="none",
           task_type="CAUSAL_LM")
```

```
HuggingFace Dataset
        ↓ prepare_data.py
JSONL train/val files
        ↓ train_collab.ipynb
Tokenize (max_length=512)
        ↓
Load TinyLlama in 4-bit NF4
        ↓
prepare_model_for_kbit_training()
        ↓
get_peft_model() → 0.57% trainable params
        ↓
HuggingFace Trainer (3 epochs, lr=2e-4, fp16)
Save checkpoint every 500 steps
        ↓
Best model → HuggingFace Hub
```
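Condensed, the PEFT wiring and Trainer call look roughly like this; it assumes model is the 4-bit TinyLlama loaded earlier and train_ds / val_ds are the tokenized splits, and the notebook's exact arguments may differ:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import Trainer, TrainingArguments

model = prepare_model_for_kbit_training(model)  # stabilizes 4-bit training (fp32 norms, etc.)
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"))
model.print_trainable_parameters()              # ~6.3M of 1.1B (0.57%)

args = TrainingArguments(
    output_dir="checkpoints",
    num_train_epochs=3,
    learning_rate=2e-4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    fp16=True,
    save_steps=500,
)
# a causal-LM data collator is omitted from this sketch
Trainer(model=model, args=args,
        train_dataset=train_ds, eval_dataset=val_ds).train()
```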
```bash
python scripts/evaluate.py
```

Output:

```
Base Model      → Loss: 1.8740 | Perplexity: 6.5187
Finetuned Model → Loss: 1.1210 | Perplexity: 3.0745
Improvement: 52.84%
Saved to docs/perplexity_results.json
```
```bash
git clone https://github.com/aapnakaamkar/LexRA.git
cd LexRA
python -m venv venv
venv\Scripts\activate         # Windows
source venv/bin/activate      # Linux/macOS
pip install -r requirements.txt
```

requirements.txt:

```
torch
transformers
peft
datasets
gradio
accelerate
bitsandbytes>=0.46.1
ipykernel
```
Prepare data:

```bash
python scripts/prepare_data.py
```

Train (Colab):
- Upload train_collab.ipynb to Google Colab
- Runtime → T4 GPU
- Run all cells

Evaluate:

```bash
python scripts/evaluate.py
```

Run app:

```bash
python app/app.py
# http://127.0.0.1:7860
```

Load model in code:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from peft import PeftModel
import torch

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                                             torch_dtype=torch.float32)

# attach the LoRA adapter, then merge it into the base weights for plain inference
model = PeftModel.from_pretrained(model, "aapnakaamkar/LexRA-TinyLlama-Legal")
model = model.merge_and_unload()

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer,
                max_new_tokens=256, temperature=0.3, do_sample=True)

prompt = "### Instruction:\nWhat is bail?\n\n### Response:\n"
print(pipe(prompt)[0]["generated_text"].split("### Response:\n")[-1])
```

```
Upload app.py + requirements.txt to HuggingFace Space
        ↓
HF detects Gradio SDK → builds Docker container
        ↓
Installs requirements → runs app.py
        ↓
Serves public HTTPS URL (auto-rebuilds on file change)
```
URL: https://huggingface.co/spaces/aapnakaamkar/LexRA
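A minimal single-model version of the Space's app; the real app.py does a side-by-side comparison, and pipe here is the pipeline from the loading snippet above:

```python
import gradio as gr

def answer(question: str) -> str:
    # wrap the question in the training prompt template, return only the response
    prompt = f"### Instruction:\n{question}\n\n### Response:\n"
    generated = pipe(prompt)[0]["generated_text"]
    return generated.split("### Response:\n")[-1]

gr.Interface(fn=answer, inputs="text", outputs="text",
             title="LexRA: Indian Legal Q&A").launch()
```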
- Fine-tuned TinyLlama 1.1B on 22,000+ Indian Supreme Court judgements using LoRA adapters, reducing trainable parameters by 99.43%, from 1.1B to 6.3M
- Implemented 4-bit NF4 quantization via BitsAndBytes, reducing model memory from 4.4GB to 550MB and enabling training on a free-tier T4 GPU
- Engineered a data preprocessing pipeline to format 24,000+ raw legal records into instruction-response pairs for supervised finetuning
- Achieved a 52.84% perplexity reduction (6.51 → 3.07) on held-out Indian legal text, demonstrating successful domain adaptation
- Deployed an interactive Gradio interface on HuggingFace Spaces showcasing a side-by-side base vs. finetuned model comparison on Indian legal queries
| Resource | Link |
|---|---|
| QLoRA Paper (Dettmers et al. 2023) | arxiv.org/abs/2305.14314 |
| LoRA Paper (Hu et al. 2021) | arxiv.org/abs/2106.09685 |
| TinyLlama | huggingface.co/TinyLlama |
| Dataset | viber1/indian-law-dataset |
| PEFT Library | github.com/huggingface/peft |
| BitsAndBytes | github.com/TimDettmers/bitsandbytes |
| HuggingFace Transformers | github.com/huggingface/transformers |
| Attention Is All You Need | arxiv.org/abs/1706.03762 |
MIT License. See LICENSE for details.
Kushagra Bhargava
- HuggingFace: aapnakaamkar
- GitHub: kushagra651
Built with QLoRA, PEFT, and the HuggingFace ecosystem