βš–οΈ LexRA β€” Legal Reasoning Assistant

Fine-tuned TinyLlama 1.1B on 22,000+ Indian Supreme Court Judgements using QLoRA for domain-specific legal language understanding.





πŸ” Overview

LexRA is an end-to-end LLM finetuning project that adapts TinyLlama 1.1B to Indian legal language using QLoRA, a widely used technique for memory-efficient finetuning on limited hardware.

Property              Value
--------------------------------------------------
Base Model            TinyLlama 1.1B Chat
Dataset               Indian Supreme Court Judgements
Training Samples      22,146
Finetuning Method     QLoRA (4-bit NF4 + LoRA)
Trainable Parameters  6.3M / 1.1B (0.57%)
Training Hardware     Google Colab T4 GPU (16GB)
Base Perplexity       6.51
Finetuned Perplexity  3.07
Improvement           52.84%

🎯 Demo

Live Demo: huggingface.co/spaces/aapnakaamkar/LexRA

Model Weights: huggingface.co/aapnakaamkar/LexRA-TinyLlama-Legal


❗ Problem Statement

General-purpose language models perform poorly on domain-specific legal text because:

  • Legal language has highly specialized vocabulary (appellant, petitioner, writ, cognizable)
  • Indian legal documents follow specific structural patterns different from general English
  • Full finetuning of 1.1B+ parameter models requires 14GB+ GPU memory β€” inaccessible on free hardware

✅ Solution

  1. Load TinyLlama in 4-bit NF4 quantization, cutting model memory from 4.4 GB to 550 MB (loading sketch below)
  2. Inject LoRA adapters (rank 8) so only 0.57% of parameters are trained
  3. Train on 22,000+ Indian Supreme Court judgements as instruction-response pairs
  4. Evaluate using perplexity: 52.84% improvement over the base model
  5. Deploy via Gradio on HuggingFace Spaces with side-by-side comparison
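A minimal loading sketch for step 1, assuming the standard Transformers + BitsAndBytes API; the exact arguments used in train_collab.ipynb may differ:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat 4-bit
    bnb_4bit_compute_dtype=torch.float16,   # do matmuls in fp16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization_config=bnb_config,
    device_map="auto",                      # fits comfortably on a T4
)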

📊 Key Results

Metric                  Base TinyLlama    LexRA Finetuned    Improvement
-------------------------------------------------------------------------
Perplexity              6.51              3.07               52.84% ↓
Cross Entropy Loss      1.874             1.121              40.18% ↓
Trainable Parameters    1.1B              6.3M               99.43% ↓
Model Memory (GPU)      4.4 GB            0.55 GB            87.5%  ↓
Adapter File Size       n/a               ~50 MB             n/a

πŸ“ Project Structure

LexRA/
├── data/
│   ├── raw/
│   └── processed/
│       ├── train.jsonl             # 22,146 samples
│       └── val.jsonl               # 2,461 samples
├── scripts/
│   ├── prepare_data.py
│   └── evaluate.py
├── app/
│   ├── app.py
│   └── inference.py
├── notebooks/
│   └── train_collab.ipynb
├── docs/
│   ├── perplexity_results.json
│   └── comparison_screenshot.png
├── requirements.txt
├── .gitignore
└── README.md

πŸ› οΈ Tech Stack

Component             Technology                 Purpose
-------------------------------------------------------------------------------------------
Language              Python 3.10                Core development
Deep Learning         PyTorch                    Tensor operations, GPU compute
LLM Framework         HuggingFace Transformers   Model loading, tokenization, training
Efficient Finetuning  PEFT                       LoRA adapter injection and management
Quantization          BitsAndBytes               4-bit NF4 quantization
Training Loop         HuggingFace Trainer        Automated training with checkpointing
Data Processing       HuggingFace Datasets       Dataset loading and preprocessing
Mixed Precision       Accelerate                 fp16 training on GPU
Frontend              Gradio                     Web interface for inference
Model Hosting         HuggingFace Hub            Adapter weights storage
Deployment            HuggingFace Spaces         Public inference API with auto Docker CI/CD
Training Hardware     Google Colab T4            Free 16GB GPU

📚 Core Concepts

QLoRA: Quantized Low-Rank Adaptation

QLoRA combines quantization and LoRA to make finetuning large models on limited hardware possible.

Q = Quantization (4-bit NF4)

float32:  1.1B × 4 bytes  = 4.4 GB
NF4 4bit: 1.1B × 0.5 byte = 0.55 GB  (87.5% reduction)

LoRA = Low-Rank Adaptation

Instead of updating W (4096×4096 = 16.7M params):
Learn A (4096×8) + B (8×4096) = 65K params

Output = W×x + B×A×x
W stays frozen. Only A and B are trained.
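A toy PyTorch sketch of this decomposition (illustrative only; in the project the adapters are injected by PEFT, not hand-written):

import torch
import torch.nn as nn

class ToyLoRALinear(nn.Module):
    """Frozen weight W plus a trainable low-rank update B @ A."""
    def __init__(self, d_in=4096, d_out=4096, r=8):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.W.weight.requires_grad_(False)                  # W stays frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))         # zero init, so B@A = 0 at start

    def forward(self, x):
        return self.W(x) + x @ self.A.T @ self.B.T           # W·x + B·A·x

layer = ToyLoRALinear()
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 65,536 trainable params vs 16.7M in W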

NF4 = NormalFloat 4-bit

Neural network weights follow a normal distribution: most values cluster near zero. NF4 places its 16 quantization buckets based on this distribution, concentrating precision where most weights exist.

Regular 4-bit: equal bucket spacing  → more rounding error
NF4:           normal dist spacing   → less rounding error for LLM weights
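A rough numerical illustration of why quantile-spaced levels round better for bell-shaped weights; this is a conceptual sketch, not the actual NF4 codebook used by bitsandbytes:

import torch
from torch.distributions import Normal

w = torch.randn(100_000) * 0.02          # toy weight tensor, roughly bell-shaped
w = w / w.abs().max()                    # absmax-normalize into [-1, 1]

uniform_levels = torch.linspace(-1, 1, 16)                       # equal spacing
quantile_levels = Normal(0.0, 1.0).icdf(torch.linspace(0.02, 0.98, 16))
quantile_levels = quantile_levels / quantile_levels.abs().max()  # rescale to [-1, 1]

def round_to(levels, x):
    # snap every value to its nearest quantization level
    return levels[(x.unsqueeze(-1) - levels).abs().argmin(-1)]

for name, levels in [("equal spacing", uniform_levels),
                     ("quantile spacing", quantile_levels)]:
    print(name, (w - round_to(levels, w)).abs().mean().item())
# quantile spacing gives the smaller mean rounding error for bell-shaped weights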

Perplexity

Measures how "confused" the model is when predicting the next token. Lower = better.

Perplexity = e^(cross_entropy_loss)

Base:       e^1.874 = 6.51  (confused between ~6 words per step)
Finetuned:  e^1.121 = 3.07  (confused between ~3 words per step)
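A sketch of how such numbers can be computed from the average cross-entropy on validation text, assuming the usual HuggingFace causal-LM loss; scripts/evaluate.py may differ in details:

import math
import torch

def perplexity(model, tokenizer, texts, device="cuda", max_length=512):
    """Average the causal-LM cross-entropy over texts, then exponentiate."""
    model.eval()
    losses = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt",
                            truncation=True, max_length=max_length).to(device)
            out = model(**enc, labels=enc["input_ids"])  # HF shifts labels internally
            losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))           # perplexity = e^loss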

Projection Layers

Linear transformations inside attention that map input to different representation spaces:

q_proj → Query:  what am I looking for?
k_proj → Key:    what do I contain?
v_proj → Value:  what do I return if selected?

LoRA targets q_proj and v_proj, the most impactful layers for domain adaptation.

Gradient Accumulation

Simulates larger batch sizes on limited GPU memory:

batch_size=2 × gradient_accumulation=8 → effective batch = 16
Accumulate gradients for 16 samples before one weight update
Same result as batch_size=16 at 8× less memory
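A self-contained toy loop showing the mechanism (the project itself relies on HuggingFace Trainer to do this automatically):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [(torch.randn(2, 10), torch.randn(2, 1)) for _ in range(16)]  # micro-batches of 2

accumulation_steps = 8
optimizer.zero_grad()
for step, (x, y) in enumerate(batches):
    loss = nn.functional.mse_loss(model(x), y) / accumulation_steps  # scale so grads average
    loss.backward()                                                  # gradients add up in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                                             # one update per 16 samples
        optimizer.zero_grad()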

LoRA Adapter Initialization

A → random Gaussian initialization
B → zero initialization

At start: B×A = 0, model behaves exactly like the base model
As training progresses: B learns, corrections gradually applied

B is initialized to zero (and A to random values) so that B×A = 0 at the start while A is already non-zero; B therefore receives useful gradients from step 1. If both matrices started at zero, both gradients would stay zero and the adapter could never learn.


📦 Dataset

Source: viber1/indian-law-dataset

Property          Value
-----------------------------------------------
Total Samples     24,607
Train Split       22,146 (90%)
Validation Split  2,461 (10%)
Columns           Instruction, Response
Domain            Indian Supreme Court Judgements

Prompt Format:

### Instruction:
What is the meaning of anticipatory bail?

### Response:
Anticipatory bail is a direction to release a person on bail
issued even before the person is arrested...
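A rough sketch of the formatting step, assuming the dataset exposes the Instruction/Response columns listed above; the actual prepare_data.py may differ:

import json
from datasets import load_dataset

dataset = load_dataset("viber1/indian-law-dataset", split="train")
split = dataset.train_test_split(test_size=0.1, seed=42)   # 90/10 train/val split

def to_prompt(example):
    return {"text": f"### Instruction:\n{example['Instruction']}\n\n"
                    f"### Response:\n{example['Response']}"}

with open("data/processed/train.jsonl", "w", encoding="utf-8") as f:
    for row in split["train"].map(to_prompt):
        f.write(json.dumps({"text": row["text"]}, ensure_ascii=False) + "\n")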

πŸ—οΈ Model Architecture

Input → Tokenizer → Embeddings
                        ↓
            [Transformer Block × 22]
                        │
            ┌───────────┴───────────┐
            │                       │
        W×x (frozen NF4)        B×A×x (LoRA, trainable)
            │                       │
            └───────────┬───────────┘
                    Addition
                        ↓
                [Next Layer]
                        ↓
            Language Model Head
                        ↓
            Token Probabilities → Text

LoRA Config:

LoraConfig(r=8, lora_alpha=16,
           target_modules=["q_proj", "v_proj"],
           lora_dropout=0.05, bias="none",
           task_type="CAUSAL_LM")

🚀 Training Pipeline

HuggingFace Dataset
        ↓ prepare_data.py
JSONL train/val files
        ↓ train_collab.ipynb
Tokenize (max_length=512)
        ↓
Load TinyLlama in 4-bit NF4
        ↓
prepare_model_for_kbit_training()
        ↓
get_peft_model() → 0.57% trainable params
        ↓
HuggingFace Trainer (3 epochs, lr=2e-4, fp16)
Save checkpoint every 500 steps
        ↓
Best model → HuggingFace Hub
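The Trainer step could look roughly like the sketch below; tokenized_train is a placeholder name for the tokenized train split, and the notebook's exact arguments may differ:

from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

args = TrainingArguments(
    output_dir="lexra-checkpoints",
    num_train_epochs=3,
    learning_rate=2e-4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    fp16=True,
    save_steps=500,                  # checkpoint every 500 steps
)

trainer = Trainer(
    model=model,                     # PEFT-wrapped 4-bit model from the previous step
    args=args,
    train_dataset=tokenized_train,   # placeholder: tokenized train.jsonl (max_length=512)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.push_to_hub("aapnakaamkar/LexRA-TinyLlama-Legal")   # pushes the adapter weights only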

📈 Evaluation

python scripts/evaluate.py

Output:

Base Model      → Loss: 1.8740 | Perplexity: 6.5187
Finetuned Model → Loss: 1.1210 | Perplexity: 3.0745

Improvement: 52.84%
Saved to docs/perplexity_results.json

βš™οΈ Installation

git clone https://github.com/aapnakaamkar/LexRA.git
cd LexRA
python -m venv venv
venv\Scripts\activate        # Windows
source venv/bin/activate     # Linux / macOS
pip install -r requirements.txt

requirements.txt:

torch
transformers
peft
datasets
gradio
accelerate
bitsandbytes>=0.46.1
ipykernel

💻 Usage

Prepare data:

python scripts/prepare_data.py

Train (Colab):

  1. Upload train_collab.ipynb to Google Colab
  2. Runtime β†’ T4 GPU
  3. Run all cells

Evaluate:

python scripts/evaluate.py

Run app:

python app/app.py
# http://127.0.0.1:7860

Load model in code:

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from peft import PeftModel
import torch

# Load the base model, attach the LexRA adapter, then merge it into the weights
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                                              torch_dtype=torch.float32)
model = PeftModel.from_pretrained(model, "aapnakaamkar/LexRA-TinyLlama-Legal")
model = model.merge_and_unload()  # fold the ~50 MB adapter into the base weights

# Generate with the training-time prompt template and keep only the response part
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer,
                max_new_tokens=256, temperature=0.3, do_sample=True)
prompt = "### Instruction:\nWhat is bail?\n\n### Response:\n"
print(pipe(prompt)[0]["generated_text"].split("### Response:\n")[-1])
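For the Space, a minimal Gradio wrapper around that pipeline could look like this (a sketch only; the real app/app.py also renders a side-by-side base-model comparison):

import gradio as gr

def answer(question):
    prompt = f"### Instruction:\n{question}\n\n### Response:\n"
    output = pipe(prompt)[0]["generated_text"]      # `pipe` is the pipeline built above
    return output.split("### Response:\n")[-1].strip()

demo = gr.Interface(fn=answer,
                    inputs=gr.Textbox(label="Legal question"),
                    outputs=gr.Textbox(label="LexRA answer"),
                    title="LexRA")

if __name__ == "__main__":
    demo.launch()   # serves the app on http://127.0.0.1:7860 locally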

🌐 Deployment

Upload app.py + requirements.txt to HuggingFace Space
                    ↓
HF detects Gradio SDK → builds Docker container
                    ↓
Installs requirements → runs app.py
                    ↓
Serves public HTTPS URL (auto-rebuilds on file change)

URL: https://huggingface.co/spaces/aapnakaamkar/LexRA


πŸ† Resume Highlights

  • Fine-tuned TinyLlama 1.1B on 22,000+ Indian Supreme Court judgements using LoRA adapters, reducing trainable parameters by 94% from 1.1B to 6.3M
  • Implemented 4-bit NF4 quantization via BitsAndBytes reducing model memory from 4.4GB to 550MB enabling training on free-tier T4 GPU
  • Engineered data preprocessing pipeline to format 24,000+ raw legal records into instruction-response pairs for supervised finetuning
  • Achieved 52.84% perplexity reduction (6.51 β†’ 3.07) on held-out Indian legal text demonstrating successful domain adaptation
  • Deployed interactive Gradio interface on HuggingFace Spaces showcasing side-by-side base vs finetuned model comparison on Indian legal queries

📖 References

Resource                             Link
----------------------------------------------------------------------
QLoRA Paper (Dettmers et al., 2023)  arxiv.org/abs/2305.14314
LoRA Paper (Hu et al., 2021)         arxiv.org/abs/2106.09685
TinyLlama                            huggingface.co/TinyLlama
Dataset                              viber1/indian-law-dataset
PEFT Library                         github.com/huggingface/peft
BitsAndBytes                         github.com/TimDettmers/bitsandbytes
HuggingFace Transformers             github.com/huggingface/transformers
Attention Is All You Need            arxiv.org/abs/1706.03762

📄 License

MIT License. See LICENSE for details.


🙋 Author

Kushagra Bhargava


Built with QLoRA, PEFT, and the HuggingFace ecosystem
