A lightweight, resource-efficient Large Language Model (LLM) with a command-line chatbot interface and RAG (Retrieval-Augmented Generation) capabilities. Built with PyTorch and optimized for minimal hardware requirements without sacrificing accuracy.
- Efficient Architecture: 168M parameters with Grouped-Query Attention (GQA)
- Memory Optimized: Mixed precision training (FP16), gradient accumulation
- Low Resource Usage: Runs on 4GB+ GPU or CPU
- RAG Support: Answer questions based on your documents (PDF, TXT, DOCX)
- Multiple Datasets: OpenAssistant, Dolly, Alpaca, TinyStories, Code datasets
- Advanced Components:
  - Rotary Positional Embeddings (RoPE)
  - SwiGLU activation
  - RMSNorm layer normalization
  - Flash Attention support
- Interactive CLI: Simple command-line chatbot interface with document Q&A
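For a concrete sense of the Grouped-Query Attention listed above, here is a toy sketch of the idea only; the real implementation lives in LLM_architecture_168.py and may differ. Shapes mirror config.json (dim=1024, 16 heads, 4 KV heads).

```python
# Toy sketch of Grouped-Query Attention: 16 query heads share 4 KV heads,
# so the KV cache is 4x smaller than standard multi-head attention.
# Illustrative only -- not the project's actual code.
import torch
import torch.nn.functional as F

n_heads, n_kv_heads, head_dim, seq_len = 16, 4, 64, 128
q = torch.randn(1, n_heads, seq_len, head_dim)      # (batch, heads, seq, head_dim)
k = torch.randn(1, n_kv_heads, seq_len, head_dim)
v = torch.randn(1, n_kv_heads, seq_len, head_dim)

# Each KV head serves n_heads // n_kv_heads = 4 query heads.
k = k.repeat_interleave(n_heads // n_kv_heads, dim=1)
v = v.repeat_interleave(n_heads // n_kv_heads, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 16, 128, 64])
```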
```bash
pip install -r requirements.txt
```
Optional (for better RAG):
```bash
pip install sentence-transformers PyPDF2 python-docx
```
Train the model:
```bash
python train.py
```
Training now uses OpenAssistant (high-quality conversations) by default.
Training takes ~2-3 hours on GTX 1650 (4GB).
Regular Chat:
```bash
python chatbot.py
```
RAG Chat (Document Q&A):
```bash
python chatbot_rag.py
```
With Document Pre-loaded:
```bash
python chatbot_rag.py --document sample_document.txt
```

```
Custom_LLM/
├── LLM_architecture_168.py   # Core model architecture
├── train.py                   # Training script (OpenAssistant)
├── chatbot.py                 # Simple CLI chatbot
├── chatbot_rag.py             # RAG-enabled chatbot ⭐ NEW
├── rag_pipeline.py            # RAG implementation ⭐ NEW
├── tokenizer_utils.py         # Tokenization utilities
├── data_loader.py             # Multi-dataset loader (updated)
├── config.json                # Model configuration (GTX 1650 optimized)
├── requirements.txt           # Python dependencies
├── sample_document.txt        # Example document for RAG
├── test_rag.py                # RAG test script
├── quick_start.md             # Detailed setup guide
├── RAG_GUIDE.md               # RAG usage guide ⭐ NEW
├── TRAINING_WITH_RAG.md       # Complete implementation guide ⭐ NEW
└── checkpoints/               # Saved model checkpoints
```
Minimum:
- GPU: 4GB VRAM (GTX 1650, GTX 1050 Ti) ⭐ Optimized
- RAM: 8GB
- Storage: 5GB
Recommended:
- GPU: 6GB+ VRAM (GTX 1660, RTX 3050)
- RAM: 16GB
- Storage: 10GB
CPU only: supported but slower (training may take 6-8 hours).
The model is configured in config.json:
```json
{
  "vocab_size": 50304,
  "dim": 1024,                        // Model dimension
  "n_layers": 12,                     // Transformer layers
  "n_heads": 16,                      // Attention heads
  "n_kv_heads": 4,                    // KV heads for GQA
  "hidden_dim": 2816,                 // FFN hidden size
  "max_seq_len": 512,                 // Max sequence length
  "batch_size": 4,
  "gradient_accumulation_steps": 8,
  "mixed_precision": true
}
```
- Mixed Precision (FP16): Reduces memory by 50%
- Gradient Accumulation: Simulates larger batches
- Weight Tying: Shares embedding weights
- Efficient Attention: Grouped-Query Attention
- AdamW optimizer with weight decay
- Cosine learning rate schedule with warmup
- Gradient clipping for stability
- Automatic checkpointing
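A condensed sketch of how these memory and training techniques fit together (illustrative only; `model` and `train_loader` are assumed to exist, the learning rate and step counts are assumptions, and the actual loop lives in train.py):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

accum_steps = 8                                    # gradient_accumulation_steps (config.json)
scaler = torch.cuda.amp.GradScaler()               # FP16 loss scaling (mixed precision)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=1_000, num_training_steps=50_000)

for step, (inputs, targets) in enumerate(train_loader):
    with torch.cuda.amp.autocast():                # FP16 forward pass
        loss = model(inputs, targets) / accum_steps
    scaler.scale(loss).backward()                  # accumulate scaled gradients
    if (step + 1) % accum_steps == 0:              # simulate a batch 8x larger
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
        scheduler.step()
```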
You: Hello!
Bot: Hi! How can I help you today?
Commands:
- 'quit' or 'exit': Exit the chatbot
- 'clear': Clear conversation history
- 'history': Show conversation history

You: add sample_document.txt
✓ Document added: sample_document.txt
You: What is machine learning?
Bot: Based on the document, machine learning is a subset of AI...
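Under the hood, the RAG chatbot retrieves the document chunks most similar to the question and prepends them to the prompt. A rough sketch of that lookup, assuming the optional sentence-transformers install (the actual logic is in rag_pipeline.py and may differ):

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # small embedding model (assumption)

# Split the document into chunks and embed them once.
chunks = open("sample_document.txt", encoding="utf-8").read().split("\n\n")
chunk_embeddings = embedder.encode(chunks, convert_to_tensor=True)

# Embed the question and retrieve the top matching chunks.
question = "What is machine learning?"
query_embedding = embedder.encode(question, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, chunk_embeddings, top_k=3)[0]

# Ground the model's answer in the retrieved context.
context = "\n".join(chunks[hit["corpus_id"]] for hit in hits)
prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```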
Commands:
- 'add <filepath>': Add document to knowledge base
- 'docs': List loaded documents
- 'save kb': Save knowledge base
- 'load kb': Load knowledge base
- 'quit', 'clear', 'history': Same as regular chatbot

Adjust generation settings in chatbot.py:
- temperature: 0.7-1.0 (higher = more creative)
- top_k: 40-50 (smaller = more focused)
- top_p: 0.9-0.95 (nucleus sampling)
- max_new_tokens: 50-200 (response length)
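For example, a tweaked call might look like this (keyword names mirror the settings above; whether Chatbot.generate accepts them directly is an assumption, so check chatbot.py):

```python
from chatbot import Chatbot

bot = Chatbot(model_path='checkpoints/best_model.pt')
response = bot.generate(
    "Explain grouped-query attention in one sentence.",
    temperature=0.8,        # higher = more creative
    top_k=40,               # smaller = more focused
    top_p=0.9,              # nucleus sampling
    max_new_tokens=150,     # cap on response length
)
print(response)
```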
OpenAssistant (default):
- High-quality human conversations
- 161K samples
- Instruction-following format
- Best for chatbots
Alternative datasets:
- Dolly-15k: Instruction following (fast training)
- Alpaca: Q&A format (good quality)
- TinyStories: Simple text (basic testing)
- Code Search Net: Python code (code tasks)
```python
# In train.py, change dataset_name
dataset_name = 'oasst'             # OpenAssistant (default)
dataset_name = 'dolly'             # Dolly-15k
dataset_name = 'alpaca'            # Alpaca
dataset_name = 'code_search_net'   # Code Search Net
```
Add your own training data in data_loader.py.
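As a starting point, a custom loader could look like the sketch below (the function name and JSONL format are hypothetical, not the project's actual API; adapt it to the pattern used by the existing loaders in data_loader.py):

```python
import json

def load_my_dataset(path="my_data.jsonl", max_samples=None):
    """Return (prompt, response) pairs from a JSON-lines file -- hypothetical example."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            samples.append((record["prompt"], record["response"]))
            if max_samples and len(samples) >= max_samples:
                break
    return samples
```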
Out of memory:
- Reduce batch_size to 2 in config.json
- Reduce max_seq_len to 256
- Reduce model size (fewer layers/smaller dim)
Poor response quality:
- Train longer (more epochs)
- Use more training data
- Adjust generation temperature
```bash
# Install PyTorch first
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Then install other dependencies
pip install transformers datasets tqdm accelerate
```
- GPU (RTX 3060): ~40 min for 3 epochs (10K samples)
- GPU (GTX 1060): ~60 min for 3 epochs
- CPU: 4-6 hours
- Parameters: 168M (0.17B)
- Disk size: ~650MB (FP32), ~325MB (FP16)
- Memory usage: ~1-2GB during inference
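(These disk sizes follow from the parameter count: 168M parameters × 4 bytes ≈ 640 MiB in FP32, halved to ≈ 320 MiB in FP16.)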
Resume training from a checkpoint:
```bash
python train.py --checkpoint checkpoints/best_model.pt
```
Use the model from Python:
```python
from chatbot import Chatbot

bot = Chatbot(model_path='checkpoints/best_model.pt')
response = bot.generate("Your prompt here")
```
```python
# Save only weights
torch.save(model.state_dict(), 'model_weights.pt')
```
- TRAINING_WITH_RAG.md - Complete implementation guide (START HERE) ⭐
- RAG_GUIDE.md - RAG usage and best practices
- quick_start.md - Detailed setup instructions
- IMPLEMENTATION_GUIDE.md - Architecture details
- QUICK_REFERENCE.txt - Command reference card
Feel free to submit issues and enhancement requests!
See LICENSE file for details.
- Architecture inspired by modern LLMs (Llama, Mistral)
- Optimizations from Flash Attention and efficient transformers research
- Training techniques from various open-source projects
Built with ❤️ for efficient, custom LLMs