- SafeTensors Support: Secure model serialization with automatic sharding for large models
- HuggingFace Integration: Use any pretrained tokenizer via `HFTokenizerWrapper`
- Accelerate Support: Distributed training with `use_accelerate=True`
- LoRA/PEFT: Parameter-efficient fine-tuning with `use_peft=True`
- Backward Compatible: Existing PyTorch models continue to work
- Custom Transformer Implementation: Multi-head attention, feed-forward networks, positional encodings
- SafeTensors Integration: Secure model serialization with automatic sharding
- Modular Design: Easy to extend and customize for research and production
- BPE Tokenizer: From-scratch BPE with Unicode and emoji support
- HuggingFace Integration: Use any pretrained tokenizer (Mistral, Llama, GPT-2, etc.)
- WordPiece Support: Alternative tokenization strategies
- HuggingFace Datasets: Efficient loading with preprocessing and batching
- Memory Optimization: Smart sequence packing and data streaming
- Multi-Processing: Parallel data preprocessing for faster training
- CPU/GPU Support: Optimized configurations for both CPU and GPU training
- Distributed Training: Multi-GPU support via Accelerate and DeepSpeed
- Parameter-Efficient: LoRA/PEFT adapters for memory-efficient fine-tuning
- Mixed Precision: FP16/BF16 automatic mixed precision
- Multiple Decoding Strategies: Greedy, beam search, nucleus (top-p), top-k sampling
- TensorBoard Integration: Real-time training metrics and visualizations
- Weights & Biases: Experiment tracking and hyperparameter optimization
- Comprehensive Metrics: Perplexity, cross-entropy loss, generation quality
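The decoding strategies above share one pattern: restrict the next-token distribution, renormalize, and sample from what remains. A framework-independent sketch of top-k and nucleus (top-p) filtering (the function name and signature are illustrative, not this library's API):

```python
import math

def top_k_top_p_filter(logits, k=0, p=1.0):
    """Keep the top-k logits and/or the smallest set of tokens whose
    cumulative probability reaches p, then renormalize. A sketch of
    the idea, not the framework's actual implementation."""
    probs = [math.exp(x) for x in logits]
    total = sum(probs)
    probs = [q / total for q in probs]
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep, cum = [], 0.0
    for i in order:
        if k and len(keep) >= k:
            break
        keep.append(i)
        cum += probs[i]
        if cum >= p:
            break
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}

# k=2 keeps only the two most likely tokens; p=0.6 keeps the nucleus
print(top_k_top_p_filter([2.0, 1.0, 0.1], k=2))
```

Greedy decoding is the `k=1` special case; beam search instead tracks several partial sequences at once and is not captured by per-step filtering.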
- Python 3.8 or higher
- PyTorch 2.0 or higher
- GPU: CUDA-compatible GPU (recommended) or CPU-only mode
- Memory: 8GB RAM minimum (16GB+ recommended)
```bash
git clone https://github.com/HelpingAI/llm-trainer.git
cd llm-trainer
pip install -e .

# Development tools
pip install -e ".[dev]"

# SafeTensors support (recommended)
pip install -e ".[safetensors]"

# Distributed training
pip install -e ".[distributed]"

# All features
pip install -e ".[full]"
```
```python
from llm_trainer.config import ModelConfig, TrainingConfig, DataConfig
from llm_trainer.models import TransformerLM
from llm_trainer.tokenizer import BPETokenizer
from llm_trainer.training import Trainer

# Create and train tokenizer
tokenizer = BPETokenizer()
tokenizer.train_from_dataset(
    dataset_name="wikitext",
    dataset_config="wikitext-2-raw-v1",
    vocab_size=32000,
)

# Configure model
model_config = ModelConfig(
    vocab_size=tokenizer.vocab_size,
    d_model=512,
    n_heads=8,
    n_layers=6,
    max_seq_len=1024,
)

# Create model
model = TransformerLM(model_config)

# Configure training
training_config = TrainingConfig(
    batch_size=16,
    learning_rate=1e-4,
    num_epochs=3,
    warmup_steps=1000,
    checkpoint_dir="./checkpoints",
)

# Configure data
data_config = DataConfig(
    dataset_name="wikitext",
    dataset_config="wikitext-2-raw-v1",
    max_length=1024,
)

# Train the model
trainer = Trainer(model, tokenizer, training_config)
trainer.train_from_config(model_config, data_config)
```
```python
from llm_trainer.config import TrainingConfig
from llm_trainer.models import HuggingFaceModelWrapper
from llm_trainer.tokenizer import HFTokenizerWrapper
from llm_trainer.training import Trainer

# Load pretrained tokenizer and model
tokenizer = HFTokenizerWrapper("microsoft/DialoGPT-medium")
model = HuggingFaceModelWrapper("microsoft/DialoGPT-medium")

# Configure PEFT training
training_config = TrainingConfig(
    use_accelerate=True,
    use_peft=True,
    peft_type="lora",
    peft_r=8,
    peft_alpha=16,
)

# model_config and data_config as in the quick-start example above
trainer = Trainer(model, tokenizer, training_config)
trainer.train_from_config(model_config, data_config)
```
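The `peft_r=8` and `peft_alpha=16` settings map onto the standard LoRA formulation: the frozen weight `W` is augmented with a low-rank update `(alpha / r) * B @ A`, so only the small `A` and `B` matrices are trained. A framework-independent sketch of the math (not the PEFT library's internals; dimensions are illustrative):

```python
import numpy as np

d_in, d_out, r, alpha = 768, 768, 8, 16

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection (init 0)

def lora_forward(x):
    # y = W x + (alpha / r) * B A x
    return W @ x + (alpha / r) * (B @ (A @ x))

# With B initialized to zero, LoRA starts as an exact no-op:
x = rng.standard_normal(d_in)
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters vs. full fine-tuning:
full = W.size
lora = A.size + B.size
print(f"LoRA trains {lora} of {full} params ({100 * lora / full:.1f}%)")
```

This is why LoRA is memory-efficient: for this layer only about 2% of the parameters receive gradients and optimizer state.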
```bash
# GPU training
python scripts/train.py --config configs/small_model.yaml --output_dir ./output

# CPU training (no GPU required)
python scripts/train.py --config configs/cpu_small_model.yaml --output_dir ./output

# Text generation
python scripts/generate.py --model_path ./output --prompts "The quick brown fox" --interactive

# Model evaluation
python scripts/evaluate.py --model_path ./output --dataset_config configs/eval_config.json
```
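Of the reported metrics, perplexity is simply the exponential of the mean per-token cross-entropy, so the two always move together. A minimal illustration (the NLL values are made up):

```python
import math

# Per-token negative log-likelihoods from a hypothetical eval run
token_nlls = [2.1, 3.4, 1.8, 2.7]

cross_entropy = sum(token_nlls) / len(token_nlls)  # mean NLL in nats
perplexity = math.exp(cross_entropy)

print(f"cross-entropy: {cross_entropy:.2f} nats, perplexity: {perplexity:.1f}")
```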
The framework uses YAML/JSON configuration files for reproducible experiments:
```yaml
model:
  d_model: 512
  n_heads: 8
  n_layers: 6
  vocab_size: 32000
  max_seq_len: 1024

training:
  batch_size: 16
  learning_rate: 1e-4
  num_epochs: 3
  use_amp: true
  gradient_accumulation_steps: 4
```
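With gradient accumulation, each optimizer step averages gradients over several micro-batches, so the effective batch size is larger than `batch_size`. A quick check for the config above, assuming a single device:

```python
batch_size = 16                  # per-device micro-batch (from the config above)
gradient_accumulation_steps = 4
num_devices = 1                  # assumption: single-GPU run

effective_batch_size = batch_size * gradient_accumulation_steps * num_devices
print(effective_batch_size)  # 64
```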
A CPU-optimized configuration trades model size for memory headroom:

```yaml
device: "cpu"

model:
  d_model: 256
  n_heads: 4
  n_layers: 4
  max_seq_len: 512

training:
  batch_size: 2
  use_amp: false
  gradient_accumulation_steps: 8
  dataloader_num_workers: 2
```
An advanced configuration enabling Accelerate, LoRA, and SafeTensors:

```yaml
model:
  d_model: 768
  n_heads: 12
  n_layers: 12

training:
  use_accelerate: true
  accelerate_mixed_precision: "fp16"
  use_peft: true
  peft_type: "lora"
  peft_r: 8
  peft_alpha: 16

  # SafeTensors settings
  save_format: "safetensors"
  max_shard_size: "2GB"
```
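These files can be inspected with any YAML loader; a quick sketch using PyYAML, independent of the framework's own config parsing:

```python
import yaml  # pip install pyyaml

raw = """
model:
  d_model: 512
  n_heads: 8
training:
  batch_size: 16
  learning_rate: 1.0e-4
"""

cfg = yaml.safe_load(raw)
# note: with PyYAML, write floats as 1.0e-4 -- a bare 1e-4 parses as a string
print(cfg["model"]["d_model"], cfg["training"]["batch_size"])
```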
```
llm-trainer/
├── src/llm_trainer/              # Main package
│   ├── models/                   # Model architectures
│   │   ├── base_model.py         # Base model interface
│   │   ├── transformer.py        # Custom Transformer implementation
│   │   ├── safetensors_utils.py  # SafeTensors utilities
│   │   └── attention.py          # Attention mechanisms
│   ├── tokenizer/                # Tokenization
│   │   ├── bpe_tokenizer.py      # BPE implementation
│   │   ├── hf_tokenizer.py       # HuggingFace wrapper
│   │   └── wordpiece_tokenizer.py  # WordPiece implementation
│   ├── data/                     # Data pipeline
│   │   ├── dataset.py            # Dataset classes
│   │   ├── dataloader.py         # Data loading
│   │   └── preprocessing.py      # Data preprocessing
│   ├── training/                 # Training infrastructure
│   │   ├── trainer.py            # Main training logic
│   │   ├── optimizer.py          # Optimizers
│   │   └── scheduler.py          # Learning rate schedulers
│   ├── utils/                    # Utilities
│   │   ├── generation.py         # Text generation
│   │   ├── inference.py          # Inference utilities
│   │   └── metrics.py            # Evaluation metrics
│   └── config/                   # Configuration
│       ├── model_config.py       # Model configuration
│       ├── training_config.py    # Training configuration
│       └── data_config.py        # Data configuration
├── scripts/                      # CLI tools
│   ├── train.py                  # Training script
│   ├── generate.py               # Text generation
│   └── evaluate.py               # Model evaluation
├── configs/                      # Pre-configured setups
│   ├── small_model.yaml          # Small GPU model
│   ├── medium_model.yaml         # Medium GPU model
│   ├── cpu_small_model.yaml      # CPU-optimized small
│   └── cpu_medium_model.yaml     # CPU-optimized medium
├── examples/                     # Usage examples
│   ├── complete_pipeline.py      # End-to-end example
│   ├── safetensors_example.py    # SafeTensors demo
│   └── train_small_model.py      # Quick start example
└── docs/                         # Documentation
```
- Getting Started Guide — Complete setup and first steps
- Model Architecture — Transformer implementation details
- Training Guide — Comprehensive training tutorial
- CPU Training Guide — Dedicated CPU training documentation
- Tokenizer Details — BPE tokenizer documentation
- API Reference — Complete API documentation
```bash
pip install -e ".[dev]"

# Run tests
pytest tests/

# Format and lint
black src/ scripts/ examples/
flake8 src/ scripts/ examples/
mypy src/
```
We welcome contributions! Please see our Contributing Guidelines for details.
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
This project is licensed under the Apache License 2.0; see the LICENSE file for details.
- Bug Reports: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: Read the Docs