CalcGPT: Transformer-Based Arithmetic Language Models

Python 3.8+ | PyTorch | HuggingFace | License: MIT

CalcGPT is a comprehensive framework for building, training, and deploying transformer-based language models specialized in arithmetic operations. It demonstrates how to create domain-specific language models from scratch using modern deep learning techniques.

🌟 Features

🛠️ Dual Interface Design

  • 📚 Python Library (lib/): Professional programmatic API for integration
  • 🖥️ CLI Tools: User-friendly command-line interfaces for interactive usage

🧮 Complete ML Pipeline

  • Dataset Generation: Intelligent arithmetic dataset creation with parameter encoding
  • Dual Tokenization: Character-level and number-level (0-99) tokenization modes
  • Model Training: Advanced transformer training with automatic naming conventions
  • Model Evaluation: Comprehensive assessment across multiple test types
  • Production Inference: High-performance model serving and batch processing
  • Comprehensive Logging: High-traceability logging system for debugging and monitoring

🏗️ Professional Architecture

  • Modular Design: Clean separation of concerns with reusable components
  • Configuration Management: Type-safe dataclass configurations
  • Error Handling: Robust error handling and validation throughout
  • Documentation: Comprehensive inline documentation and examples

📊 Advanced Features

  • High-Traceability Logging: Component-specific logs with timestamps, thread IDs, and performance monitoring
  • Data Augmentation: Automatic commutative property expansion
  • Intelligent Naming: Models auto-named with architecture and training parameters
  • Multi-format Output: Support for JSON, plain text, and structured outputs
  • Device Optimization: Automatic GPU/MPS/CPU detection and optimization (a typical detection pattern is sketched below)
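
The device detection feature follows a pattern common to PyTorch projects. A minimal sketch of what such detection typically looks like (an illustration of the general pattern, not necessarily CalcGPT's internal code):

import torch

def pick_device() -> torch.device:
    """Prefer CUDA, then Apple Silicon MPS, then fall back to CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")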

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/yourusername/calcgpt.git
cd calcgpt

# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

30-Second Demo

# 1. Generate a dataset
python calcgpt_dategen.py -m 10 --max-expressions 100

# 2. Train a model
python calcgpt_train.py --epochs 5 --verbose

# 3. Test the model
python calcgpt.py -i

# 4. Check logs for detailed traceability
ls logs/  # calcgpt.log, train.log, etc.

📖 Usage Guide

🖥️ CLI Tools

Dataset Generation

# Basic dataset (0-10, addition/subtraction)
python calcgpt_dategen.py -m 10

# Large dataset (0-100, all operations)
python calcgpt_dategen.py -m 100 --verbose

# Custom dataset (0-50, addition only, limited)
python calcgpt_dategen.py -m 50 --no-subtraction --max-expressions 1000

Model Training

# Quick training with defaults
python calcgpt_train.py --epochs 10

# Production training with custom architecture
python calcgpt_train.py \
    --embedding-dim 256 \
    --num-layers 8 \
    --num-heads 16 \
    --epochs 50 \
    --batch-size 16 \
    --learning-rate 1e-4

# Training with validation and checkpoints
python calcgpt_train.py \
    --epochs 100 \
    --test-split 0.2 \
    --save-steps 500 \
    --verbose

Model Evaluation

# Quick evaluation
python calcgpt_eval.py --sample 100

# Comprehensive evaluation
python calcgpt_eval.py \
    --sample 1000 \
    --max-tokens 20 \
    --verbose

# Evaluate specific model
python calcgpt_eval.py \
    -m models/calcgpt_emb128_lay6_head8_ep50_bs16_lr1e4_ds15k \
    --dataset datasets/test_set.txt

Interactive Inference

# Interactive mode
python calcgpt.py -i

# Batch processing
python calcgpt.py -b "25+25" "100-33" "67+12"

# File processing with JSON output
python calcgpt.py -f problems.txt -o results.json --format json

# Custom model and parameters
python calcgpt.py \
    -m models/my_model \
    --temperature 0.0 \
    --max-tokens 15 \
    -b "99+1" "50-25"

# Note: Tokenization mode is determined by the trained model
# Use character mode for learning, number mode for production

📚 Python Library

Dataset Generation

from lib import DatasetGenerator, DatagenConfig

# Create configuration
config = DatagenConfig(
    max_value=100,
    operations=['addition', 'subtraction'],
    max_expressions=10000,
    verbose=True
)

# Generate dataset
generator = DatasetGenerator(config)
dataset_path = generator.generate()

# Analyze dataset
dataset = generator.load_dataset(dataset_path)
analysis = generator.analyze_dataset(dataset)
print(f"Generated {len(dataset)} examples")
print(f"Vocabulary: {analysis['vocabulary']}")

Tokenization Modes

from lib import CalcGPTTokenizer

# Character-level tokenization (default)
examples = ['1+1=2', '12+34=46', '99-50=49']
char_tokenizer = CalcGPTTokenizer(examples, mode='char')
print(f"Character mode - Vocab size: {char_tokenizer.vocab_size}")

# Number-level tokenization (0-99 as single tokens)
num_tokenizer = CalcGPTTokenizer(examples, mode='number')
print(f"Number mode - Vocab size: {num_tokenizer.vocab_size}")

# Compare tokenization
text = "12+34=46"
char_tokens = char_tokenizer.encode(text)  # [1,2,+,3,4,=,4,6] - 8 tokens
num_tokens = num_tokenizer.encode(text)    # [12,+,34,=,46] - 5 tokens

# Load from dataset with mode selection
tokenizer = CalcGPTTokenizer.from_dataset(mode='number')
info = tokenizer.get_vocab_info()
print(f"Mode: {info['mode']}, Numbers: {info['numbers_count']}")

Model Training

from lib import CalcGPTTrainer, TrainingConfig
from pathlib import Path

# Training configuration
config = TrainingConfig(
    epochs=20,
    batch_size=8,
    learning_rate=1e-3,
    embedding_dim=128,
    num_layers=6,
    num_heads=8,
    test_split=0.2,
    verbose=True
)

# Train model
trainer = CalcGPTTrainer(
    config=config,
    dataset_path="datasets/my_dataset.txt",
    output_dir=Path("models/my_calcgpt"),
    verbose=True
)

results = trainer.train()
print(f"Final loss: {results['training_loss']:.4f}")
print(f"Model parameters: {results['model_params']:,}")

Model Evaluation

from lib import CalcGPTEvaluator, EvaluationConfig

# Evaluation configuration
config = EvaluationConfig(
    sample_size=500,
    max_tokens=15,
    verbose=True
)

# Evaluate model
evaluator = CalcGPTEvaluator(
    config=config,
    model_path="models/my_calcgpt",
    dataset_path="datasets/test_set.txt"
)

results = evaluator.evaluate()
print(f"Overall accuracy: {results['accuracy_stats']['overall']:.1%}")
print(f"Arithmetic correctness: {results['accuracy_stats']['arithmetic']:.1%}")

Model Inference

from lib import CalcGPT, InferenceConfig

# Inference configuration
config = InferenceConfig(
    temperature=0.0,
    max_tokens=10,
    verbose=False
)

# Load model
model = CalcGPT(
    config=config,
    model_path="models/my_calcgpt"
)

# Generate predictions
result = model.generate("25+25=")
print(f"Prediction: {result['completion']}")

# Batch processing
problems = ["10+5=", "20-7=", "99+1="]
for problem in problems:
    result = model.generate(problem)
    print(f"{problem} -> {result['completion']}")

Comprehensive Logging

from lib.logger import setup_logging, get_logger, log_step, log_metric, log_performance, log_function

# Setup logging system
setup_logging(
    logs_dir="logs",
    console_level="INFO",  # Console output level
    file_level="DEBUG"     # File output level (more detailed)
)

# Get component-specific loggers
train_logger = get_logger('train')
inference_logger = get_logger('inference')

# Basic logging
train_logger.info("Starting training process")
inference_logger.warning("Model accuracy below threshold")

# Structured logging with convenience functions
log_step("Epoch 1 completed", 'train')
log_metric("accuracy", 0.95, 'train')

# Performance monitoring with decorators
@log_performance('model_training', 'train')
def train_model():
    # Training code here
    return {"loss": 0.25}

# Function tracing
@log_function('inference', log_args=True, log_result=True)
def predict(input_data):
    return f"prediction for {input_data}"

# Automatic component-specific log files:
# - logs/calcgpt.log (main log)
# - logs/train.log (training-specific)
# - logs/inference.log (inference-specific)

🏗️ Architecture Overview

Project Structure

calcgpt/
├── lib/                         # Core library package
│   ├── __init__.py              # Unified exports
│   ├── datagen.py               # Dataset generation
│   ├── tokenizer.py             # Dual-mode tokenization system
│   ├── train.py                 # Model training
│   ├── inference.py             # Model inference
│   ├── evaluation.py            # Model evaluation
│   ├── logger.py                # Comprehensive logging system
│   └── README.md                # Library documentation
├── examples/                    # Example scripts
│   └── complete_workflow.py     # Complete end-to-end example
├── calcgpt_dategen.py           # Dataset generation CLI
├── calcgpt_train.py             # Model training CLI
├── calcgpt_eval.py              # Model evaluation CLI
├── calcgpt.py                   # Interactive inference CLI
├── calcgpt.ipynb                # Comprehensive tutorial notebook
├── datasets/                    # Generated datasets
├── models/                      # Trained models
├── requirements.txt             # Python dependencies
└── README.md                    # This file

Core Components

🎯 DatasetGenerator

  • Generates systematic arithmetic datasets
  • Supports multiple operations (addition, subtraction)
  • Intelligent filename encoding with parameters
  • Built-in data augmentation (commutative property)
  • Comprehensive dataset analysis

🔤 CalcGPTTokenizer

  • Dual tokenization modes: character-level and number-level (0-99)
  • Character mode: Individual characters as tokens (efficient vocab)
  • Number mode: Whole numbers as tokens (semantic understanding)
  • Automatic mode selection and vocabulary optimization
  • Simplified, focused API for arithmetic expressions

🏋️ CalcGPTTrainer

  • Advanced transformer model training
  • Automatic architecture optimization
  • Intelligent model naming based on configuration
  • Built-in validation and checkpointing
  • Comprehensive training metrics and testing

🔍 CalcGPTEvaluator

  • Multi-dimensional model assessment
  • Three test types: first_operand, expression_complete, answer_complete (one plausible prompt split is sketched after this list)
  • Format validation and arithmetic correctness checking
  • Performance timing analysis
  • Detailed statistical reporting
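
The test-type names suggest prompts truncated at different points of an expression. One plausible reading, for the example '12+34=46' (an assumption, not taken from the evaluator's source):

# Hypothetical prompt splits for '12+34=46':
prompts = {
    "first_operand": "12",           # model must continue '+34=46'
    "expression_complete": "12+34",  # model must continue '=46'
    "answer_complete": "12+34=",     # model must produce '46'
}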

🚀 CalcGPT

  • High-performance model inference
  • Temperature-controlled generation
  • Batch processing capabilities
  • Multiple output formats
  • Production-ready error handling

📝 CalcGPTLogger

  • Comprehensive logging system with high traceability
  • Component-specific log files (train.log, inference.log, etc.)
  • Colored console output with different levels
  • Detailed file logging with timestamps, thread IDs, and module info
  • Performance monitoring decorators and convenience functions
  • Automatic log rotation and configurable levels

📊 Examples & Tutorials

🎓 Interactive Tutorial

The calcgpt.ipynb notebook provides a comprehensive, step-by-step tutorial covering:

  • Transformer Architecture: Understanding GPT-2 models and attention mechanisms
  • Dataset Engineering: Creating and analyzing training datasets
  • Model Training: From tiny models (38K params) to production (1.2M+ params)
  • Evaluation Methodologies: Comprehensive model assessment
  • Production Deployment: Real-world inference and usage patterns
  • Library Integration: Using both programmatic and CLI interfaces

# Launch the tutorial
jupyter notebook calcgpt.ipynb

💡 Example Scripts

Explore the examples/ directory for practical usage demonstrations:

# Run complete end-to-end workflow example
python examples/complete_workflow.py

This comprehensive example demonstrates:

  • Dataset generation with custom configurations
  • Model training with validation
  • Model evaluation with detailed metrics
  • Interactive inference and testing
  • Complete workflow from data to deployment

🎮 End-to-End Workflow

from lib import *
from lib.logger import setup_logging
from pathlib import Path

# 0. Setup logging (optional but recommended)
setup_logging(console_level="INFO", file_level="DEBUG")

# 1. Generate dataset
dataset_config = DatagenConfig(max_value=50, max_expressions=5000)
generator = DatasetGenerator(dataset_config)
dataset_path = generator.generate()

# 2. Train model with number-level tokenization
train_config = TrainingConfig(epochs=20, embedding_dim=128, num_layers=4)
trainer = CalcGPTTrainer(train_config, dataset_path, Path("models/demo"))
results = trainer.train()

# 3. Evaluate model
eval_config = EvaluationConfig(sample_size=200)
evaluator = CalcGPTEvaluator(eval_config, "models/demo", dataset_path)
eval_results = evaluator.evaluate()

# 4. Use for inference
inference_config = InferenceConfig(temperature=0.0)
model = CalcGPT(inference_config, "models/demo")
prediction = model.generate("25+25=")
print(f"25+25 = {prediction['completion']}")

# 5. Check logs for detailed traceability
# See logs/calcgpt.log, logs/train.log, logs/inference.log

🔧 Advanced Configuration

Tokenization Mode Selection

# Character-level tokenization (default) - smaller vocab, longer sequences
CalcGPTTokenizer(examples, mode='char')     # ~15-token vocabulary
CalcGPTTokenizer.from_dataset(mode='char')

# Number-level tokenization - larger vocab, shorter sequences
CalcGPTTokenizer(examples, mode='number')   # ~105-token vocabulary
CalcGPTTokenizer.from_dataset(mode='number')

# Performance comparison for "12+34=46":
# Character mode: 8 tokens [1,2,+,3,4,=,4,6] 
# Number mode:   5 tokens [12,+,34,=,46]

Model Architecture Options

TrainingConfig(
    embedding_dim=256,      # Embedding dimension [32, 64, 128, 256, 512]
    num_layers=8,           # Number of transformer layers [1-12]
    num_heads=16,           # Number of attention heads [1-16]
    feedforward_dim=1024,   # Feedforward network dimension
    # embedding_dim must be divisible by num_heads
)
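
Since CalcGPT is GPT-2 based, these options presumably map onto a HuggingFace GPT2Config. The correspondence below is an assumption for illustration; the keyword arguments are real GPT2Config fields, and the comments show the CalcGPT option each one would carry:

from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=105,   # e.g. number-level tokenizer (~105 tokens)
    n_embd=256,       # embedding_dim
    n_layer=8,        # num_layers
    n_head=16,        # num_heads (n_embd must be divisible by n_head)
    n_inner=1024,     # feedforward_dim
)
model = GPT2LMHeadModel(config)
print(f"{model.num_parameters():,} parameters")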

Training Hyperparameters

TrainingConfig(
    epochs=50,              # Training epochs
    batch_size=16,          # Training batch size
    learning_rate=1e-4,     # Learning rate
    weight_decay=0.01,      # L2 regularization
    warmup_steps=100,       # Learning rate warmup
    test_split=0.2,         # Validation split ratio
    save_steps=1000,        # Checkpoint frequency
)
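
If training is built on HuggingFace's Trainer (an assumption), these fields correspond closely to the standard TrainingArguments:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="models/demo",        # hypothetical output path
    num_train_epochs=50,             # epochs
    per_device_train_batch_size=16,  # batch_size
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_steps=100,
    save_steps=1000,
)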

Dataset Configuration

DatagenConfig(
    min_value=0,                              # Minimum operand value
    max_value=100,                            # Maximum operand value
    operations=['addition', 'subtraction'],   # Operations to include
    max_expressions=10000,                    # Maximum number of expressions
    allowed_digits='all',                     # Digit constraints
    verbose=True                              # Progress reporting
)

Logging Configuration

# Basic setup
setup_logging()  # Uses defaults: INFO console, DEBUG file

# Custom setup
setup_logging(
    logs_dir="custom_logs",           # Log directory
    console_level="DEBUG",            # Console verbosity
    file_level="DEBUG"                # File verbosity
)

# Component-specific logging
train_logger = get_logger('train')      # Creates logs/train.log
inference_logger = get_logger('inference')  # Creates logs/inference.log

# Log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
# File features: automatic rotation (10MB), 5 backups, UTF-8 encoding
# Console features: colored output, timestamps, function:line info

📈 Performance & Benchmarks

Model Performance by Architecture

Architecture            Parameters  Training Time  Accuracy  Use Case
Tiny (32d, 1L, 2H)      38K         30 seconds     60-80%    Learning & prototyping
Small (64d, 3L, 4H)     180K        2 minutes      80-90%    Development & testing
Medium (128d, 6L, 8H)   1.2M        10 minutes     90-95%    Production ready
Large (256d, 8L, 16H)   4.8M        30 minutes     95-98%    High accuracy needs

Evaluation Metrics

  • Format Validity: Does the output follow the number+number=result format? (a sketch of this check follows the list)
  • Arithmetic Correctness: Is the mathematical result correct?
  • Complete Expressions: Does the model generate complete, valid expressions?
  • Inference Speed: Average time per prediction (typically 10-50ms)
  • Tokenization Efficiency: Character vs number mode sequence length impact
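
The first two metrics can be pictured as simple checks. A minimal sketch assuming one- or two-digit operands and the number+number=result shape described above (illustration only, not the evaluator's code):

import re

PATTERN = re.compile(r"(\d{1,2})([+\-])(\d{1,2})=(-?\d+)")

def check(text: str) -> dict:
    m = PATTERN.fullmatch(text)
    if m is None:
        return {"format_valid": False, "arithmetic_correct": False}
    a, op, b, result = int(m.group(1)), m.group(2), int(m.group(3)), int(m.group(4))
    expected = a + b if op == "+" else a - b
    return {"format_valid": True, "arithmetic_correct": result == expected}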

Scaling Guidelines

  • For learning: Start with tiny models (38K parameters) + character tokenization
  • For development: Use small to medium models (180K-1.2M parameters)
  • For production: Medium to large models (1.2M-4.8M parameters) + number tokenization
  • For research: Large models with custom architectures + experiment with tokenization modes

Tokenization Mode Guidelines

  • Character mode: Better for learning transformer mechanics, smaller vocabulary
  • Number mode: Better for arithmetic understanding, more efficient sequences
  • Experimentation: Compare both modes for your specific use case and data range

🤝 Contributing

We welcome contributions! Please follow these guidelines:

Development Setup

# Clone and setup development environment
git clone https://github.com/yourusername/calcgpt.git
cd calcgpt
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -r requirements-dev.txt  # Additional dev dependencies

Code Standards

  • Follow PEP 8 style guidelines
  • Add type hints for all functions
  • Include comprehensive docstrings
  • Write unit tests for new functionality
  • Update documentation for API changes

Testing

# Run unit tests
python -m pytest tests/

# Run integration tests
python -m pytest tests/ --integration

# Test CLI tools
python tests/test_cli.py

Pull Request Process

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • HuggingFace Transformers: For the excellent transformer library
  • PyTorch: For the deep learning framework
  • OpenAI: For the original GPT architecture inspiration
  • The Open Source Community: For continuous inspiration and support

📚 Citation

If you use CalcGPT in your research, please cite:

@software{calcgpt2025,
  title={CalcGPT: Transformer-Based Arithmetic Language Models},
  author={Mihai NADAS},
  year={2025},
  url={https://github.com/mihainadas/calcgpt}
}

Built with ❤️ for the AI/ML community

For questions, issues, or contributions, please visit our GitHub repository or open an issue.
