StackWise is a modular AI research framework for training, evaluating, and scaling both classical and diffusion-inspired Transformer architectures.
It provides a unified stack for encoder, decoder, and depth-as-time models, along with standardized datasets, training curricula, and evaluation harnesses.
- Unified Architecture: Supports both masked/causal LMs and diffusion-style denoisers.
- Flexible Training Regimes: Left→right (capacity growth) or right→left (reverse-diffusion) curricula.
- Scalable Families: Tiny → XL for encoders (BERT, ModernBERT) and decoders (GPT, LLaMA).
- Compute-Matched Benchmarks: Fair scaling comparison under equal FLOP budgets.
- Modular Integration: Shared configs for datasets, models, and trainers.
- Research Ready: Designed for scaling-law and curriculum-based experiments.
The Ultimate Goal: Train a 70B-parameter LLM from scratch, comfortably, on a single H200 GPU.
- Traditional Training: 70B models require 8+ H100 GPUs (≈$200K+ in hardware) for a serious training run
- Memory Bottleneck: Standard training hits GPU memory limits
- Cost Barrier: Most researchers can't access multi-GPU clusters
- Single GPU Training: Train 70B models on a single H200 GPU, at the cost of longer wall-clock time
- Layer-wise Architecture: Progressive training with cached activations
- Bidirectional Learning: More efficient representation learning
- Memory Optimization: 10x+ memory reduction through smart caching
Traditional: [Input] → [Layer 1] → [Layer 2] → ... → [Layer N] → [Output]
StackWise: [Input] → [Layer 1] → Cache → [Layer 2] → Cache → ... → [Layer N] → [Output]
Read a detailed note on the Depth-as-Time viewpoint here.
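A minimal, PyTorch-style sketch of the caching idea (module and function names such as `make_block` and `local_loss` are illustrative, not the StackWise API): only one block is resident and trainable at a time, and each block trains on activations cached by the block before it.

```python
import torch
import torch.nn as nn

# Illustrative sketch of layer-wise training with cached activations.
d_model, n_layers, steps = 64, 4, 10

def make_block() -> nn.Module:
    return nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)

def local_loss(h: torch.Tensor) -> torch.Tensor:
    # Placeholder objective; StackWise would use MLM/CLM/denoising losses here.
    return h.pow(2).mean()

# Start from input embeddings; in practice these come from a tokenizer + embedding table.
cached = torch.randn(8, 16, d_model)  # (batch, seq, d_model)

trained_blocks = []
for layer_idx in range(n_layers):
    block = make_block()                      # only this block needs to live on the GPU
    opt = torch.optim.AdamW(block.parameters(), lr=1e-4)
    for _ in range(steps):
        out = block(cached)                   # read from the activation cache
        loss = local_loss(out)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        cached = block(cached)                # refresh the cache for the next layer
    trained_blocks.append(block.eval())       # frozen copy kept (possibly offloaded or quantized)
```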
Key Benefits:
- Memory Efficiency: Only one layer active at a time
- Progressive Learning: Each layer learns from previous cached activations
- Bidirectional Attention: Better context understanding during training
- Flexible Inference: Switch between causal (GPT) and bidirectional (BERT) modes
- Unified Training: Single framework for both Encoder and Decoder models
- Mixed Precision: Choose flexible precision formats for frozen trunks and the trainable parts
- Training Phase: Bidirectional attention (BERT-style) for efficient learning
- Fusion Phase: Progressive model assembly with optional fine-tuning
- Inference Phase: Causal attention (GPT-style) for autoregressive generation
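To make the phase distinction concrete, here is a small hedged example of the underlying attention-mask switch, written in plain PyTorch rather than StackWise's own modules: the same layer runs unmasked (bidirectional, BERT-style) during training and with a causal mask (GPT-style) for autoregressive inference.

```python
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 64, 4, 16
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
x = torch.randn(2, seq_len, d_model)

# Training phase: bidirectional attention -> no mask, every token attends to every token.
h_bidirectional = layer(x)

# Inference phase: causal attention -> upper-triangular mask blocks attention to future tokens.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
h_causal = layer(x, src_mask=causal_mask)
```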
Read about the Block-Stack-Rack nomenclature here; it facilitates training models end-to-end or progressively via different training curricula.
Rack (Complete Model)
├── Stack 1 (4 Blocks)
│   ├── Block 1 (Transformer Layer)
│   ├── Block 2 (Transformer Layer)
│   ├── Block 3 (Transformer Layer)
│   └── Block 4 (Transformer Layer)
├── Stack 2 (4 Blocks)
│   ├── Block 5 (Transformer Layer)
│   ├── Block 6 (Transformer Layer)
│   ├── Block 7 (Transformer Layer)
│   └── Block 8 (Transformer Layer)
└── ... (More Stacks)
Key Innovation: This paradigm supports stack-wise training, where entire stacks can be trained independently, enabling:
- Memory Efficiency: Train one stack at a time
- Progressive Building: Add stacks incrementally
- Flexible Curriculum: Different training strategies per stack
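A minimal sketch of the nomenclature and of stack-wise training, assuming hypothetical helpers (`make_block`, `make_stack`) rather than the actual StackWise classes: each new stack trains on top of a frozen trunk of previously trained stacks.

```python
import torch
import torch.nn as nn

d_model, blocks_per_stack, n_stacks = 64, 4, 2

def make_block() -> nn.Module:                    # Block = one Transformer layer
    return nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)

def make_stack() -> nn.Module:                    # Stack = a group of Blocks
    return nn.Sequential(*[make_block() for _ in range(blocks_per_stack)])

rack = nn.ModuleList()                            # Rack = the complete model

x = torch.randn(8, 16, d_model)
for stack_idx in range(n_stacks):
    new_stack = make_stack()
    opt = torch.optim.AdamW(new_stack.parameters(), lr=1e-4)

    # Frozen trunk: run all previously trained stacks without gradients.
    with torch.no_grad():
        h = x
        for frozen_stack in rack:
            h = frozen_stack(h)

    out = new_stack(h)
    loss = out.pow(2).mean()                      # placeholder objective
    opt.zero_grad()
    loss.backward()
    opt.step()

    rack.append(new_stack.eval().requires_grad_(False))
```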
StackWise unifies Encoder, Decoder, and Diffusion models through a single training framework:
- Masked Language Modeling (MLM): BERT-style bidirectional training
- Causal Language Modeling (CLM): GPT-style autoregressive training
- Diffusion Modeling: Depth-as-Time progressive denoising
- Unified Framework: Switch between MLM, CLM, and diffusion modes seamlessly
- Task Flexibility: Same model architecture for understanding, generation, and diffusion
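One way to picture the unified objectives (an illustrative sketch, not the framework's actual data pipeline): the backbone is unchanged, and only the input corruption and the targets differ between MLM and CLM.

```python
import torch

vocab_size, mask_id, mask_prob = 1000, 3, 0.15
tokens = torch.randint(4, vocab_size, (2, 16))   # a toy batch of token ids

def mlm_inputs(tokens: torch.Tensor):
    """BERT-style: corrupt some positions, predict only those positions."""
    is_masked = torch.rand(tokens.shape) < mask_prob
    corrupted = tokens.masked_fill(is_masked, mask_id)
    targets = tokens.masked_fill(~is_masked, -100)   # -100 = ignored by cross-entropy
    return corrupted, targets

def clm_inputs(tokens: torch.Tensor):
    """GPT-style: predict token t+1 from tokens up to t."""
    return tokens[:, :-1], tokens[:, 1:]

mlm_x, mlm_y = mlm_inputs(tokens)
clm_x, clm_y = clm_inputs(tokens)
# The same Rack consumes either (x, y) pair; only the attention mask and the
# loss targets change between the two objectives.
```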
StackWise supports two distinct curriculum approaches for building models:
Left-to-Right (Capacity Growth):
- Focus: Progressive model capacity building
- Approach: Add new stacks to the right
- Benefit: Gradual complexity increase
- Use Case: Traditional model scaling
Right-to-Left (Reverse Diffusion):
- Focus: Retain learned semantics while improving
- Approach: Add new stacks to the left, freeze rightmost stacks
- Benefit: Preserves learned representations
- Use Case: Incremental model improvement
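The two curricula can be viewed as schedules over stack indices. The sketch below is purely illustrative: left-to-right appends the new trainable stack on the right, while right-to-left prepends it on the left and freezes everything to its right.

```python
def left_to_right_schedule(n_stacks):
    """Grow capacity: new stack appended on the right, earlier stacks form the trunk."""
    rack = []
    for new_stack in range(1, n_stacks + 1):
        rack = rack + [new_stack]              # append to the right
        trainable, frozen = [new_stack], rack[:-1]
        yield rack, trainable, frozen

def right_to_left_schedule(n_stacks):
    """Reverse-diffusion style: new stack prepended on the left, rightmost stacks frozen."""
    rack = []
    for new_stack in range(n_stacks, 0, -1):
        rack = [new_stack] + rack              # prepend to the left
        trainable, frozen = [new_stack], rack[1:]
        yield rack, trainable, frozen

for rack, trainable, frozen in right_to_left_schedule(3):
    print(f"rack={rack} trainable={trainable} frozen={frozen}")
# rack=[3]       trainable=[3] frozen=[]
# rack=[2, 3]    trainable=[2] frozen=[3]
# rack=[1, 2, 3] trainable=[1] frozen=[2, 3]
```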
- Modern Attention: GQA, MLA, and Kernel-based attention
- Quantization: FP4, FP8, FP16 support for memory efficiency
- QLoRA Integration: Low-rank adapters for efficient fine-tuning
- Progressive Training: Build models incrementally
- Mask-Diffusion: Variable masking (15%-90%) for better learning
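A hedged sketch of what variable masking might look like at the data level (the function name and the uniform sampling choice are assumptions, not the exact StackWise implementation): each example draws its own mask rate between 15% and 90%, and corruption happens at the token level rather than as embedding noise.

```python
import torch

mask_id, min_rate, max_rate = 3, 0.15, 0.90

def mask_diffusion_corrupt(tokens: torch.Tensor):
    """Draw one mask rate per example in [15%, 90%] and replace tokens with [MASK]."""
    batch, seq_len = tokens.shape
    rates = torch.empty(batch, 1).uniform_(min_rate, max_rate)      # per-example mask rate
    is_masked = torch.rand(batch, seq_len) < rates                  # token-level corruption
    corrupted = tokens.masked_fill(is_masked, mask_id)
    targets = tokens.masked_fill(~is_masked, -100)                  # predict masked tokens only
    return corrupted, targets, rates.squeeze(1)

tokens = torch.randint(4, 1000, (4, 32))
corrupted, targets, rates = mask_diffusion_corrupt(tokens)
print(rates)   # e.g. tensor([0.21, 0.74, 0.55, 0.88])
```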
- Traditional 70B: ~280GB GPU memory (8x H100)
- StackWise 70B: ~35GB GPU memory (1x H200)
- Memory Reduction: 8x improvement through layer-wise training
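One plausible back-of-the-envelope reading of these numbers (the bytes-per-parameter assumptions are made here for illustration, not an official memory accounting): 70B fp32 weights alone occupy about 280 GB, while a 4-bit frozen trunk occupies about 35 GB, with only the small trainable stack adding fp16 weights and optimizer state on top.

```python
params = 70e9                                  # 70B parameters

# Traditional: full-precision weights resident for end-to-end training (fp32 = 4 bytes/param).
traditional_weights_gb = params * 4 / 1e9      # ≈ 280 GB, before optimizer states and activations

# StackWise: frozen trunk quantized to 4 bits (0.5 bytes/param); only one small
# trainable stack additionally needs fp16 weights, gradients and optimizer states.
stackwise_trunk_gb = params * 0.5 / 1e9        # ≈ 35 GB

print(traditional_weights_gb, stackwise_trunk_gb)   # 280.0 35.0
```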
# Progressive training for 70B model
config = {
    "model": {
        "d_model": 8192,        # 70B model dimensions
        "n_heads": 64,
        "d_ff": 28672,
        "architecture": {
            "n_stacks": 80,            # 80 stacks
            "blocks_per_stack": 1      # 1 block per stack = 80 layers
        }
    },
    "training": {
        "progressive": {
            "enabled": True,
            "trunk_strategy": "frozen",        # Freeze previous layers
            "new_stack_precision": "fp16",     # Memory-efficient training
            "cache_activations": True          # Essential for layer-wise training
        }
    }
}

# Create and activate virtual environment
uv venv
source .venv/bin/activate
# Install dependencies
uv pip install -e .[advanced]

# Navigate to examples
cd examples/gpt2_fusion
# Prepare data
python3 data_loader.py --prepare
# Train with layer-wise progressive training
python3 simple_train.py

# Navigate to baselines
cd baselines
# Train a medium model
uv run python scripts/train.py model=encoder/bert_family/base
# Train with progressive training
uv run python scripts/train.py --config-name=experiments/bert_reproduction/bert_base_glue

Status: ✅ Complete - All training modules and baselines framework implemented!
Layer-wise Training:
- Memory: Ultra-low (single layer at a time)
- Speed: Sequential but memory-efficient
- Use Case: Maximum memory efficiency, debugging
Group-wise Training:
- Memory: Low (groups of layers)
- Speed: Faster than layer-wise
- Use Case: Balanced efficiency and speed
Stack-wise Training:
- Memory: Medium (entire stacks)
- Speed: Fast (stack-level training)
- Use Case: Progressive model building, curriculum learning
- Curriculum Support: Both left-to-right and right-to-left approaches
Progressive Training:
- Memory: Medium (progressive building)
- Speed: Fast (incremental building)
- Use Case: Large model training, research
- Curriculum Support: Flexible curriculum strategies
Fusion Training:
- Memory: High (multiple blocks)
- Speed: Variable (depends on frozen/trainable ratio)
- Use Case: Fine-tuning, production
- Single GPU: Train 70B models on 1 H200
- Progressive Building: Add layers incrementally
- Activation Caching: Smart memory management
- Single Framework: Train BERT, GPT, and diffusion models
- Flexible Objectives: Switch between MLM, CLM, and diffusion seamlessly
- Task Adaptation: Same architecture for understanding, generation, and diffusion
- Curriculum Learning: Progressive model building strategies
- Depth-as-Time: Revolutionary paradigm where depth equals reverse diffusion time
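A hedged sketch of the depth-as-time mapping: each stack depth corresponds to a reverse-diffusion step, so early stacks see heavily masked inputs and the final stack sees nearly clean ones. The linear schedule below is an illustrative choice, not necessarily the schedule StackWise uses.

```python
def depth_to_mask_rate(stack_idx: int, n_stacks: int,
                       max_rate: float = 0.90, min_rate: float = 0.15) -> float:
    """Map stack depth to a corruption level: the first stack plays the earliest
    (noisiest) reverse-diffusion step, the last stack the final denoising step.
    A linear schedule is used purely for illustration."""
    t = stack_idx / max(n_stacks - 1, 1)          # 0.0 at the first stack, 1.0 at the last
    return max_rate - t * (max_rate - min_rate)

n_stacks = 5
for i in range(n_stacks):
    print(f"stack {i + 1}: mask rate ≈ {depth_to_mask_rate(i, n_stacks):.2f}")
# stack 1: mask rate ≈ 0.90
# stack 3: mask rate ≈ 0.53
# stack 5: mask rate ≈ 0.15
```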
Left-to-Right (Capacity Growth):
Stack 1 → Stack 2 → Stack 3 → ... → Stack N
- Approach: Add new stacks to the right
- Focus: Progressive capacity enhancement
- Benefit: Gradual complexity increase
- Use Case: Traditional model scaling
Right-to-Left (Reverse Diffusion):
Stack N → Stack N-1 → ... → Stack 2 → Stack 1
- Approach: Add new stacks to the left, freeze rightmost
- Focus: Retain learned semantics while improving
- Benefit: Preserves learned representations
- Use Case: Incremental model improvement
- Bidirectional Training: Better representation learning
- Modern Attention: GQA, MLA, Kernel-based
- Flexible Inference: Switch between autoregressive next-token decoding and all-at-once diffusion decoding
- Variable Masking: 15%-90% token masking
- Progressive Schedules: Time-as-depth training
- Mask-Diffusion: Token-level diffusion (not embedding noise)
- Read the Architecture Guide
- Try the Progressive Training Example
- Explore the Baselines Module
- Check the API Reference
- Read the Configuration Guide
- Run the Test Suite
- Review the Checkpointing Guide
- Configure for your use case
- Scale to your target model size
The StackWise Baselines module provides a comprehensive benchmarking framework for encoder and decoder model families with Hydra configuration management.
- Reproducible Baselines: BERT, GPT-2, and LLaMA family models
- Hydra Configuration: Hierarchical config management
- Comprehensive Evaluation: GLUE, language modeling, and reasoning tasks
- Experimental Tracking: Automated logging and result analysis
- Multi-run Support: Parameter sweeps and comparisons
# Navigate to baselines
cd baselines
# Train a tiny BERT model
uv run python scripts/train.py model=encoder/bert_family/tiny
# Run a complete experiment
uv run python scripts/train.py --config-name=experiments/bert_reproduction/bert_base_glue
# Learn about Hydra
python examples/hydra_simple_explanation.py

# Mix and match components
uv run python scripts/train.py model=encoder/bert_family/base training=depth_time
# Override specific values
uv run python scripts/train.py model=encoder/bert_family/base model.d_model=512
# Run multiple experiments
uv run python scripts/train.py --multirun model=encoder/bert_family/tiny,encoder/bert_family/base

For detailed documentation, see baselines/README.md.
- Depth-as-Time Design (Conceptual Breakthrough!) - training paradigm supporting Encoder, Autoregressive Decoder, and Diffusion models
- Architecture Guide - Core architecture concepts
- Progressive Training - Advanced training strategies
- Configuration Guide - Configuration reference
- Checkpointing Guide - Checkpointing reference
- API Reference - API sketch
- Baselines Module - Benchmarking framework
We welcome contributions! Please see our Contributing Guide for details.
This project is licensed under the MIT License - see the LICENSE file for details.
- Diffusion Models for progressive denoising concepts
- Roberta-Diffusion for diffusion training inspiration
- DeepSeek-V2/V3 for MLA formulation
- BERT for bidirectional attention paradigm
- GPT for causal attention paradigm
Ready to revolutionize transformer training? Start with StackWise and train your first 70B model on a single GPU! 🚀