Empirical AI scaling analysis and capability forecasting. Fits scaling laws to frontier model benchmark data, detects capability emergence patterns, projects compute trends, and generates publication-quality visualizations and research reports.
This project analyzes how AI capabilities scale with training compute. It curates a dataset of 49 frontier models (2018-2025), fits power law and logistic scaling curves to benchmark performance, detects sharp capability transitions, and forecasts when specific capability thresholds will be crossed.
frontier-scaling/
data/ # Curated model dataset, benchmark metadata, compute trends
pipeline/ # Data loading, validation, preprocessing
analysis/ # Scaling laws, emergence detection, forecasting, compute trends
viz/ # Publication-quality matplotlib charts
report/ # Research report generator
tests/ # 31 passing tests
cli.py # CLI with demo, analyze, forecast, plot, report, data commands
49 models from 12 organizations spanning 2018-2025:
- GPT series (GPT-2 through GPT-4o)
- Claude series (Claude 1 through Claude 4)
- Llama series (LLaMA-7B through Llama 3 405B)
- Gemini, PaLM, Mistral, Mixtral, BLOOM, Falcon, Phi, DeepSeek, and more
10 benchmarks: MMLU, HellaSwag, HumanEval, GSM8K, ARC-Challenge, WinoGrande, TruthfulQA, BIG-Bench Hard, MATH, MT-Bench
Compute data: 26 landmark models from AlexNet (2012, 10^17 FLOP) through frontier models (2025, ~10^26 FLOP)
Data sources include published papers, official announcements, and Epoch AI estimates. Unknown values are null, estimates are clearly marked.
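A record in such a dataset might look like the following sketch (field names and values are illustrative, not the project's actual schema):

```python
# Hypothetical model record; unknown values stay None (null), estimates are flagged
record = {
    "name": "ExampleLM-70B",            # invented model for illustration
    "organization": "Example Org",
    "release_date": "2024-06-01",
    "parameters": 7.0e10,               # raw count, not "70B"
    "training_flop": None,              # unknown -> null, never guessed
    "flop_is_estimate": True,           # estimates are clearly marked
    "benchmarks": {"mmlu": 79.5, "gsm8k": None},
}
```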
- Power law: score = a * compute^b + c
- Logistic: score = L / (1 + exp(-k*(x - x0))) for capability saturation
- Chinchilla: L(N,D) = A/N^alpha + B/D^beta + E with optimal compute allocation
All fitting uses scipy.optimize.curve_fit, with a Levenberg-Marquardt fallback when the primary fit fails. Each fit reports R-squared, residuals, and 95% confidence intervals on the parameters.
Piecewise-linear breakpoint analysis classifies each benchmark's scaling behavior:
- Gradual — smooth monotonic improvement
- Sharp — noticeable acceleration at a compute threshold
- Phase transition — near-random performance followed by sudden capability
- Saturating — approaching a performance ceiling
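The breakpoint scan behind this classification can be sketched as follows; the scan-and-refit helper and the toy data are illustrative, not the project's implementation:

```python
import numpy as np

def best_breakpoint(x, y):
    """Scan candidate breakpoints; return the one minimizing total squared error
    of two independent least-squares line fits (illustrative helper)."""
    best = (None, np.inf, 0.0, 0.0)
    for i in range(2, len(x) - 2):            # need >= 2 points per segment
        left = np.polyfit(x[:i], y[:i], 1)
        right = np.polyfit(x[i:], y[i:], 1)
        err = (np.sum((np.polyval(left, x[:i]) - y[:i]) ** 2)
               + np.sum((np.polyval(right, x[i:]) - y[i:]) ** 2))
        if err < best[1]:
            best = (x[i], err, left[0], right[0])
    return best  # (breakpoint, error, left_slope, right_slope)

# Hypothetical phase transition: near-random scores, then rapid gains past ~10^23 FLOP
log_flop = np.array([20, 21, 22, 23, 24, 25, 26], dtype=float)
score = np.array([25, 26, 25, 27, 45, 62, 80], dtype=float)
bp, _, slope_l, slope_r = best_breakpoint(log_flop, score)
# A large right/left slope ratio suggests "sharp" or "phase transition" behavior
```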
Combines scaling law fits with compute growth trends to project:
- Benchmark scores at future compute levels with 80% confidence intervals
- Time-to-threshold for specific capability targets
- Human-performance timeline per benchmark
- Dangerous capability thresholds (autonomous coding, expert reasoning, mathematical proof)
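A rough sketch of how a time-to-threshold projection can combine the two ingredients; the fitted logistic, the 0.6-year doubling time, and the helper function are all hypothetical placeholders:

```python
import math

def years_to_threshold(score_fn, current_log_flop, target_score,
                       doubling_time_years=0.6, horizon_years=10.0, step=0.1):
    """Walk forward in time, growing compute at the assumed doubling rate,
    until the fitted score curve crosses the target (illustrative helper)."""
    t = 0.0
    log2_per_year = 1.0 / doubling_time_years
    while t <= horizon_years:
        log_flop = current_log_flop + math.log10(2) * log2_per_year * t
        if score_fn(log_flop) >= target_score:
            return t
        t += step
    return None  # threshold not reached within the horizon

# Hypothetical fitted curve: saturating logistic in log10 FLOP
fitted = lambda x: 95.0 / (1.0 + math.exp(-1.2 * (x - 24.0)))
eta = years_to_threshold(fitted, current_log_flop=25.0, target_score=90.0)
```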
- Training compute doubling time (overall and per-era)
- Cost-per-FLOP trends
- Algorithmic efficiency analysis (which models beat the scaling curve)
- Projections with uncertainty bands
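Doubling time falls out of a log-linear regression of training FLOP on release year. A minimal sketch, with invented (year, log10 FLOP) pairs rather than the curated dataset:

```python
import numpy as np

# Hypothetical (year, log10 training FLOP) pairs for landmark models
years = np.array([2012, 2015, 2018, 2020, 2022, 2024], dtype=float)
log_flop = np.array([17.0, 19.0, 21.5, 23.3, 24.5, 25.8])

# Slope of log10(FLOP) vs year gives the growth rate; doubling time follows
slope, intercept = np.polyfit(years, log_flop, 1)
doublings_per_year = slope / np.log10(2)
doubling_time_months = 12.0 / doublings_per_year
```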
Seven publication-quality chart types:
- Scaling curves — log-compute vs benchmark score with fitted curves
- Compute timeline — training FLOP over time, colored by era
- Capability trajectories — multiple benchmarks over time, normalized
- Emergence map — color-coded emergence patterns per benchmark
- Forecast dashboard — multi-panel projections with confidence bands
- Efficiency frontier — Pareto frontier of compute efficiency
- Human performance countdown — time to human-level per benchmark
# Install dependencies
pip install matplotlib scipy
# Run the full demo
python cli.py demo
# Explore the dataset
python cli.py data --stats
# Run forecasts
python cli.py forecast --benchmark mmlu --years 3
# Generate all plots
python cli.py plot --type all --output-dir output/
# Generate full research report
python cli.py report --output-dir output/
# Run the test suite
pytest tests/ -v  # 31 tests
- Training compute values are in total FLOP (floating-point operations), not FLOP/s
- Parameter counts use raw numbers (7e9, not "7B")
- Benchmark scores are in native scale (typically 0-100 for accuracy)
- Confidence intervals use parameter covariance from curve_fit, widened for extrapolation
- Forecasts acknowledge uncertainty with explicit CI bounds
- Benchmarks with insufficient data points are skipped rather than yielding fabricated results
MIT