Empirical AI scaling analysis and capability forecasting. Fits scaling laws to frontier model benchmark data, detects capability emergence patterns, projects compute trends, and generates publication-quality visualizations and research reports.
This project analyzes how AI capabilities scale with training compute. It curates a dataset of 49 frontier models (2018-2025), fits power law and logistic scaling curves to benchmark performance, detects sharp capability transitions, and forecasts when specific capability thresholds will be crossed.
frontier-scaling/
data/ # Curated model dataset, benchmark metadata, compute trends
pipeline/ # Data loading, validation, preprocessing
analysis/ # Scaling laws, emergence detection, forecasting, compute trends
viz/ # Publication-quality matplotlib charts
report/ # Research report generator
tests/ # 31 passing tests
cli.py # CLI with demo, analyze, forecast, plot, report, data commands
49 models from 12 organizations spanning 2018-2025:
- GPT series (GPT-2 through GPT-4o)
- Claude series (Claude 1 through Claude 4)
- Llama series (LLaMA-7B through Llama 3 405B)
- Gemini, PaLM, Mistral, Mixtral, BLOOM, Falcon, Phi, DeepSeek, and more
10 benchmarks: MMLU, HellaSwag, HumanEval, GSM8K, ARC-Challenge, WinoGrande, TruthfulQA, BIG-Bench Hard, MATH, MT-Bench
Compute data: 26 landmark models from AlexNet (2012, 10^17 FLOP) through frontier models (2025, ~10^26 FLOP)
Data sources include published papers, official announcements, and Epoch AI estimates. Unknown values are null, estimates are clearly marked.
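A record in such a dataset might look like the following sketch (field names and values are illustrative, not the project's actual schema):

```python
# Hypothetical model record; unknown values stay None (null), estimates are flagged
record = {
    "name": "ExampleLM-70B",            # invented model for illustration
    "organization": "Example Org",
    "release_date": "2024-06-01",
    "parameters": 7.0e10,               # raw count, not "70B"
    "training_flop": None,              # unknown -> null, never guessed
    "flop_is_estimate": True,           # estimates are clearly marked
    "benchmarks": {"mmlu": 79.5, "gsm8k": None},
}
```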
- Power law: score = a * compute^b + c
- Logistic: score = L / (1 + exp(-k*(x - x0))) for capability saturation
- Chinchilla: L(N,D) = A/N^alpha + B/D^beta + E with optimal compute allocation
All fitting uses scipy.optimize.curve_fit, with a Levenberg-Marquardt fallback when the primary fit fails. Each fit reports R-squared, residuals, and 95% confidence intervals on the parameters.
Piecewise-linear breakpoint analysis classifies each benchmark's scaling behavior:
- Gradual — smooth monotonic improvement
- Sharp — noticeable acceleration at a compute threshold
- Phase transition — near-random performance followed by sudden capability
- Saturating — approaching a performance ceiling
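The breakpoint scan behind this classification can be sketched as follows; the scan-and-refit helper and the toy data are illustrative, not the project's implementation:

```python
import numpy as np

def best_breakpoint(x, y):
    """Scan candidate breakpoints; return the one minimizing total squared error
    of two independent least-squares line fits (illustrative helper)."""
    best = (None, np.inf, 0.0, 0.0)
    for i in range(2, len(x) - 2):            # need >= 2 points per segment
        left = np.polyfit(x[:i], y[:i], 1)
        right = np.polyfit(x[i:], y[i:], 1)
        err = (np.sum((np.polyval(left, x[:i]) - y[:i]) ** 2)
               + np.sum((np.polyval(right, x[i:]) - y[i:]) ** 2))
        if err < best[1]:
            best = (x[i], err, left[0], right[0])
    return best  # (breakpoint, error, left_slope, right_slope)

# Hypothetical phase transition: near-random scores, then rapid gains past ~10^23 FLOP
log_flop = np.array([20, 21, 22, 23, 24, 25, 26], dtype=float)
score = np.array([25, 26, 25, 27, 45, 62, 80], dtype=float)
bp, _, slope_l, slope_r = best_breakpoint(log_flop, score)
# A large right/left slope ratio suggests "sharp" or "phase transition" behavior
```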
Combines scaling law fits with compute growth trends to project:
- Benchmark scores at future compute levels with 80% confidence intervals
- Time-to-threshold for specific capability targets
- Human-performance timeline per benchmark
- Dangerous capability thresholds (autonomous coding, expert reasoning, mathematical proof)
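A rough sketch of how a time-to-threshold projection can combine the two ingredients; the fitted logistic, the 0.6-year doubling time, and the helper function are all hypothetical placeholders:

```python
import math

def years_to_threshold(score_fn, current_log_flop, target_score,
                       doubling_time_years=0.6, horizon_years=10.0, step=0.1):
    """Walk forward in time, growing compute at the assumed doubling rate,
    until the fitted score curve crosses the target (illustrative helper)."""
    t = 0.0
    log2_per_year = 1.0 / doubling_time_years
    while t <= horizon_years:
        log_flop = current_log_flop + math.log10(2) * log2_per_year * t
        if score_fn(log_flop) >= target_score:
            return t
        t += step
    return None  # threshold not reached within the horizon

# Hypothetical fitted curve: saturating logistic in log10 FLOP
fitted = lambda x: 95.0 / (1.0 + math.exp(-1.2 * (x - 24.0)))
eta = years_to_threshold(fitted, current_log_flop=25.0, target_score=90.0)
```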
- Training compute doubling time (overall and per-era)
- Cost-per-FLOP trends
- Algorithmic efficiency analysis (which models beat the scaling curve)
- Projections with uncertainty bands
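Doubling time falls out of a log-linear regression of training FLOP on release year. A minimal sketch, with invented (year, log10 FLOP) pairs rather than the curated dataset:

```python
import numpy as np

# Hypothetical (year, log10 training FLOP) pairs for landmark models
years = np.array([2012, 2015, 2018, 2020, 2022, 2024], dtype=float)
log_flop = np.array([17.0, 19.0, 21.5, 23.3, 24.5, 25.8])

# Slope of log10(FLOP) vs year gives the growth rate; doubling time follows
slope, intercept = np.polyfit(years, log_flop, 1)
doublings_per_year = slope / np.log10(2)
doubling_time_months = 12.0 / doublings_per_year
```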
Seven publication-quality chart types:
- Scaling curves — log-compute vs benchmark score with fitted curves
- Compute timeline — training FLOP over time, colored by era
- Capability trajectories — multiple benchmarks over time, normalized
- Emergence map — color-coded emergence patterns per benchmark
- Forecast dashboard — multi-panel projections with confidence bands
- Efficiency frontier — Pareto frontier of compute efficiency
- Human performance countdown — time to human-level per benchmark
# Install dependencies
pip install matplotlib scipy
# Run the full demo
python cli.py demo
# Explore the dataset
python cli.py data --stats
# Run forecasts
python cli.py forecast --benchmark mmlu --years 3
# Generate all plots
python cli.py plot --type all --output-dir output/
# Generate full research report
python cli.py report --output-dir output/
# Run the test suite
pytest tests/ -v  # 31 tests
- Training compute values are in total FLOP (floating-point operations), not FLOP/s
- Parameter counts use raw numbers (7e9, not "7B")
- Benchmark scores are in native scale (typically 0-100 for accuracy)
- Confidence intervals use parameter covariance from curve_fit, widened for extrapolation
- Forecasts acknowledge uncertainty with explicit CI bounds
- Benchmarks with insufficient data points are skipped rather than yielding fabricated results
MIT