frontier-scaling

Empirical AI scaling analysis and capability forecasting. Fits scaling laws to frontier model benchmark data, detects capability emergence patterns, projects compute trends, and generates publication-quality visualizations and research reports.

Overview

This project analyzes how AI capabilities scale with training compute. It curates a dataset of 49 frontier models (2018-2025), fits power law and logistic scaling curves to benchmark performance, detects sharp capability transitions, and forecasts when specific capability thresholds will be crossed.

Architecture

frontier-scaling/
  data/            # Curated model dataset, benchmark metadata, compute trends
  pipeline/        # Data loading, validation, preprocessing
  analysis/        # Scaling laws, emergence detection, forecasting, compute trends
  viz/             # Publication-quality matplotlib charts
  report/          # Research report generator
  tests/           # 31 passing tests
  cli.py           # CLI with demo, analyze, forecast, plot, report, data commands

Dataset

49 models from 12 organizations spanning 2018-2025:

  • GPT series (GPT-2 through GPT-4o)
  • Claude series (Claude 1 through Claude 4)
  • Llama series (LLaMA-7B through Llama 3 405B)
  • Gemini, PaLM, Mistral, Mixtral, BLOOM, Falcon, Phi, DeepSeek, and more

10 benchmarks: MMLU, HellaSwag, HumanEval, GSM8K, ARC-Challenge, WinoGrande, TruthfulQA, BIG-Bench Hard, MATH, MT-Bench

Compute data: 26 landmark models from AlexNet (2012, 10^17 FLOP) through frontier models (2025, ~10^26 FLOP)

Data sources include published papers, official announcements, and Epoch AI estimates. Unknown values are stored as null; estimated values are clearly marked as such.

Analysis Capabilities

Scaling Law Fitting

  • Power law: score = a * compute^b + c
  • Logistic: score = L / (1 + exp(-k*(x - x0))) for capability saturation
  • Chinchilla: L(N,D) = A/N^alpha + B/D^beta + E with optimal compute allocation

All fitting uses scipy.optimize.curve_fit with a Levenberg-Marquardt fallback. Each fit reports R-squared, residuals, and 95% confidence intervals on the fitted parameters.
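The power-law form above can be fitted in a few lines. This is a minimal sketch on synthetic data, not the repository's pipeline; the parameters, units, and noise level are illustrative:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(compute, a, b, c):
    # score = a * compute^b + c
    return a * np.power(compute, b) + c

# Synthetic data in arbitrary compute units, scores on a 0-100-ish scale
rng = np.random.default_rng(0)
compute = np.logspace(-2, 3, 20)
scores = power_law(compute, 30.0, 0.25, 10.0) + rng.normal(0, 1.0, 20)

# Without bounds, curve_fit defaults to Levenberg-Marquardt
params, cov = curve_fit(power_law, compute, scores,
                        p0=[10.0, 0.3, 0.0], maxfev=10000)

pred = power_law(compute, *params)
r_squared = 1 - np.sum((scores - pred) ** 2) / np.sum((scores - scores.mean()) ** 2)

# 95% confidence interval on the exponent b from the parameter covariance
b_ci = 1.96 * np.sqrt(cov[1, 1])
print(f"b = {params[1]:.3f} ± {b_ci:.3f}, R² = {r_squared:.4f}")
```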

Capability Emergence Detection

Piecewise-linear breakpoint analysis classifies each benchmark's scaling behavior:

  • Gradual — smooth monotonic improvement
  • Sharp — noticeable acceleration at a compute threshold
  • Phase transition — near-random performance followed by sudden capability
  • Saturating — approaching a performance ceiling
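One simple way to implement breakpoint detection is a grid search over candidate split points, fitting a line to each side and keeping the split with the lowest total squared error. A sketch on synthetic phase-transition data; the exact algorithm and classification thresholds in analysis/ may differ:

```python
import numpy as np

def find_breakpoint(x, y):
    """Grid-search a single breakpoint for a two-segment linear fit.

    Returns (breakpoint, left_slope, right_slope).
    """
    best = (None, None, None, np.inf)
    for i in range(2, len(x) - 2):  # need at least 2 points per segment
        left = np.polyfit(x[:i], y[:i], 1)
        right = np.polyfit(x[i:], y[i:], 1)
        sse = (np.sum((np.polyval(left, x[:i]) - y[:i]) ** 2)
               + np.sum((np.polyval(right, x[i:]) - y[i:]) ** 2))
        if sse < best[3]:
            best = (x[i], left[0], right[0], sse)
    return best[:3]

# Synthetic "phase transition": near-random score, then rapid gains
log_compute = np.linspace(20, 26, 30)  # log10(training FLOP)
score = np.where(log_compute < 23, 25.0, 25.0 + 15.0 * (log_compute - 23))

bp, left_slope, right_slope = find_breakpoint(log_compute, score)
print(bp, left_slope, right_slope)
```

A large jump in slope at the breakpoint (here from ~0 to ~15 points per decade of compute) is what distinguishes a phase transition from gradual scaling.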

Forecasting

Combines scaling law fits with compute growth trends to project:

  • Benchmark scores at future compute levels with 80% confidence intervals
  • Time-to-threshold for specific capability targets
  • Human-performance timeline per benchmark
  • Dangerous capability thresholds (autonomous coding, expert reasoning, mathematical proof)
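Time-to-threshold projections combine the two ingredients: invert the fitted scaling law for the compute needed to reach a target score, then convert the compute gap into calendar time via the doubling trend. A sketch with hypothetical fit parameters and growth rates; none of these numbers come from the dataset:

```python
import numpy as np

# Hypothetical power-law fit (score = a * C^b + c, C in FLOP) and trend
a, b, c = 5.5e-6, 0.28, 20.0      # illustrative fit parameters
doubling_time_years = 0.75        # illustrative compute doubling time
current_compute = 1e25            # illustrative current frontier, FLOP
threshold = 90.0                  # target benchmark score

# Invert the power law for the compute needed to hit the threshold
needed_compute = ((threshold - c) / a) ** (1.0 / b)

# Convert the compute gap into calendar time via the doubling trend
doublings = np.log2(needed_compute / current_compute)
years_to_threshold = doublings * doubling_time_years
print(f"{needed_compute:.2e} FLOP, ~{years_to_threshold:.1f} years")
```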

Compute Trend Analysis

  • Training compute doubling time (overall and per-era)
  • Cost-per-FLOP trends
  • Algorithmic efficiency analysis (which models beat the scaling curve)
  • Projections with uncertainty bands
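The doubling time itself falls out of a linear regression of log2(compute) on year. A sketch on illustrative (year, FLOP) pairs, not the curated compute dataset:

```python
import numpy as np

# Illustrative landmark training runs: (year, total training FLOP)
years = np.array([2012, 2015, 2018, 2020, 2022, 2024])
flop = np.array([1e17, 1e19, 1e21, 3e23, 1e24, 5e25])

# Slope of log2(FLOP) vs year is doublings per year
slope, intercept = np.polyfit(years, np.log2(flop), 1)
doubling_time_months = 12.0 / slope
print(f"~{doubling_time_months:.1f} months per doubling")
```

Per-era estimates follow the same recipe, restricted to the rows in each era.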

Visualizations

Seven publication-quality chart types:

  1. Scaling curves — log-compute vs benchmark score with fitted curves
  2. Compute timeline — training FLOP over time, colored by era
  3. Capability trajectories — multiple benchmarks over time, normalized
  4. Emergence map — color-coded emergence patterns per benchmark
  5. Forecast dashboard — multi-panel projections with confidence bands
  6. Efficiency frontier — Pareto frontier of compute efficiency
  7. Human performance countdown — time to human-level per benchmark
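A stripped-down version of chart type 1 shows the general shape of these figures (headless backend, synthetic points plus a fitted line; the data, labels, and styling here are illustrative, not the viz/ module's output):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted figure generation
import matplotlib.pyplot as plt
import numpy as np

# Synthetic models: log-compute vs score, plus an assumed linear-in-log fit
rng = np.random.default_rng(1)
compute = np.logspace(21, 26, 12)
fit = 25 + 12 * (np.log10(compute) - 21)
score = fit + rng.normal(0, 2, 12)

fig, ax = plt.subplots(figsize=(6, 4))
ax.semilogx(compute, score, "o", label="models")
ax.semilogx(compute, fit, "-", label="fitted curve")
ax.set_xlabel("Training compute (FLOP)")
ax.set_ylabel("Benchmark score")
ax.set_title("Scaling curve (sketch)")
ax.legend()
fig.savefig("scaling_curve.png", dpi=150)
```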

Quick Start

# Install dependencies
pip install matplotlib scipy

# Run the full demo
python cli.py demo

# Explore the dataset
python cli.py data --stats

# Run forecasts
python cli.py forecast --benchmark mmlu --years 3

# Generate all plots
python cli.py plot --type all --output-dir output/

# Generate full research report
python cli.py report --output-dir output/

Testing

pytest tests/ -v  # 31 tests

Methodology Notes

  • Training compute values are in total FLOP (floating-point operations), not FLOP/s
  • Parameter counts use raw numbers (7e9, not "7B")
  • Benchmark scores are in native scale (typically 0-100 for accuracy)
  • Confidence intervals use parameter covariance from curve_fit, widened for extrapolation
  • Forecasts acknowledge uncertainty with explicit CI bounds
  • Insufficient data points for a benchmark result in skipped analysis, not fabricated results
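The CI approach in the notes above can be sketched with the delta method: propagate curve_fit's parameter covariance to a prediction, then inflate the interval outside the fitted range. The linear model and the widening factor here are illustrative assumptions, not the project's exact scheme:

```python
import numpy as np
from scipy.optimize import curve_fit

def linear(x, m, b):
    return m * x + b

# Synthetic in-sample data: x is log10(FLOP)
rng = np.random.default_rng(2)
x = np.linspace(21, 25, 15)
y = linear(x, 10.0, -180.0) + rng.normal(0, 1.5, x.size)
params, cov = curve_fit(linear, x, y)

x_new = 27.0                          # extrapolation point
grad = np.array([x_new, 1.0])         # d(prediction)/d(params) for (m, b)
var = grad @ cov @ grad               # delta-method prediction variance
widen = 1.0 + 0.5 * max(0.0, x_new - x.max())  # grow the CI outside the data
pred = linear(x_new, *params)
ci95 = 1.96 * np.sqrt(var) * widen
print(pred, ci95)
```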

License

MIT
