dean0x/autolab

autolab

Companion tools for Karpathy's autoresearch — the autonomous AI research framework for GPT pretraining.

Three CLIs that make the autoresearch experiment loop smarter: evaluate results statistically, get data-driven suggestions for what to try next, and run multi-agent competitions.

Install

pip install autojudge autosteer autoevolve

Requires Python >= 3.10 (matching autoresearch).

The Tools

autojudge — Smarter Experiment Evaluation

Replaces eyeballing val_bpb with statistical verdicts that account for noise floor, Pareto efficiency, and trend context.

autojudge --results results.tsv --run-log run.log
Experiment #14: val_bpb 3.91 → 3.87
  Verdict: KEEP (confidence: 82%)
  Delta: -1.01% (2.0x noise floor)
  Pareto frontier: yes
  Suggestion: Improvement looks real. Commit and continue.

Verdicts: STRONG_KEEP | KEEP | MARGINAL | RETEST | DISCARD | CRASH
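The "Delta" line in the sample output can be read as a relative change in val_bpb compared against run-to-run noise. The sketch below illustrates that reading only; it is not autojudge's actual implementation. The function names, the choice of sample standard deviation over repeated baseline runs as the noise estimate, and the baseline values are all assumptions made for illustration.

```python
import statistics

def relative_delta(before: float, after: float) -> float:
    """Relative change in val_bpb as a percentage; negative means improvement."""
    return (after - before) / before * 100.0

def noise_multiple(delta_pct: float, noise_floor_pct: float) -> float:
    """How many noise floors the change spans (assumed definition)."""
    return abs(delta_pct) / noise_floor_pct

# Hypothetical val_bpb values from re-running an unchanged baseline.
baseline_runs = [3.91, 3.93, 3.89, 3.92, 3.90]
mean = statistics.mean(baseline_runs)
# Assumed noise estimate: sample standard deviation relative to the mean.
noise_floor_pct = statistics.stdev(baseline_runs) / mean * 100.0

delta = relative_delta(3.91, 3.87)
print(f"delta: {delta:.2f}%")
print(f"noise multiple: {noise_multiple(delta, noise_floor_pct):.1f}x")
```

A change that is only a small multiple of the noise floor is hard to distinguish from a lucky seed, which is presumably why verdicts such as MARGINAL and RETEST exist alongside KEEP and DISCARD.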

Exit codes enable scripting:

if autojudge --results results.tsv; then git commit -m "keep"; else git reset --hard HEAD~1; fi
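The same exit-code check can be driven from Python. This is a sketch under the assumption, implied by the shell one-liner, that a zero exit status means "keep" and any non-zero status means "do not keep". The exact mapping from verdicts to exit codes is not stated here, and keep_or_discard is a hypothetical helper name.

```python
import subprocess
import sys

def keep_or_discard(command: list[str]) -> str:
    """Run an evaluation command and map its exit status to an action.

    Assumption: exit status 0 means the experiment should be kept.
    """
    result = subprocess.run(command, check=False)
    return "keep" if result.returncode == 0 else "discard"

# Stand-in commands that only exit with a chosen status. In practice the
# command would be ["autojudge", "--results", "results.tsv"].
ok = [sys.executable, "-c", "raise SystemExit(0)"]
bad = [sys.executable, "-c", "raise SystemExit(1)"]
print(keep_or_discard(ok), keep_or_discard(bad))
```

Passing the command as a list avoids shell quoting issues, and check=False keeps a non-zero status from raising an exception so it can be treated as an ordinary outcome.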

autosteer — Research Direction Generator

Analyzes experiment history and suggests what to try next. Stops the random walk.

autosteer --results results.tsv
[1] [EXPLOIT] Tune learning rate warmup schedule          risk: low
    Rationale: Learning rate experiments have 3 keeps in 4 attempts.
[2] [EXPLORE] Try rotary position embeddings               risk: medium
    Rationale: Positional encoding category untested. High potential.
[3] [EXPLOIT] Increase batch size from 32 to 48           risk: low
    Rationale: Batch size increase worked well in experiment #12.

Strategy modes: auto (default) | explore (when stuck) | exploit (when winning)
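The rationales in the sample output ("3 keeps in 4 attempts", "category untested") suggest a keep-rate view of the experiment history. The sketch below shows one simple way such labels could be derived. It is a hypothetical heuristic, not autosteer's algorithm: the function name, the category names, the 0.5 threshold, and the SKIP label are all assumptions.

```python
from collections import defaultdict

def label_categories(history, all_categories, min_keep_rate=0.5):
    """Tag each category EXPLORE (never tried), EXPLOIT (keep rate at or
    above the threshold), or SKIP (tried, but rarely kept)."""
    attempts = defaultdict(int)
    keeps = defaultdict(int)
    for category, kept in history:
        attempts[category] += 1
        keeps[category] += int(kept)

    labels = {}
    for category in all_categories:
        if attempts[category] == 0:
            labels[category] = "EXPLORE"
        elif keeps[category] / attempts[category] >= min_keep_rate:
            labels[category] = "EXPLOIT"
        else:
            labels[category] = "SKIP"
    return labels

# Hypothetical history of (category, was_kept) pairs.
history = [
    ("learning_rate", True), ("learning_rate", True),
    ("learning_rate", False), ("learning_rate", True),
    ("dropout", False), ("dropout", False),
]
categories = ["learning_rate", "positional_encoding", "dropout"]
print(label_categories(history, categories))
# → {'learning_rate': 'EXPLOIT', 'positional_encoding': 'EXPLORE', 'dropout': 'SKIP'}
```

Under this reading, an explore mode would weight the EXPLORE entries more heavily and an exploit mode the EXPLOIT entries.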

autoevolve — Multi-Agent Competition

Run parallel AI agents with different strategies competing on the same problem.

autoevolve init --agents 3 --tag mar15    # Create 3 agent branches
autoevolve leaderboard                     # Who's winning?
autoevolve pollinate                       # Spread winning ideas

Six built-in strategies are assigned to agents round-robin: Architecture First, Hyperparams First, Optimizer First, Regularization First, Efficiency First, Radical.
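Round-robin assignment can be pictured as indexing the strategy list modulo its length. The following is a minimal sketch, assuming the strategies are assigned in the order listed and that agents are numbered from zero; assign_strategies is a hypothetical name, not part of autoevolve.

```python
STRATEGIES = [
    "Architecture First",
    "Hyperparams First",
    "Optimizer First",
    "Regularization First",
    "Efficiency First",
    "Radical",
]

def assign_strategies(num_agents: int) -> list[str]:
    """Give agent i the strategy at position i modulo the number of strategies."""
    return [STRATEGIES[i % len(STRATEGIES)] for i in range(num_agents)]

print(assign_strategies(3))
# → ['Architecture First', 'Hyperparams First', 'Optimizer First']

# With more than six agents the list wraps around.
print(assign_strategies(8)[6:])
# → ['Architecture First', 'Hyperparams First']
```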

The Experiment Loop

These tools plug into the standard autoresearch loop:

1. autosteer --results results.tsv         # Pick next experiment
2. Implement the suggestion in train.py
3. uv run train.py > run.log 2>&1          # Train
4. autojudge --results results.tsv --run-log run.log  # Evaluate
5. Keep or discard based on verdict
6. Repeat

For multi-agent competitions, each agent runs this loop independently on its own branch, and autoevolve pollinate spreads winning ideas between them.
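Steps 1 to 5 of the loop can be sketched as a small Python driver. The command strings are taken from the steps above; everything else is an assumption: run_iteration, shell_runner, and the implement callback are hypothetical names, and the keep/discard decision assumes autojudge exits with status 0 for a keep. The runner is pluggable so the sequence can be dry-run without any of the tools installed.

```python
import subprocess

def shell_runner(command: str) -> int:
    """Run a command through the shell and return its exit status."""
    return subprocess.run(command, shell=True, check=False).returncode

def run_iteration(runner=shell_runner, implement=lambda: None) -> str:
    """Run steps 1-5 once and return "keep" or "discard".

    Assumption: autojudge exits with status 0 when the verdict is a keep.
    """
    runner("autosteer --results results.tsv")   # 1. pick next experiment
    implement()                                 # 2. edit train.py (agent or human)
    runner("uv run train.py > run.log 2>&1")    # 3. train
    status = runner("autojudge --results results.tsv --run-log run.log")  # 4. evaluate
    return "keep" if status == 0 else "discard"  # 5. decide

# Dry run: a fake runner records the commands instead of executing them.
calls = []

def fake_runner(command: str) -> int:
    calls.append(command)
    return 0

print(run_iteration(runner=fake_runner))
print(len(calls))
```

Step 6 is then a loop around run_iteration, with the caller performing the git commit or reset that matches the returned action.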

AI Agent Integration

The skills/ directory contains Claude Code skill definitions that teach AI agents to use these tools autonomously:

| Skill | Purpose |
| --- | --- |
| autoresearch-evaluate | Run autojudge after every experiment, interpret verdicts |
| autoresearch-steer | Use autosteer for guided experiment selection |
| autoresearch-evolve | Set up and manage multi-agent competitions |

templates/program-addon.md is a drop-in snippet that you can append to your autoresearch program.md to integrate all three tools into the experiment loop.

Quick Reference

# Evaluate
autojudge --results results.tsv --run-log run.log    # Full evaluation
autojudge --results results.tsv --format json         # JSON output
autojudge --results results.tsv --quiet               # One-line verdict

# Steer
autosteer --results results.tsv                       # 5 suggestions, auto strategy
autosteer --results results.tsv --strategy explore    # Favor new directions
autosteer --results results.tsv --num-suggestions 10  # More suggestions

# Evolve
autoevolve init --agents 3 --tag TAG                   # Start competition
autoevolve status                                      # Quick overview
autoevolve leaderboard --detailed                      # Full analysis
autoevolve pollinate                                   # Cross-pollinate wins
autoevolve export --format json -o results.json        # Export data

All tools support --quiet for minimal output and --no-color for plain text. Color is also disabled automatically when output is piped.

Development

git clone https://github.com/dean0x/autolab.git
cd autolab
pip install -e ./auto-judge -e ./auto-steer -e ./auto-evolve
pip install pytest ruff
pytest

See CONTRIBUTING.md for guidelines.

License

MIT
