Companion tools for Karpathy's autoresearch — the autonomous AI research framework for GPT pretraining.
Three CLIs that make the autoresearch experiment loop smarter: evaluate results statistically, get data-driven suggestions for what to try next, and run multi-agent competitions.
```
pip install autojudge autosteer autoevolve
```

Requires Python >= 3.10 (matching autoresearch).
Replaces eyeballing val_bpb with statistical verdicts that account for noise floor, Pareto efficiency, and trend context.
```
$ autojudge --results results.tsv --run-log run.log

Experiment #14: val_bpb 3.91 → 3.87
Verdict: KEEP (confidence: 82%)
Delta: -1.01% (2.0x noise floor)
Pareto frontier: yes
Suggestion: Improvement looks real. Commit and continue.
```
Verdicts: STRONG_KEEP | KEEP | MARGINAL | RETEST | DISCARD | CRASH
Exit codes enable scripting:

```
if autojudge --results results.tsv; then git commit -m "keep"; else git reset --hard HEAD~1; fi
```
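The statistical check behind these verdicts can be sketched in a few lines of Python. The thresholds and the 0.5% noise floor below are illustrative assumptions, not autojudge's actual values:

```python
def verdict(prev_bpb: float, new_bpb: float, noise_floor_pct: float = 0.5) -> str:
    """Classify a val_bpb change against a noise floor (illustrative thresholds)."""
    delta_pct = (new_bpb - prev_bpb) / prev_bpb * 100  # negative = improvement
    ratio = abs(delta_pct) / noise_floor_pct           # multiples of run-to-run noise
    if delta_pct < 0 and ratio >= 3:
        return "STRONG_KEEP"
    if delta_pct < 0 and ratio >= 2:
        return "KEEP"
    if ratio < 1:
        return "RETEST"  # change is within the noise floor
    if delta_pct < 0:
        return "MARGINAL"
    return "DISCARD"

print(verdict(3.91, 3.87))  # KEEP: a ~1% drop is ~2x a 0.5% noise floor
```

The point is that "smaller val_bpb" alone is not a keep signal; the improvement has to clear the run-to-run noise by a comfortable margin.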
Analyzes experiment history and suggests what to try next. Stops the random walk.
```
$ autosteer --results results.tsv

[1] [EXPLOIT] Tune learning rate warmup schedule    risk: low
    Rationale: Learning rate experiments have 3 keeps in 4 attempts.
[2] [EXPLORE] Try rotary position embeddings        risk: medium
    Rationale: Positional encoding category untested. High potential.
[3] [EXPLOIT] Increase batch size from 32 to 48     risk: low
    Rationale: Batch size increase worked well in experiment #12.
```
Strategy modes: auto (default) | explore (when stuck) | exploit (when winning)
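One way an explore/exploit split like this can fall out of experiment history is per-category keep rates. The sketch below is a guess at the idea, not autosteer's actual heuristic, and the `history` format is a hypothetical stand-in for what it reads from results.tsv:

```python
from collections import defaultdict

def label_categories(history):
    """Label each category EXPLOIT (proven by keeps) or EXPLORE (untested/unproven).

    `history` is a list of (category, kept) pairs -- hypothetical input format.
    """
    stats = defaultdict(lambda: [0, 0])  # category -> [keeps, attempts]
    for category, kept in history:
        stats[category][1] += 1
        stats[category][0] += int(kept)
    return {
        category: "EXPLOIT" if keeps / attempts >= 0.5 else "EXPLORE"
        for category, (keeps, attempts) in stats.items()
    }

history = [("learning_rate", True)] * 3 + [("learning_rate", False), ("batch_size", True)]
print(label_categories(history))  # learning_rate has 3/4 keeps -> EXPLOIT
```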
Run parallel AI agents with different strategies competing on the same problem.
```
autoevolve init --agents 3 --tag mar15   # Create 3 agent branches
autoevolve leaderboard                   # Who's winning?
autoevolve pollinate                     # Spread winning ideas
```

6 built-in strategies assigned round-robin: Architecture First, Hyperparams First, Optimizer First, Regularization First, Efficiency First, Radical.
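The round-robin assignment is simple enough to sketch. The agent branch names here are hypothetical, not necessarily what `autoevolve init` creates:

```python
STRATEGIES = [
    "Architecture First", "Hyperparams First", "Optimizer First",
    "Regularization First", "Efficiency First", "Radical",
]

def assign_strategies(num_agents: int) -> dict[str, str]:
    """Assign the 6 built-in strategies to agents round-robin."""
    return {f"agent-{i + 1}": STRATEGIES[i % len(STRATEGIES)] for i in range(num_agents)}

print(assign_strategies(3))
# {'agent-1': 'Architecture First', 'agent-2': 'Hyperparams First', 'agent-3': 'Optimizer First'}
```

With more than 6 agents, strategies wrap around, so a 7th agent would get Architecture First again.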
These tools plug into the standard autoresearch loop:
1. `autosteer --results results.tsv` — pick the next experiment
2. Implement the suggestion in `train.py`
3. `uv run train.py > run.log 2>&1` — train
4. `autojudge --results results.tsv --run-log run.log` — evaluate
5. Keep or discard based on the verdict
6. Repeat
For multi-agent competitions, each agent runs this loop independently on its own branch, and autoevolve pollinate spreads winning ideas between them.
The skills/ directory contains Claude Code skill definitions that teach AI agents to use these tools autonomously:
| Skill | Purpose |
|---|---|
| `autoresearch-evaluate` | Run autojudge after every experiment, interpret verdicts |
| `autoresearch-steer` | Use autosteer for guided experiment selection |
| `autoresearch-evolve` | Set up and manage multi-agent competitions |
`templates/program-addon.md` is a drop-in snippet you can append to your autoresearch `program.md` to integrate all three tools into the experiment loop.
```
# Evaluate
autojudge --results results.tsv --run-log run.log     # Full evaluation
autojudge --results results.tsv --format json         # JSON output
autojudge --results results.tsv --quiet               # One-line verdict

# Steer
autosteer --results results.tsv                       # 5 suggestions, auto strategy
autosteer --results results.tsv --strategy explore    # Favor new directions
autosteer --results results.tsv --num-suggestions 10  # More suggestions

# Evolve
autoevolve init --agents 3 --tag TAG                  # Start competition
autoevolve status                                     # Quick overview
autoevolve leaderboard --detailed                     # Full analysis
autoevolve pollinate                                  # Cross-pollinate wins
autoevolve export --format json -o results.json       # Export data
```

All tools support `--quiet` for minimal output and `--no-color` for plain text (auto-disabled when piped).
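The JSON export lends itself to downstream scripting. The schema below is a hypothetical example, not autoevolve's documented format; it just shows the kind of ranking you might build from exported agent results:

```python
import json

# Hypothetical sample of `autoevolve export --format json` output; the real
# schema may differ. We rank agents by best val_bpb (lower is better).
sample = json.loads("""
{"agents": [
  {"name": "agent-1", "strategy": "Architecture First", "best_val_bpb": 3.87},
  {"name": "agent-2", "strategy": "Optimizer First",    "best_val_bpb": 3.91},
  {"name": "agent-3", "strategy": "Radical",            "best_val_bpb": 3.84}
]}
""")

leaderboard = sorted(sample["agents"], key=lambda a: a["best_val_bpb"])
for rank, agent in enumerate(leaderboard, 1):
    print(f"{rank}. {agent['name']}  {agent['best_val_bpb']:.2f}  ({agent['strategy']})")
```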
```
git clone https://github.com/dean0x/autolab.git
cd autolab
pip install -e ./auto-judge -e ./auto-steer -e ./auto-evolve
pip install pytest ruff
pytest
```

See CONTRIBUTING.md for guidelines.