Companion tools for Karpathy's autoresearch — the autonomous AI research framework for GPT pretraining.
Three CLIs that make the autoresearch experiment loop smarter: evaluate results statistically, get data-driven suggestions for what to try next, and run multi-agent competitions.
```
pip install autojudge autosteer autoevolve
```

Requires Python >= 3.10 (matching autoresearch).
Replaces eyeballing val_bpb with statistical verdicts that account for noise floor, Pareto efficiency, and trend context.
```
$ autojudge --results results.tsv --run-log run.log

Experiment #14: val_bpb 3.91 → 3.87
Verdict: KEEP (confidence: 82%)
Delta: -1.01% (2.0x noise floor)
Pareto frontier: yes
Suggestion: Improvement looks real. Commit and continue.
```
Verdicts: STRONG_KEEP | KEEP | MARGINAL | RETEST | DISCARD | CRASH
Exit codes enable scripting:

```
if autojudge --results results.tsv; then git commit -m "keep"; else git reset --hard HEAD~1; fi
```
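The statistical check behind these verdicts can be sketched in a few lines of Python. The thresholds and the 0.5% noise floor below are illustrative assumptions, not autojudge's actual values:

```python
def verdict(prev_bpb: float, new_bpb: float, noise_floor_pct: float = 0.5) -> str:
    """Classify a val_bpb change against a noise floor (illustrative thresholds)."""
    delta_pct = (new_bpb - prev_bpb) / prev_bpb * 100  # negative = improvement
    ratio = abs(delta_pct) / noise_floor_pct           # multiples of run-to-run noise
    if delta_pct < 0 and ratio >= 3:
        return "STRONG_KEEP"
    if delta_pct < 0 and ratio >= 2:
        return "KEEP"
    if ratio < 1:
        return "RETEST"  # change is within the noise floor
    if delta_pct < 0:
        return "MARGINAL"
    return "DISCARD"

print(verdict(3.91, 3.87))  # KEEP: a ~1% drop is ~2x a 0.5% noise floor
```

The point is that "smaller val_bpb" alone is not a keep signal; the improvement has to clear the run-to-run noise by a comfortable margin.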
Analyzes experiment history and suggests what to try next. Stops the random walk.
```
$ autosteer --results results.tsv

[1] [EXPLOIT] Tune learning rate warmup schedule    risk: low
    Rationale: Learning rate experiments have 3 keeps in 4 attempts.
[2] [EXPLORE] Try rotary position embeddings        risk: medium
    Rationale: Positional encoding category untested. High potential.
[3] [EXPLOIT] Increase batch size from 32 to 48     risk: low
    Rationale: Batch size increase worked well in experiment #12.
```
Strategy modes: auto (default) | explore (when stuck) | exploit (when winning)
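One way an explore/exploit split like this can fall out of experiment history is per-category keep rates. The sketch below is a guess at the idea, not autosteer's actual heuristic, and the `history` format is a hypothetical stand-in for what it reads from results.tsv:

```python
from collections import defaultdict

def label_categories(history):
    """Label each category EXPLOIT (proven by keeps) or EXPLORE (untested/unproven).

    `history` is a list of (category, kept) pairs -- hypothetical input format.
    """
    stats = defaultdict(lambda: [0, 0])  # category -> [keeps, attempts]
    for category, kept in history:
        stats[category][1] += 1
        stats[category][0] += int(kept)
    return {
        category: "EXPLOIT" if keeps / attempts >= 0.5 else "EXPLORE"
        for category, (keeps, attempts) in stats.items()
    }

history = [("learning_rate", True)] * 3 + [("learning_rate", False), ("batch_size", True)]
print(label_categories(history))  # learning_rate has 3/4 keeps -> EXPLOIT
```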
Run parallel AI agents with different strategies competing on the same problem.
```
autoevolve init --agents 3 --tag mar15   # Create 3 agent branches
autoevolve leaderboard                   # Who's winning?
autoevolve pollinate                     # Spread winning ideas
```

6 built-in strategies assigned round-robin: Architecture First, Hyperparams First, Optimizer First, Regularization First, Efficiency First, Radical.
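The round-robin assignment is simple enough to sketch. The agent branch names here are hypothetical, not necessarily what `autoevolve init` creates:

```python
STRATEGIES = [
    "Architecture First", "Hyperparams First", "Optimizer First",
    "Regularization First", "Efficiency First", "Radical",
]

def assign_strategies(num_agents: int) -> dict[str, str]:
    """Assign the 6 built-in strategies to agents round-robin."""
    return {f"agent-{i + 1}": STRATEGIES[i % len(STRATEGIES)] for i in range(num_agents)}

print(assign_strategies(3))
# {'agent-1': 'Architecture First', 'agent-2': 'Hyperparams First', 'agent-3': 'Optimizer First'}
```

With more than 6 agents, strategies wrap around, so a 7th agent would get Architecture First again.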
These tools plug into the standard autoresearch loop:
1. `autosteer --results results.tsv` — pick the next experiment
2. Implement the suggestion in `train.py`
3. `uv run train.py > run.log 2>&1` — train
4. `autojudge --results results.tsv --run-log run.log` — evaluate
5. Keep or discard based on the verdict
6. Repeat
For multi-agent competitions, each agent runs this loop independently on its own branch, and autoevolve pollinate spreads winning ideas between them.
The skills/ directory contains Claude Code skill definitions that teach AI agents to use these tools autonomously:
| Skill | Purpose |
|---|---|
| `autoresearch-evaluate` | Run autojudge after every experiment, interpret verdicts |
| `autoresearch-steer` | Use autosteer for guided experiment selection |
| `autoresearch-evolve` | Set up and manage multi-agent competitions |
`templates/program-addon.md` is a drop-in snippet you can append to your autoresearch `program.md` to integrate all three tools into the experiment loop.
```
# Evaluate
autojudge --results results.tsv --run-log run.log     # Full evaluation
autojudge --results results.tsv --format json         # JSON output
autojudge --results results.tsv --quiet               # One-line verdict

# Steer
autosteer --results results.tsv                       # 5 suggestions, auto strategy
autosteer --results results.tsv --strategy explore    # Favor new directions
autosteer --results results.tsv --num-suggestions 10  # More suggestions

# Evolve
autoevolve init --agents 3 --tag TAG                  # Start competition
autoevolve status                                     # Quick overview
autoevolve leaderboard --detailed                     # Full analysis
autoevolve pollinate                                  # Cross-pollinate wins
autoevolve export --format json -o results.json       # Export data
```

All tools support `--quiet` for minimal output and `--no-color` for plain text (auto-disabled when piped).
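The JSON export lends itself to downstream scripting. The schema below is a hypothetical example, not autoevolve's documented format; it just shows the kind of ranking you might build from exported agent results:

```python
import json

# Hypothetical sample of `autoevolve export --format json` output; the real
# schema may differ. We rank agents by best val_bpb (lower is better).
sample = json.loads("""
{"agents": [
  {"name": "agent-1", "strategy": "Architecture First", "best_val_bpb": 3.87},
  {"name": "agent-2", "strategy": "Optimizer First",    "best_val_bpb": 3.91},
  {"name": "agent-3", "strategy": "Radical",            "best_val_bpb": 3.84}
]}
""")

leaderboard = sorted(sample["agents"], key=lambda a: a["best_val_bpb"])
for rank, agent in enumerate(leaderboard, 1):
    print(f"{rank}. {agent['name']}  {agent['best_val_bpb']:.2f}  ({agent['strategy']})")
```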
```
git clone https://github.com/dean0x/autolab.git
cd autolab
pip install -e ./auto-judge -e ./auto-steer -e ./auto-evolve
pip install pytest ruff
pytest
```

See CONTRIBUTING.md for guidelines.