A comparative study of 5 frontier AI agents as autonomous ML researchers.
We gave Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.4, and Gemini 3 Flash the same ML research task using Karpathy's autoresearch framework, with Opus 4.6 run at both 200K and 1M context for five agent configurations in total. Each agent autonomously hypothesized, coded, trained, and evaluated changes to a neural language model across roughly 163 experiments apiece. 819 experiments later, the differences in how they approach research are striking.
| Rank | Agent | Best val_bpb | Experiments | Success Rate | Categories Explored |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 (1M context) | 0.5757 | 163 | 19.5% | 5 |
| 2 | Claude Sonnet 4.6 (1M context) | 0.5794 | 163 | 21.0% | 5 |
| 3 | Claude Opus 4.6 (200K context) | 0.5802 | 163 | 13.0% | 5 |
| 4 | Gemini 3 Flash | 0.5848 | 167 | 8.2% | 3 |
| 5 | GPT-5.4 | 0.5923 | 163 | 6.2% | 2 |
All agents started from the same baseline: val_bpb = 0.6315 (TinyStories dataset, 11.5M parameter model).
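For context, val_bpb is validation loss expressed in bits per byte: cross-entropy converted from nats to bits and normalized by the raw byte count of the text, which keeps scores comparable even if an agent changes the tokenizer. A minimal sketch of that convention (train.py may compute it differently in detail):

```python
import math

def bits_per_byte(mean_ce_nats: float, num_tokens: int, num_bytes: int) -> float:
    """Mean cross-entropy per token (nats) -> total bits -> per byte of raw text."""
    return (mean_ce_nats * num_tokens / math.log(2)) / num_bytes
```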
Strategy breadth is the strongest predictor of performance (Pearson r = -0.946 between categories explored and best val_bpb, where lower is better). Agents that explored more categories of intervention — hyperparameters, architecture, optimizer, schedule, tokenizer — achieved better results. GPT-5.4 explored only 2 categories and finished last.
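The headline correlation can be sanity-checked directly from the results table above; with only five runs, treat it as descriptive rather than inferential:

```python
import numpy as np

# Categories explored and best val_bpb per run, taken from the results table.
categories   = np.array([5, 5, 5, 3, 2])
best_val_bpb = np.array([0.5757, 0.5794, 0.5802, 0.5848, 0.5923])

r = np.corrcoef(categories, best_val_bpb)[0, 1]
print(f"Pearson r = {r:.3f}")  # approximately -0.95
```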
Context window size eliminates "Groundhog Day" loops. The same model (Opus 4.6) with 1M context ran 5.2x fewer repeated experiments and hit 3.9x shorter plateaus than it did at 200K. At 200K, context compaction caused the agent to forget what it had already tried and re-run old experiments.
Distinct research personalities emerged. Sonnet was an "efficient generalist" with the highest success rate. Opus 1M was a "systematic methodologist" that ran a deliberate series of Adam beta experiments. GPT-5.4 was "tunnel-visioned" — 99% same-category persistence after failure. Gemini collapsed into micro-optimization loops, adjusting learning rates by 0.04% increments.
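"Personality" is partly quantified as Shannon entropy over each agent's category distribution (H7 below reports a range of 0.71 to 2.10 bits). A sketch of the calculation, with illustrative counts rather than the actual per-agent distributions:

```python
import math

def category_entropy(counts: dict[str, int]) -> float:
    """Shannon entropy (bits) of an agent's experiment-category distribution."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values() if n)

# Illustrative counts only; the real distributions live in each experiments.csv.
broad  = {"hyperparameters": 40, "architecture": 40, "optimizer": 40, "schedule": 25, "tokenizer": 18}
narrow = {"hyperparameters": 150, "optimizer": 13}
print(f"broad: {category_entropy(broad):.2f} bits, narrow: {category_entropy(narrow):.2f} bits")
```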
The ability to pivot under failure separates top from bottom performers. Top agents switched strategy categories 56-65% of the time after a reverted experiment. GPT-5.4 switched 1% of the time.
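A rough sketch of how a pivot rate like this can be measured from an experiment log; the kept and category column names are assumptions about the shared CSV schema, so adjust them to the actual experiments.csv headers:

```python
import pandas as pd

def pivot_rate(df: pd.DataFrame) -> float:
    """Fraction of reverted experiments whose next experiment changed category."""
    # Assumes rows are in experiment order, 'kept' is boolean, 'category' is a label.
    next_category = df["category"].shift(-1)                  # category of the following experiment
    after_revert = df[(~df["kept"]) & next_category.notna()]  # reverted runs that have a successor
    switched = next_category[after_revert.index] != after_revert["category"]
    return float(switched.mean())

# e.g. pivot_rate(pd.read_csv("autoresearch-GPT/experiments.csv"))
```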
| # | Hypothesis | Verdict |
|---|---|---|
| H1 | Strategy breadth predicts performance | Strongly supported (r = -0.946) |
| H2 | Strategy shifting when stuck is critical | Supported — pivot rates correlate with final ranking |
| H3 | Context window size affects research quality | Strongly supported — 5.2x fewer repeats at 1M vs 200K |
| H4 | Exploration efficiency varies dramatically | Supported — 3.7x gap between best and worst success rates |
| H5 | Late-stage micro-optimization differs | Supported — Gemini/GPT collapsed, Claude agents adapted |
| H6 | Specific technical discoveries separate performers | Supported — gradient clipping, Adam beta tuning were differentiators |
| H7 | Agent "personality" emerges in research strategy | Strongly supported — category entropy ranges from 0.71 to 2.10 bits |
Full analysis with charts and data in analysis/FINDINGS.md.
The full paper is in paper/main.pdf (LaTeX source in paper/main.tex), formatted for arXiv preprint submission.
O'Donnell, M. (2026). Do LLM Agents Dream of Better Hyperparameters? A Comparative Study of Autonomous ML Research Across Frontier Models. arXiv preprint.
autoresearch-opus/ # Claude Opus 4.6 (200K context) — 163 experiments
autoresearch-opus1mill/ # Claude Opus 4.6 (1M context) — 163 experiments
autoresearch-sonnet/ # Claude Sonnet 4.6 (1M context) — 163 experiments
autoresearch-GPT/ # GPT-5.4 via Codex CLI — 163 experiments
autoresearch-gemini/ # Gemini 3 Flash via Gemini CLI — 167 experiments
analysis/
analyze.py # Python analysis script (pandas, matplotlib, seaborn)
FINDINGS.md # Full findings with hypothesis test results
fig1_convergence.png # Running best val_bpb convergence curves
fig2_scatter.png # All experiments scatter (kept vs reverted)
fig3_categories.png # Category distribution per agent
fig4_category_timeline.png # Category exploration over time
fig5_success_rate.png # Rolling success rate
fig6_improvement_timeline.png # Personal best timeline
fig7_dry_streaks.png # Dry streak distribution
fig8_dashboard.png # Four-panel summary dashboard
paper/
main.tex # LaTeX source (arxiv preprint style)
main.pdf # Compiled paper
references.bib # Bibliography
figures/ # Publication-quality PDF figures
Setup Prompts/ # The exact prompts used to initialize each agent
RESEARCH-BRIEF.md # Research design document with 7 hypotheses
PUBLICATION-STRATEGY.md # Publication planning and framing notes
PROGRESS.md # Session log
Each autoresearch-* directory contains the agent's full run: experiments.csv (the raw data), train.py (as modified by the agent), program.md (the protocol), and STATUS.md (the agent's self-reported status).
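As a starting point for exploring the data, here is a quick per-run summary from the repo root; the val_bpb and kept column names are assumptions about the shared schema and may need adjusting to the actual experiments.csv headers:

```python
from pathlib import Path
import pandas as pd

# Summarize each agent's run: experiment count, best val_bpb, and kept changes.
for csv_path in sorted(Path(".").glob("autoresearch-*/experiments.csv")):
    df = pd.read_csv(csv_path)
    best = df["val_bpb"].min() if "val_bpb" in df else float("nan")
    kept = int(df["kept"].sum()) if "kept" in df else None
    print(f"{csv_path.parent.name}: {len(df)} experiments, best val_bpb {best:.4f}, kept {kept}")
```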
Hardware: M4 Pro Mac Mini, 48GB unified memory (Apple Silicon)
Framework: miolini/autoresearch-macos (macOS fork of Karpathy's autoresearch)
Protocol: Each agent received identical conditions:
- Same codebase, dataset (TinyStories), and baseline model (depth=4, dim=256, 11.5M params)
- Same written protocol (program.md) — hypothesize, implement, train (5 min), evaluate, keep/revert, repeat
- Same CSV schema for logging experiments
- Full autonomy over what changes to make
Agents and CLI tools:
| Agent | Model ID | CLI | Context Window |
|---|---|---|---|
| Claude Opus 4.6 | claude-opus-4-6 | Claude Code | 200K (compacted) |
| Claude Opus 4.6 (1M) | claude-opus-4-6 | Claude Code | 1M |
| Claude Sonnet 4.6 | claude-sonnet-4-6 | Claude Code | 1M |
| GPT-5.4 | gpt-5.4 | Codex CLI | 272K |
| Gemini 3 Flash | gemini-3-flash-preview | Gemini CLI | 1M |
Clone and explore the data:
git clone https://github.com/matthewod11-stack/Dreamofhyperparameters.git
cd Dreamofhyperparameters

Run the analysis:
cd analysis
pip install pandas matplotlib seaborn numpy
python analyze.py

Compile the paper (requires a LaTeX distribution or tectonic):
cd paper
tectonic main.tex

- Single task: All results are on TinyStories language modeling. Generalization is unknown.
- Gemini used Flash, not Pro. Results may not represent Gemini 3 Pro's capabilities.
- No statistical replication. Each agent was run once. Variance across runs is unknown.
- 5-minute training budget favors certain interventions over others.
- Single hardware platform (Apple Silicon). Results may differ on GPU clusters.
This research was AI-assisted throughout. The experiment was designed, controlled, and directed by a human researcher. The analysis code, visualizations, and paper draft were generated by Claude Code with human editorial direction. The setup prompts and research brief in this repo document the full chain of direction.
@article{odonnell2026llmagents,
title={Do LLM Agents Dream of Better Hyperparameters? A Comparative Study of Autonomous ML Research Across Frontier Models},
author={O'Donnell, Matthew},
year={2026},
note={arXiv preprint (forthcoming)}
}