A comparative study of 5 frontier AI agents as autonomous ML researchers.
We gave Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.4, and Gemini 3 Flash the same ML research task using Karpathy's autoresearch framework, with Opus 4.6 run at both 200K and 1M context for five agent configurations in total. Each agent autonomously hypothesized, coded, trained, and evaluated changes to a neural language model across roughly 163 experiments apiece. 819 experiments later, the differences in how they approach research are striking.
| Rank | Agent | Best val_bpb | Experiments | Success Rate | Categories Explored |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 (1M context) | 0.5757 | 163 | 19.5% | 5 |
| 2 | Claude Sonnet 4.6 (1M context) | 0.5794 | 163 | 21.0% | 5 |
| 3 | Claude Opus 4.6 (200K context) | 0.5802 | 163 | 13.0% | 5 |
| 4 | Gemini 3 Flash | 0.5848 | 167 | 8.2% | 3 |
| 5 | GPT-5.4 | 0.5923 | 163 | 6.2% | 2 |
All agents started from the same baseline: val_bpb = 0.6315 (TinyStories dataset, 11.5M parameter model).
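For context, val_bpb is validation loss expressed in bits per byte: cross-entropy converted from nats to bits and normalized by the raw byte count of the text, which keeps scores comparable even if an agent changes the tokenizer. A minimal sketch of that convention (train.py may compute it differently in detail):

```python
import math

def bits_per_byte(mean_ce_nats: float, num_tokens: int, num_bytes: int) -> float:
    """Mean cross-entropy per token (nats) -> total bits -> per byte of raw text."""
    return (mean_ce_nats * num_tokens / math.log(2)) / num_bytes
```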
Strategy breadth is the strongest predictor of performance (Pearson r = -0.946 between categories explored and best val_bpb, where lower is better). Agents that explored more categories of intervention — hyperparameters, architecture, optimizer, schedule, tokenizer — achieved better results. GPT-5.4 explored only 2 categories and finished last.
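The headline correlation can be sanity-checked directly from the results table above; with only five runs, treat it as descriptive rather than inferential:

```python
import numpy as np

# Categories explored and best val_bpb per run, taken from the results table.
categories   = np.array([5, 5, 5, 3, 2])
best_val_bpb = np.array([0.5757, 0.5794, 0.5802, 0.5848, 0.5923])

r = np.corrcoef(categories, best_val_bpb)[0, 1]
print(f"Pearson r = {r:.3f}")  # approximately -0.95
```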
Context window size eliminates "Groundhog Day" loops. The same model (Opus 4.6) with 1M context ran 5.2x fewer repeated experiments and hit 3.9x shorter plateaus than it did at 200K. At 200K, context compaction caused the agent to forget what it had already tried and re-run old experiments.
Distinct research personalities emerged. Sonnet was an "efficient generalist" with the highest success rate. Opus 1M was a "systematic methodologist" that ran a deliberate series of Adam beta experiments. GPT-5.4 was "tunnel-visioned" — 99% same-category persistence after failure. Gemini collapsed into micro-optimization loops, adjusting learning rates by 0.04% increments.
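"Personality" is partly quantified as Shannon entropy over each agent's category distribution (H7 below reports a range of 0.71 to 2.10 bits). A sketch of the calculation, with illustrative counts rather than the actual per-agent distributions:

```python
import math

def category_entropy(counts: dict[str, int]) -> float:
    """Shannon entropy (bits) of an agent's experiment-category distribution."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values() if n)

# Illustrative counts only; the real distributions live in each experiments.csv.
broad  = {"hyperparameters": 40, "architecture": 40, "optimizer": 40, "schedule": 25, "tokenizer": 18}
narrow = {"hyperparameters": 150, "optimizer": 13}
print(f"broad: {category_entropy(broad):.2f} bits, narrow: {category_entropy(narrow):.2f} bits")
```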
The ability to pivot under failure separates top from bottom performers. Top agents switched strategy categories 56-65% of the time after a reverted experiment. GPT-5.4 switched 1% of the time.
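A rough sketch of how a pivot rate like this can be measured from an experiment log; the kept and category column names are assumptions about the shared CSV schema, so adjust them to the actual experiments.csv headers:

```python
import pandas as pd

def pivot_rate(df: pd.DataFrame) -> float:
    """Fraction of reverted experiments whose next experiment changed category."""
    # Assumes rows are in experiment order, 'kept' is boolean, 'category' is a label.
    next_category = df["category"].shift(-1)                  # category of the following experiment
    after_revert = df[(~df["kept"]) & next_category.notna()]  # reverted runs that have a successor
    switched = next_category[after_revert.index] != after_revert["category"]
    return float(switched.mean())

# e.g. pivot_rate(pd.read_csv("autoresearch-GPT/experiments.csv"))
```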
| # | Hypothesis | Verdict |
|---|---|---|
| H1 | Strategy breadth predicts performance | Strongly supported (r = -0.946) |
| H2 | Strategy shifting when stuck is critical | Supported — pivot rates correlate with final ranking |
| H3 | Context window size affects research quality | Strongly supported — 5.2x fewer repeats at 1M vs 200K |
| H4 | Exploration efficiency varies dramatically | Supported — 3.7x gap between best and worst success rates |
| H5 | Late-stage micro-optimization differs | Supported — Gemini/GPT collapsed, Claude agents adapted |
| H6 | Specific technical discoveries separate performers | Supported — gradient clipping, Adam beta tuning were differentiators |
| H7 | Agent "personality" emerges in research strategy | Strongly supported — category entropy ranges from 0.71 to 2.10 bits |
Full analysis with charts and data in analysis/FINDINGS.md.
The full paper is in paper/main.pdf (LaTeX source in paper/main.tex), formatted for arXiv preprint submission.
O'Donnell, M. (2026). Do LLM Agents Dream of Better Hyperparameters? A Comparative Study of Autonomous ML Research Across Frontier Models. arXiv preprint.
autoresearch-opus/ # Claude Opus 4.6 (200K context) — 163 experiments
autoresearch-opus1mill/ # Claude Opus 4.6 (1M context) — 163 experiments
autoresearch-sonnet/ # Claude Sonnet 4.6 (1M context) — 163 experiments
autoresearch-GPT/ # GPT-5.4 via Codex CLI — 163 experiments
autoresearch-gemini/ # Gemini 3 Flash via Gemini CLI — 167 experiments
analysis/
analyze.py # Python analysis script (pandas, matplotlib, seaborn)
FINDINGS.md # Full findings with hypothesis test results
fig1_convergence.png # Running best val_bpb convergence curves
fig2_scatter.png # All experiments scatter (kept vs reverted)
fig3_categories.png # Category distribution per agent
fig4_category_timeline.png # Category exploration over time
fig5_success_rate.png # Rolling success rate
fig6_improvement_timeline.png # Personal best timeline
fig7_dry_streaks.png # Dry streak distribution
fig8_dashboard.png # Four-panel summary dashboard
paper/
main.tex # LaTeX source (arxiv preprint style)
main.pdf # Compiled paper
references.bib # Bibliography
figures/ # Publication-quality PDF figures
Setup Prompts/ # The exact prompts used to initialize each agent
RESEARCH-BRIEF.md # Research design document with 7 hypotheses
PUBLICATION-STRATEGY.md # Publication planning and framing notes
PROGRESS.md # Session log
Each autoresearch-* directory contains the agent's full run: experiments.csv (the raw data), train.py (as modified by the agent), program.md (the protocol), and STATUS.md (the agent's self-reported status).
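As a starting point for exploring the data, here is a quick per-run summary from the repo root; the val_bpb and kept column names are assumptions about the shared schema and may need adjusting to the actual experiments.csv headers:

```python
from pathlib import Path
import pandas as pd

# Summarize each agent's run: experiment count, best val_bpb, and kept changes.
for csv_path in sorted(Path(".").glob("autoresearch-*/experiments.csv")):
    df = pd.read_csv(csv_path)
    best = df["val_bpb"].min() if "val_bpb" in df else float("nan")
    kept = int(df["kept"].sum()) if "kept" in df else None
    print(f"{csv_path.parent.name}: {len(df)} experiments, best val_bpb {best:.4f}, kept {kept}")
```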
Hardware: M4 Pro Mac Mini, 48GB unified memory (Apple Silicon)
Framework: miolini/autoresearch-macos (macOS fork of Karpathy's autoresearch)
Protocol: Each agent received identical conditions:
- Same codebase, dataset (TinyStories), and baseline model (depth=4, dim=256, 11.5M params)
- Same written protocol (program.md) — hypothesize, implement, train (5 min), evaluate, keep/revert, repeat
- Same CSV schema for logging experiments
- Full autonomy over what changes to make
Agents and CLI tools:
| Agent | Model ID | CLI | Context Window |
|---|---|---|---|
| Claude Opus 4.6 | claude-opus-4-6 | Claude Code | 200K (compacted) |
| Claude Opus 4.6 (1M) | claude-opus-4-6 | Claude Code | 1M |
| Claude Sonnet 4.6 | claude-sonnet-4-6 | Claude Code | 1M |
| GPT-5.4 | gpt-5.4 | Codex CLI | 272K |
| Gemini 3 Flash | gemini-3-flash-preview | Gemini CLI | 1M |
Clone and explore the data:
git clone https://github.com/matthewod11-stack/Dreamofhyperparameters.git
cd Dreamofhyperparameters

Run the analysis:
cd analysis
pip install pandas matplotlib seaborn numpy
python analyze.py

Compile the paper (requires a LaTeX distribution or tectonic):
cd paper
tectonic main.tex

- Single task: All results are on TinyStories language modeling. Generalization is unknown.
- Gemini used Flash, not Pro. Results may not represent Gemini 3 Pro's capabilities.
- No statistical replication. Each agent was run once. Variance across runs is unknown.
- 5-minute training budget favors certain interventions over others.
- Single hardware platform (Apple Silicon). Results may differ on GPU clusters.
This research was AI-assisted throughout. The experiment was designed, controlled, and directed by a human researcher. The analysis code, visualizations, and paper draft were generated by Claude Code with human editorial direction. The setup prompts and research brief in this repo document the full chain of direction.
@article{odonnell2026llmagents,
title={Do LLM Agents Dream of Better Hyperparameters? A Comparative Study of Autonomous ML Research Across Frontier Models},
author={O'Donnell, Matthew},
year={2026},
note={arXiv preprint (forthcoming)}
}