Do LLM Agents Dream of Better Hyperparameters?

A comparative study of 5 frontier AI agents as autonomous ML researchers.

We gave four frontier models (Claude Opus 4.6 at both 200K and 1M context, Claude Sonnet 4.6, GPT-5.4, and Gemini 3 Flash) the same ML research task using Karpathy's autoresearch framework. Each of the five agent configurations autonomously hypothesized, coded, trained, and evaluated changes to a neural language model over 163-167 experiments. 819 experiments later, the differences in how they approach research are striking.

[Figure: Convergence curves for all 5 agents]


Key Results

| Rank | Agent | Best val_bpb | Experiments | Success Rate | Categories Explored |
|------|-------|--------------|-------------|--------------|---------------------|
| 1 | Claude Opus 4.6 (1M context) | 0.5757 | 163 | 19.5% | 5 |
| 2 | Claude Sonnet 4.6 (1M context) | 0.5794 | 163 | 21.0% | 5 |
| 3 | Claude Opus 4.6 (200K context) | 0.5802 | 163 | 13.0% | 5 |
| 4 | Gemini 3 Flash | 0.5848 | 167 | 8.2% | 3 |
| 5 | GPT-5.4 | 0.5923 | 163 | 6.2% | 2 |

All agents started from the same baseline: val_bpb = 0.6315 (validation bits per byte; TinyStories dataset, 11.5M-parameter model).

What We Found

Strategy breadth is the strongest predictor of performance (Pearson r = -0.946 between categories explored and best val_bpb, where lower is better). Agents that explored more categories of intervention — hyperparameters, architecture, optimizer, schedule, tokenizer — achieved better results. GPT-5.4 explored only 2 categories and finished last.
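
As a sanity check, that correlation can be recomputed directly from the Key Results table above; a minimal sketch using scipy, with the five data points taken from the table:

```python
# Recompute the breadth-vs-performance correlation from the Key Results table.
import numpy as np
from scipy.stats import pearsonr

categories = np.array([5, 5, 5, 3, 2])                              # categories explored per agent
best_val_bpb = np.array([0.5757, 0.5794, 0.5802, 0.5848, 0.5923])   # best val_bpb per agent

r, p = pearsonr(categories, best_val_bpb)
print(f"r = {r:.3f}")  # ~ -0.946; negative because lower val_bpb is better
```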

Context window size eliminates "Groundhog Day" loops. With 1M context, the same model (Opus 4.6) ran 5.2x fewer repeated experiments and hit 3.9x shorter plateaus than with 200K context. At 200K, context compaction caused the agent to forget what it had already tried and repeat it.

Distinct research personalities emerged. Sonnet was an "efficient generalist" with the highest success rate. Opus 1M was a "systematic methodologist" that ran a deliberate series of Adam beta experiments. GPT-5.4 was "tunnel-visioned", with 99% same-category persistence after failure. Gemini collapsed into micro-optimization loops, adjusting learning rates in 0.04% increments.

The ability to pivot under failure separates top from bottom performers. Top agents switched strategy categories 56-65% of the time after a reverted experiment. GPT-5.4 switched 1% of the time.
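
A pivot rate like this can be computed from each run's experiments.csv; a minimal sketch, assuming the CSV has category and kept columns (check the actual schema in the repo before relying on this):

```python
# Sketch: how often an agent switched strategy category right after a reverted experiment.
# Assumes experiments.csv has a 'category' column and a boolean-like 'kept' column;
# adjust both names to the actual CSV schema used in the runs.
import pandas as pd

def pivot_rate(csv_path: str) -> float:
    df = pd.read_csv(csv_path)
    switches = opportunities = 0
    for i in range(1, len(df)):
        if not df.loc[i - 1, "kept"]:          # previous experiment was reverted
            opportunities += 1
            if df.loc[i, "category"] != df.loc[i - 1, "category"]:
                switches += 1                  # the agent pivoted to a different category
    return switches / opportunities if opportunities else float("nan")

print(pivot_rate("autoresearch-GPT/experiments.csv"))
```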

Seven Hypotheses Tested

| # | Hypothesis | Verdict |
|---|------------|---------|
| H1 | Strategy breadth predicts performance | Strongly supported (r = -0.946) |
| H2 | Strategy shifting when stuck is critical | Supported — pivot rates correlate with final ranking |
| H3 | Context window size affects research quality | Strongly supported — 5.2x fewer repeats at 1M vs 200K |
| H4 | Exploration efficiency varies dramatically | Supported — 3.7x gap between best and worst success rates |
| H5 | Late-stage micro-optimization differs | Supported — Gemini/GPT collapsed, Claude agents adapted |
| H6 | Specific technical discoveries separate performers | Supported — gradient clipping, Adam beta tuning were differentiators |
| H7 | Agent "personality" emerges in research strategy | Strongly supported — category entropy ranges from 0.71 to 2.10 bits |

Full analysis with charts and data in analysis/FINDINGS.md.
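
The category entropy cited in H7 reads like the Shannon entropy of each agent's distribution over strategy categories; a minimal sketch of that metric, assuming a category column in experiments.csv:

```python
# Shannon entropy (in bits) of an agent's strategy-category distribution.
# Assumes experiments.csv has a 'category' column; adjust to the actual schema.
import numpy as np
import pandas as pd

def category_entropy_bits(csv_path: str) -> float:
    counts = pd.read_csv(csv_path)["category"].value_counts()
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# For scale: two equally used categories give 1.0 bit; an even spread over
# five categories gives log2(5) ~ 2.32 bits.
```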

Paper

The full paper is in paper/main.pdf (LaTeX source in paper/main.tex), formatted for arXiv preprint submission.

O'Donnell, M. (2026). Do LLM Agents Dream of Better Hyperparameters? A Comparative Study of Autonomous ML Research Across Frontier Models. arXiv preprint.

Repo Structure

autoresearch-opus/          # Claude Opus 4.6 (200K context) — 163 experiments
autoresearch-opus1mill/     # Claude Opus 4.6 (1M context) — 163 experiments
autoresearch-sonnet/        # Claude Sonnet 4.6 (1M context) — 163 experiments
autoresearch-GPT/           # GPT-5.4 via Codex CLI — 163 experiments
autoresearch-gemini/        # Gemini 3 Flash via Gemini CLI — 167 experiments

analysis/
  analyze.py                # Python analysis script (pandas, matplotlib, seaborn)
  FINDINGS.md               # Full findings with hypothesis test results
  fig1_convergence.png      # Running best val_bpb convergence curves
  fig2_scatter.png          # All experiments scatter (kept vs reverted)
  fig3_categories.png       # Category distribution per agent
  fig4_category_timeline.png # Category exploration over time
  fig5_success_rate.png     # Rolling success rate
  fig6_improvement_timeline.png # Personal best timeline
  fig7_dry_streaks.png      # Dry streak distribution
  fig8_dashboard.png        # Four-panel summary dashboard

paper/
  main.tex                  # LaTeX source (arxiv preprint style)
  main.pdf                  # Compiled paper
  references.bib            # Bibliography
  figures/                  # Publication-quality PDF figures

Setup Prompts/              # The exact prompts used to initialize each agent
RESEARCH-BRIEF.md           # Research design document with 7 hypotheses
PUBLICATION-STRATEGY.md     # Publication planning and framing notes
PROGRESS.md                 # Session log

Each autoresearch-* directory contains the agent's full run: experiments.csv (the raw data), train.py (as modified by the agent), program.md (the protocol), and STATUS.md (the agent's self-reported status).
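
A minimal sketch of exploring the raw data, assuming each experiments.csv records one row per experiment with a val_bpb column (verify column names against the actual schema first):

```python
# Sketch: running-best val_bpb per run, in the spirit of fig1_convergence.png.
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt

runs = ["autoresearch-opus", "autoresearch-opus1mill", "autoresearch-sonnet",
        "autoresearch-GPT", "autoresearch-gemini"]

for run in runs:
    df = pd.read_csv(Path(run) / "experiments.csv")
    plt.plot(df["val_bpb"].cummin(), label=run)        # running best; lower is better

plt.axhline(0.6315, linestyle="--", color="gray", label="baseline")
plt.xlabel("experiment #")
plt.ylabel("best val_bpb so far")
plt.legend()
plt.show()
```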

Methodology

Hardware: M4 Pro Mac Mini, 48GB unified memory (Apple Silicon)

Framework: miolini/autoresearch-macos (macOS fork of Karpathy's autoresearch)

Protocol: Each agent received identical conditions:

  • Same codebase, dataset (TinyStories), and baseline model (depth=4, dim=256, 11.5M params)
  • Same written protocol (program.md) — hypothesize, implement, train (5 min), evaluate, keep/revert, repeat (sketched after this list)
  • Same CSV schema for logging experiments
  • Full autonomy over what changes to make
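
A schematic of that per-experiment loop follows; it is a self-contained toy, not the actual autoresearch harness (in the real runs the agent itself proposes the hypothesis, edits train.py, and a genuine 5-minute training run produces val_bpb):

```python
# Toy schematic of the protocol each agent followed.
import random

def train_and_eval(minutes: int = 5) -> float:
    """Stand-in for a real 5-minute training run; returns a val_bpb score."""
    return random.uniform(0.57, 0.64)

best_bpb = 0.6315                           # shared baseline
log = []
for i in range(163):                        # each agent ran ~163 experiments
    # (agent proposes a hypothesis and applies a code change here)
    val_bpb = train_and_eval(minutes=5)     # 5-minute training budget per experiment
    kept = val_bpb < best_bpb               # keep only if it beats the running best
    if kept:
        best_bpb = val_bpb
    # (otherwise the agent reverts its change before the next experiment)
    log.append({"experiment": i, "val_bpb": val_bpb, "kept": kept})
```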

Agents and CLI tools:

| Agent | Model ID | CLI | Context Window |
|-------|----------|-----|----------------|
| Claude Opus 4.6 | claude-opus-4-6 | Claude Code | 200K (compacted) |
| Claude Opus 4.6 (1M) | claude-opus-4-6 | Claude Code | 1M |
| Claude Sonnet 4.6 | claude-sonnet-4-6 | Claude Code | 1M |
| GPT-5.4 | gpt-5.4 | Codex CLI | 272K |
| Gemini 3 Flash | gemini-3-flash-preview | Gemini CLI | 1M |

Reproduce

Clone and explore the data:

git clone https://github.com/matthewod11-stack/Dreamofhyperparameters.git
cd Dreamofhyperparameters

Run the analysis:

cd analysis
pip install pandas matplotlib seaborn numpy
python analyze.py

Compile the paper (requires a LaTeX distribution or tectonic):

cd paper
tectonic main.tex

Limitations

  • Single task: All results are on TinyStories language modeling. Generalization is unknown.
  • Gemini used Flash, not Pro. Results may not represent Gemini 3 Pro's capabilities.
  • No statistical replication. Each agent was run once. Variance across runs is unknown.
  • 5-minute training budget favors certain interventions over others.
  • Single hardware platform (Apple Silicon). Results may differ on GPU clusters.

Transparency Note

This research was AI-assisted throughout. The experiment was designed, controlled, and directed by a human researcher. The analysis code, visualizations, and paper draft were generated by Claude Code with human editorial direction. The setup prompts and research brief in this repo document the full chain of direction.

Citation

@article{odonnell2026llmagents,
  title={Do LLM Agents Dream of Better Hyperparameters? A Comparative Study of Autonomous ML Research Across Frontier Models},
  author={O'Donnell, Matthew},
  year={2026},
  note={arXiv preprint (forthcoming)}
}

License

MIT
