
Autoresearch Pro-5: Model routing optimization findings (M18) #20

@realityinspector

Description

Summary

Dry-run autoresearch sweep across the model cluster (M18: model selection, temperature, top_p, max_tokens), using 30 iterations on board_meeting (seed=100) and 20 iterations per template across all 11 templates (seed=200). Total: 250 config evaluations (30 + 220) with synthetic metrics.

Best Model Per Template (all-templates sweep)

| Template | Best Model | Quality | Cost |
|---|---|---|---|
| agent4_elk_migration | deepseek/deepseek-chat | 0.9030 | $0.24 |
| board_meeting | deepseek/deepseek-chat | 0.8911 | $0.24 |
| castaway_colony_branching | deepseek/deepseek-chat | 0.9062 | $0.25 |
| detective_prospection | llama-3.1-8b-instruct | 0.8956 | $0.04 |
| hospital_crisis | mistral-large-latest | 0.8631 | $0.02 |
| hound_shadow_directorial | llama-3.1-70b-instruct | 0.8763 | $0.16 |
| jefferson_dinner | llama-3.1-70b-instruct | 0.8919 | $0.20 |
| kami_shrine | qwen-2.5-72b-instruct | 0.8830 | $0.21 |
| mars_mission_portal | llama-3.1-8b-instruct | 0.8899 | $0.08 |
| sec_investigation | mistral-large-latest | 0.8893 | $0.04 |
| vc_pitch_branching | llama-3.1-70b-instruct | 0.8970 | $0.22 |

Aggregate Model Stats (220 evaluations across 11 templates)

| Model | N | Avg Quality | Avg Cost | Avg Efficiency |
|---|---|---|---|---|
| mistral-large-latest | 11 | 0.7644 | $0.03 | 36.13 |
| llama-3.1-8b-instruct | 55 | 0.7517 | $0.05 | 22.92 |
| deepseek/deepseek-chat | 55 | 0.7538 | $0.16 | 5.90 |
| llama-3.1-70b-instruct | 77 | 0.7516 | $0.24 | 4.84 |
| qwen-2.5-72b-instruct | 22 | 0.7704 | $0.19 | 4.12 |
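The "Avg Efficiency" column appears to be quality-per-dollar averaged per evaluation rather than the ratio of the two column averages (the per-evaluation mean is always at least the ratio of means, which would explain why each efficiency figure exceeds its row's Avg Quality / Avg Cost). A minimal sketch of that assumed definition, with illustrative numbers rather than sweep data:

```python
# Illustrative evaluations (synthetic numbers, not from the sweep).
evals = [
    {"quality": 0.89, "cost": 0.02},
    {"quality": 0.62, "cost": 0.04},
]

avg_quality = sum(e["quality"] for e in evals) / len(evals)  # 0.755
avg_cost = sum(e["cost"] for e in evals) / len(evals)        # 0.03
# Assumed definition: efficiency computed per evaluation, then averaged.
avg_efficiency = sum(e["quality"] / e["cost"] for e in evals) / len(evals)
# avg_efficiency (30.0) exceeds avg_quality / avg_cost (~25.2), mirroring
# how the table's efficiency column exceeds the simple ratio of averages.
```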

Cost vs Quality Tradeoffs

Pareto frontier (all-templates, 6 optimal configs)

  1. mistral-large-latest (1015 tokens, temp=0.48): q=0.867, $0.01 — ultra-cheap tier
  2. llama-3.1-8b (1093 tokens, temp=0.78): q=0.878, $0.01 — best efficiency (87.78 eff)
  3. llama-3.1-8b (1093 tokens, temp=0.78): q=0.896, $0.04 — sweet spot
  4. llama-3.1-70b (3874 tokens, temp=0.43): q=0.897, $0.22 — diminishing returns begin
  5. deepseek-chat (6311 tokens, temp=0.58): q=0.903, $0.24 — peak quality
  6. deepseek-chat (6311 tokens, temp=0.58): q=0.906, $0.25 — marginal gain
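A non-dominated set like the one above can be recovered with a simple dominance filter. This is a generic sketch with illustrative points (four drawn from the frontier above plus one dominated config), not the harness's actual implementation:

```python
def pareto_frontier(configs):
    """Keep configs for which no other config is at least as cheap
    and strictly higher quality (assumed dominance rule)."""
    return sorted(
        (c for c in configs
         if not any(o["cost"] <= c["cost"] and o["quality"] > c["quality"]
                    for o in configs if o is not c)),
        key=lambda c: c["cost"],
    )

# Illustrative points; "dominated" is beaten by the cheaper, better llama-8b.
points = [
    {"name": "mistral-large", "cost": 0.01, "quality": 0.867},
    {"name": "llama-8b",      "cost": 0.04, "quality": 0.896},
    {"name": "llama-70b",     "cost": 0.22, "quality": 0.897},
    {"name": "deepseek",      "cost": 0.25, "quality": 0.906},
    {"name": "dominated",     "cost": 0.22, "quality": 0.850},
]
frontier = pareto_frontier(points)  # excludes "dominated"
```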

Key board_meeting insight (seed=100)

  • Best quality: llama-3.1-8b at temp=1.04, top_p=0.82 → q=0.9076, $0.018 (eff=49.3)
  • Best efficiency: llama-3.1-8b at temp=0.61, top_p=0.98 → q=0.650, $0.01 (eff=65.0)
  • The 8b model dominates the efficiency frontier — higher temperature (1.0+) compensates for smaller size

Recommendations

  1. Default routing: llama-3.1-8b-instruct with temp=0.78-1.04, max_tokens=1000-2500 for ~90% of the quality at 5-20x lower cost
  2. Quality-critical templates (vc_pitch_branching, jefferson_dinner): Route to llama-3.1-70b-instruct with temp=0.43, top_p=0.85
  3. Budget templates (hospital_crisis, sec_investigation): Route to mistral-large-latest with temp=0.33-0.48, max_tokens=1000-1850
  4. Peak quality (castaway_colony_branching, agent4_elk_migration): Route to deepseek-chat with temp=0.58, max_tokens=6000+
  5. qwen-2.5-72b is the worst cost/quality tradeoff — avoid unless template-specific testing shows advantages (kami_shrine is the only clear win)
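One way to operationalize these rules is a static routing table with the cheap 8b config as the fallback. The sketch below is hypothetical: the `route()` helper is not part of the harness, and the specific parameter values are picked from within the ranges recommended above:

```python
# Hypothetical routing table distilled from the recommendations above;
# parameter values are chosen from within the reported ranges.
ROUTES = {
    # Quality-critical templates -> 70b.
    "vc_pitch_branching": {"model": "llama-3.1-70b-instruct", "temperature": 0.43, "top_p": 0.85},
    "jefferson_dinner":   {"model": "llama-3.1-70b-instruct", "temperature": 0.43, "top_p": 0.85},
    # Budget templates -> mistral-large.
    "hospital_crisis":    {"model": "mistral-large-latest", "temperature": 0.40, "max_tokens": 1500},
    "sec_investigation":  {"model": "mistral-large-latest", "temperature": 0.40, "max_tokens": 1500},
    # Peak-quality templates -> deepseek.
    "castaway_colony_branching": {"model": "deepseek/deepseek-chat", "temperature": 0.58, "max_tokens": 6000},
    "agent4_elk_migration":      {"model": "deepseek/deepseek-chat", "temperature": 0.58, "max_tokens": 6000},
}
# Default: 8b gives ~90% of the quality at 5-20x lower cost.
DEFAULT_ROUTE = {"model": "llama-3.1-8b-instruct", "temperature": 0.90, "max_tokens": 2000}

def route(template: str) -> dict:
    """Return the sampling config for a template (hypothetical helper)."""
    return ROUTES.get(template, DEFAULT_ROUTE)
```

A static dict keeps routing auditable; swapping it for per-template learned policies would be a natural follow-up once non-synthetic metrics are available.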

Method

  • Harness: autoresearch.pro_autoresearch dry-run mode with synthetic metrics
  • Config space: 4 dimensions (model, temperature, top_p, max_tokens)
  • Seeds: 100 (board_meeting), 200 (all-templates)
  • Result files committed on branch autoresearch/pro/models
