Labels: `autoresearch` (Autoresearch optimization loop), `finding` (Autoresearch discovery)
## Summary
Dry-run autoresearch sweep across the model cluster (M18: model selection, temperature, top_p, max_tokens) using 30 iterations on board_meeting (seed=100) and 20 iterations across all 11 templates (seed=200). Total: 250 config evaluations with synthetic metrics.
## Best Model Per Template (all-templates sweep)
| Template | Best Model | Quality | Cost |
|---|---|---|---|
| agent4_elk_migration | deepseek/deepseek-chat | 0.9030 | $0.24 |
| board_meeting | deepseek/deepseek-chat | 0.8911 | $0.24 |
| castaway_colony_branching | deepseek/deepseek-chat | 0.9062 | $0.25 |
| detective_prospection | llama-3.1-8b-instruct | 0.8956 | $0.04 |
| hospital_crisis | mistral-large-latest | 0.8631 | $0.02 |
| hound_shadow_directorial | llama-3.1-70b-instruct | 0.8763 | $0.16 |
| jefferson_dinner | llama-3.1-70b-instruct | 0.8919 | $0.20 |
| kami_shrine | qwen-2.5-72b-instruct | 0.8830 | $0.21 |
| mars_mission_portal | llama-3.1-8b-instruct | 0.8899 | $0.08 |
| sec_investigation | mistral-large-latest | 0.8893 | $0.04 |
| vc_pitch_branching | llama-3.1-70b-instruct | 0.8970 | $0.22 |
## Aggregate Model Stats (220 evaluations across 11 templates)
| Model | N | Avg Quality | Avg Cost | Avg Efficiency |
|---|---|---|---|---|
| mistral-large-latest | 11 | 0.7644 | $0.03 | 36.13 |
| llama-3.1-8b-instruct | 55 | 0.7517 | $0.05 | 22.92 |
| deepseek/deepseek-chat | 55 | 0.7538 | $0.16 | 5.90 |
| llama-3.1-70b-instruct | 77 | 0.7516 | $0.24 | 4.84 |
| qwen-2.5-72b-instruct | 22 | 0.7704 | $0.19 | 4.12 |
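Note that the Avg Efficiency column does not equal Avg Quality divided by Avg Cost (e.g. 0.7644 / 0.03 ≈ 25, not 36.13), which suggests efficiency is computed per evaluation and then averaged. A minimal sketch of that assumed definition (the function name and sample values are illustrative, not from the harness):

```python
def mean_efficiency(evals):
    """Mean of per-evaluation quality/cost ratios.

    Assumption: this generally differs from avg(quality) / avg(cost),
    which would explain why 0.7644 / 0.03 != 36.13 in the table above.
    """
    return sum(q / c for q, c in evals) / len(evals)

# Hypothetical evaluations as (quality, cost) pairs.
evals = [(0.86, 0.01), (0.70, 0.05), (0.75, 0.02)]
print(round(mean_efficiency(evals), 2))  # → 45.83
```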
## Cost vs Quality Tradeoffs
### Pareto frontier (all-templates, 6 optimal configs)
- mistral-large-latest (1015 tokens, temp=0.48): q=0.867, $0.01 — ultra-cheap tier
- llama-3.1-8b (1093 tokens, temp=0.78): q=0.878, $0.01 — best efficiency (87.78 eff)
- llama-3.1-8b (1093 tokens, temp=0.78): q=0.896, $0.04 — sweet spot
- llama-3.1-70b (3874 tokens, temp=0.43): q=0.897, $0.22 — diminishing returns begin
- deepseek-chat (6311 tokens, temp=0.58): q=0.903, $0.24 — peak quality
- deepseek-chat (6311 tokens, temp=0.58): q=0.906, $0.25 — marginal gain
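The frontier above keeps only configs that no cheaper config matches in quality. A sketch of that selection using strict dominance (the sample points loosely echo the list above; exact names and values are illustrative):

```python
def pareto_frontier(configs):
    """Keep only (cost, quality, name) points not strictly dominated:
    a config is dropped if another is no more expensive and strictly
    higher quality."""
    frontier = []
    # Cheapest first; cost ties broken by highest quality so the
    # dominated duplicate-cost configs are skipped by the sweep below.
    for cost, quality, name in sorted(configs, key=lambda c: (c[0], -c[1])):
        if not frontier or quality > frontier[-1][1]:
            frontier.append((cost, quality, name))
    return frontier

# Illustrative points in the spirit of the table above.
configs = [
    (0.01, 0.878, "llama-8b"),
    (0.04, 0.896, "llama-8b"),
    (0.22, 0.897, "llama-70b"),
    (0.24, 0.903, "deepseek"),
    (0.25, 0.906, "deepseek"),
    (0.19, 0.870, "qwen-72b"),  # pricier and worse than llama-8b: dropped
]
front = pareto_frontier(configs)  # 5 points; qwen-72b is dominated
```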
## Key board_meeting insight (seed=100)
- Best quality: llama-3.1-8b at temp=1.04, top_p=0.82 → q=0.9076, $0.018 (eff=49.3)
- Best efficiency: llama-3.1-8b at temp=0.61, top_p=0.98 → q=0.650, $0.01 (eff=65.0)
- The 8b model dominates the efficiency frontier — higher temperature (1.0+) compensates for smaller size
## Recommendations
- Default routing: `llama-3.1-8b-instruct` with temp=0.78-1.04, max_tokens=1000-2500 for ~90% of the quality at 5-20x lower cost
- Quality-critical templates (vc_pitch_branching, jefferson_dinner): route to `llama-3.1-70b-instruct` with temp=0.43, top_p=0.85
- Budget templates (hospital_crisis, sec_investigation): route to `mistral-large-latest` with temp=0.33-0.48, max_tokens=1000-1850
- Peak quality (castaway_colony_branching, agent4_elk_migration): route to `deepseek-chat` with temp=0.58, max_tokens=6000+
- `qwen-2.5-72b-instruct` has the worst cost/quality tradeoff; avoid unless template-specific testing shows an advantage (kami_shrine is the only clear win)
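The routing rules above can be sketched as a lookup table with a cheap default. This is an assumption about how routing might be wired up, not the harness's actual API; the dictionary layout and the specific parameter values picked from the stated ranges are illustrative:

```python
# Hypothetical routing table built from the recommendations above.
# Template names and parameter ranges come from this report; the exact
# values chosen within those ranges are illustrative.
ROUTES = {
    "vc_pitch_branching":       ("llama-3.1-70b-instruct", {"temperature": 0.43, "top_p": 0.85}),
    "jefferson_dinner":         ("llama-3.1-70b-instruct", {"temperature": 0.43, "top_p": 0.85}),
    "hospital_crisis":          ("mistral-large-latest",   {"temperature": 0.40, "max_tokens": 1500}),
    "sec_investigation":        ("mistral-large-latest",   {"temperature": 0.40, "max_tokens": 1500}),
    "castaway_colony_branching": ("deepseek-chat",         {"temperature": 0.58, "max_tokens": 6000}),
    "agent4_elk_migration":     ("deepseek-chat",          {"temperature": 0.58, "max_tokens": 6000}),
}
DEFAULT = ("llama-3.1-8b-instruct", {"temperature": 0.9, "max_tokens": 2000})

def route(template: str):
    """Return (model, params) for a template, falling back to the cheap default."""
    return ROUTES.get(template, DEFAULT)

model, params = route("board_meeting")  # falls through to the 8b default
```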
## Method
- Harness: `autoresearch.pro_autoresearch` dry-run mode with synthetic metrics
- Config space: 4 dimensions (model, temperature, top_p, max_tokens)
- Seeds: 100 (board_meeting), 200 (all-templates)
- Result files committed on branch `autoresearch/pro/models`
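As a rough reconstruction of the setup: the real entry point is `autoresearch.pro_autoresearch`, whose API is not shown here, so this sketch only illustrates seeded sampling of the 4-dimensional config space. The value ranges are guesses inferred from the numbers quoted above, not the harness's actual search space:

```python
import random

# Models actually evaluated in this sweep (from the tables above).
MODELS = [
    "llama-3.1-8b-instruct", "llama-3.1-70b-instruct",
    "deepseek/deepseek-chat", "mistral-large-latest", "qwen-2.5-72b-instruct",
]

def sample_config(rng):
    """Draw one point from the 4-dimensional config space.

    Ranges are assumptions bracketing the values reported above.
    """
    return {
        "model": rng.choice(MODELS),
        "temperature": round(rng.uniform(0.3, 1.1), 2),
        "top_p": round(rng.uniform(0.8, 1.0), 2),
        "max_tokens": rng.choice([1000, 1850, 2500, 3874, 6311]),
    }

rng = random.Random(200)  # seed=200 for the all-templates sweep
configs = [sample_config(rng) for _ in range(20)]  # 20 iterations per template
```

Seeding the RNG is what makes a dry-run sweep reproducible across the two reported runs (seed=100 vs seed=200).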