Labels: `autoresearch` (Autoresearch optimization loop), `finding` (Autoresearch discovery)
## Summary
Dry-run autoresearch sweep across the model cluster (M18: model selection, temperature, top_p, max_tokens) using 30 iterations on board_meeting (seed=100) and 20 iterations across all 11 templates (seed=200). Total: 250 config evaluations with synthetic metrics.
## Best Model Per Template (all-templates sweep)
| Template | Best Model | Quality | Cost |
|---|---|---|---|
| agent4_elk_migration | deepseek/deepseek-chat | 0.9030 | $0.24 |
| board_meeting | deepseek/deepseek-chat | 0.8911 | $0.24 |
| castaway_colony_branching | deepseek/deepseek-chat | 0.9062 | $0.25 |
| detective_prospection | llama-3.1-8b-instruct | 0.8956 | $0.04 |
| hospital_crisis | mistral-large-latest | 0.8631 | $0.02 |
| hound_shadow_directorial | llama-3.1-70b-instruct | 0.8763 | $0.16 |
| jefferson_dinner | llama-3.1-70b-instruct | 0.8919 | $0.20 |
| kami_shrine | qwen-2.5-72b-instruct | 0.8830 | $0.21 |
| mars_mission_portal | llama-3.1-8b-instruct | 0.8899 | $0.08 |
| sec_investigation | mistral-large-latest | 0.8893 | $0.04 |
| vc_pitch_branching | llama-3.1-70b-instruct | 0.8970 | $0.22 |
## Aggregate Model Stats (220 evaluations across 11 templates)
| Model | N | Avg Quality | Avg Cost | Avg Efficiency |
|---|---|---|---|---|
| mistral-large-latest | 11 | 0.7644 | $0.03 | 36.13 |
| llama-3.1-8b-instruct | 55 | 0.7517 | $0.05 | 22.92 |
| deepseek/deepseek-chat | 55 | 0.7538 | $0.16 | 5.90 |
| llama-3.1-70b-instruct | 77 | 0.7516 | $0.24 | 4.84 |
| qwen-2.5-72b-instruct | 22 | 0.7704 | $0.19 | 4.12 |
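Note that the Avg Efficiency column does not equal Avg Quality divided by Avg Cost (e.g. 0.7644 / 0.03 ≈ 25, not 36.13), which suggests efficiency is computed per evaluation and then averaged. A minimal sketch of that assumed definition (the function name and sample values are illustrative, not from the harness):

```python
def mean_efficiency(evals):
    """Mean of per-evaluation quality/cost ratios.

    Assumption: this generally differs from avg(quality) / avg(cost),
    which would explain why 0.7644 / 0.03 != 36.13 in the table above.
    """
    return sum(q / c for q, c in evals) / len(evals)

# Hypothetical evaluations as (quality, cost) pairs.
evals = [(0.86, 0.01), (0.70, 0.05), (0.75, 0.02)]
print(round(mean_efficiency(evals), 2))  # → 45.83
```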
## Cost vs Quality Tradeoffs
### Pareto frontier (all-templates, 6 optimal configs)
- mistral-large-latest (1015 tokens, temp=0.48): q=0.867, $0.01 — ultra-cheap tier
- llama-3.1-8b (1093 tokens, temp=0.78): q=0.878, $0.01 — best efficiency (87.78 eff)
- llama-3.1-8b (1093 tokens, temp=0.78): q=0.896, $0.04 — sweet spot
- llama-3.1-70b (3874 tokens, temp=0.43): q=0.897, $0.22 — diminishing returns begin
- deepseek-chat (6311 tokens, temp=0.58): q=0.903, $0.24 — peak quality
- deepseek-chat (6311 tokens, temp=0.58): q=0.906, $0.25 — marginal gain
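The frontier above keeps only configs that no cheaper config matches in quality. A sketch of that selection using strict dominance (the sample points loosely echo the list above; exact names and values are illustrative):

```python
def pareto_frontier(configs):
    """Keep only (cost, quality, name) points not strictly dominated:
    a config is dropped if another is no more expensive and strictly
    higher quality."""
    frontier = []
    # Cheapest first; cost ties broken by highest quality so the
    # dominated duplicate-cost configs are skipped by the sweep below.
    for cost, quality, name in sorted(configs, key=lambda c: (c[0], -c[1])):
        if not frontier or quality > frontier[-1][1]:
            frontier.append((cost, quality, name))
    return frontier

# Illustrative points in the spirit of the table above.
configs = [
    (0.01, 0.878, "llama-8b"),
    (0.04, 0.896, "llama-8b"),
    (0.22, 0.897, "llama-70b"),
    (0.24, 0.903, "deepseek"),
    (0.25, 0.906, "deepseek"),
    (0.19, 0.870, "qwen-72b"),  # pricier and worse than llama-8b: dropped
]
front = pareto_frontier(configs)  # 5 points; qwen-72b is dominated
```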
## Key board_meeting insight (seed=100)
- Best quality: llama-3.1-8b at temp=1.04, top_p=0.82 → q=0.9076, $0.018 (eff=49.3)
- Best efficiency: llama-3.1-8b at temp=0.61, top_p=0.98 → q=0.650, $0.01 (eff=65.0)
- The 8b model dominates the efficiency frontier — higher temperature (1.0+) compensates for smaller size
## Recommendations
- Default routing: `llama-3.1-8b-instruct` with temp=0.78-1.04, max_tokens=1000-2500 for ~90% of the quality at 5-20x lower cost
- Quality-critical templates (vc_pitch_branching, jefferson_dinner): route to `llama-3.1-70b-instruct` with temp=0.43, top_p=0.85
- Budget templates (hospital_crisis, sec_investigation): route to `mistral-large-latest` with temp=0.33-0.48, max_tokens=1000-1850
- Peak quality (castaway_colony_branching, agent4_elk_migration): route to `deepseek-chat` with temp=0.58, max_tokens=6000+
- `qwen-2.5-72b-instruct` has the worst cost/quality tradeoff; avoid unless template-specific testing shows an advantage (kami_shrine is the only clear win)
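The routing rules above can be sketched as a lookup table with a cheap default. This is an assumption about how routing might be wired up, not the harness's actual API; the dictionary layout and the specific parameter values picked from the stated ranges are illustrative:

```python
# Hypothetical routing table built from the recommendations above.
# Template names and parameter ranges come from this report; the exact
# values chosen within those ranges are illustrative.
ROUTES = {
    "vc_pitch_branching":       ("llama-3.1-70b-instruct", {"temperature": 0.43, "top_p": 0.85}),
    "jefferson_dinner":         ("llama-3.1-70b-instruct", {"temperature": 0.43, "top_p": 0.85}),
    "hospital_crisis":          ("mistral-large-latest",   {"temperature": 0.40, "max_tokens": 1500}),
    "sec_investigation":        ("mistral-large-latest",   {"temperature": 0.40, "max_tokens": 1500}),
    "castaway_colony_branching": ("deepseek-chat",         {"temperature": 0.58, "max_tokens": 6000}),
    "agent4_elk_migration":     ("deepseek-chat",          {"temperature": 0.58, "max_tokens": 6000}),
}
DEFAULT = ("llama-3.1-8b-instruct", {"temperature": 0.9, "max_tokens": 2000})

def route(template: str):
    """Return (model, params) for a template, falling back to the cheap default."""
    return ROUTES.get(template, DEFAULT)

model, params = route("board_meeting")  # falls through to the 8b default
```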
## Method
- Harness: `autoresearch.pro_autoresearch` dry-run mode with synthetic metrics
- Config space: 4 dimensions (model, temperature, top_p, max_tokens)
- Seeds: 100 (board_meeting), 200 (all-templates)
- Result files committed on branch `autoresearch/pro/models`
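As a rough reconstruction of the setup: the real entry point is `autoresearch.pro_autoresearch`, whose API is not shown here, so this sketch only illustrates seeded sampling of the 4-dimensional config space. The value ranges are guesses inferred from the numbers quoted above, not the harness's actual search space:

```python
import random

# Models actually evaluated in this sweep (from the tables above).
MODELS = [
    "llama-3.1-8b-instruct", "llama-3.1-70b-instruct",
    "deepseek/deepseek-chat", "mistral-large-latest", "qwen-2.5-72b-instruct",
]

def sample_config(rng):
    """Draw one point from the 4-dimensional config space.

    Ranges are assumptions bracketing the values reported above.
    """
    return {
        "model": rng.choice(MODELS),
        "temperature": round(rng.uniform(0.3, 1.1), 2),
        "top_p": round(rng.uniform(0.8, 1.0), 2),
        "max_tokens": rng.choice([1000, 1850, 2500, 3874, 6311]),
    }

rng = random.Random(200)  # seed=200 for the all-templates sweep
configs = [sample_config(rng) for _ in range(20)]  # 20 iterations per template
```

Seeding the RNG is what makes a dry-run sweep reproducible across the two reported runs (seed=100 vs seed=200).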