Autoresearch Pro-7: Cross-template generalization findings #25

@realityinspector

Description

Summary

Pro-7 ran a full cross-template generalization sweep: 50 iterations × 11 templates (550 total configs), dry-run mode, seed 1100, 27-dimensional config space.

Branch: autoresearch/pro/generalize
Results: autoresearch/results/dry_run_20260316_083804.jsonl, autoresearch/results/pareto_20260316_083804.json

Global Pareto Frontier (9 configs)

| Run ID | Quality | Cost | Causal Resolution |
| --- | --- | --- | --- |
| dry_36149509754d | 0.6790 | $0.01 | 0.3968 |
| dry_623dac4cc243 | 0.7417 | $0.01 | 0.5113 |
| dry_ee6fb24eac5a | 0.8950 | $0.0148 | 0.7875 |
| dry_f06c229e5d91 | 0.8958 | $0.0604 | 0.7913 |
| dry_ff30f2622b09 | 0.9046 | $0.0635 | 0.8136 |
| dry_ec8d19ac2717 | 0.9067 | $0.0641 | 0.8138 |
| dry_fcf33023bb8c | 0.9096 | $0.0677 | 0.8277 |
| dry_f5c97eceaa3f | 0.9109 | $0.1610 | 0.8290 |
| dry_fcc07e5633c7 | 0.9169 | $0.2126 | 0.8543 |

Best quality: q=0.9169 (hospital_crisis template context)
Best efficiency: eff=74.17, i.e. quality/cost = 0.7417 / $0.01 (8B model)
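The frontier above can be recomputed from the JSONL results with a straightforward dominance check. A minimal sketch, assuming each record carries `run_id`, `quality`, and `cost` fields (the actual schema may differ):

```python
import json

def pareto_frontier(runs):
    """Keep runs not dominated on (higher quality, lower cost)."""
    frontier = []
    for r in runs:
        dominated = any(
            o["quality"] >= r["quality"] and o["cost"] <= r["cost"]
            and (o["quality"] > r["quality"] or o["cost"] < r["cost"])
            for o in runs
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r["cost"])

# Hypothetical usage against the results file:
# with open("autoresearch/results/dry_run_20260316_083804.jsonl") as f:
#     runs = [json.loads(line) for line in f]
# for r in pareto_frontier(runs):
#     print(r["run_id"], r["quality"], r["cost"])
```

The O(n²) scan is fine at 550 configs; a sort-then-sweep would be the next step at larger scales.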

Per-Template Performance

| Template | Best Q | Avg Q | Best CR |
| --- | --- | --- | --- |
| hospital_crisis | 0.9169 | 0.7751 | 0.8543 |
| mars_mission_portal | 0.9109 | 0.7489 | 0.8290 |
| hound_shadow_directorial | 0.9067 | 0.7507 | 0.8138 |
| board_meeting | 0.9058 | 0.7559 | 0.8150 |
| vc_pitch_branching | 0.9020 | 0.7582 | 0.8241 |
| detective_prospection | 0.9017 | 0.7657 | 0.8100 |
| kami_shrine | 0.9012 | 0.7487 | 0.8125 |
| agent4_elk_migration | 0.8964 | 0.7731 | 0.7974 |
| jefferson_dinner | 0.8958 | 0.7718 | 0.7952 |
| sec_investigation | 0.8917 | 0.7678 | 0.7949 |
| castaway_colony_branching | 0.8891 | 0.7587 | 0.7796 |

Quality ceiling is remarkably consistent across templates (0.889-0.917 range), suggesting the config space has strong universal optima.

Universal Optimizations (work across ALL templates)

1. Compression: PCA dominates

  • PCA chosen in 8/11 per-template bests (73%), SVD in 2, NMF in 1
  • PCA is the safe universal default for tensor compression
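The project's `compression_method=pca` hook is internal, but the underlying technique can be illustrated with scikit-learn. A minimal sketch, assuming the tensor being compressed can be viewed as a `(num_states, dim)` matrix (shapes and function names here are illustrative, not the project's API):

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_compress(matrix: np.ndarray, k: int = 8):
    """Project a (num_states, dim) matrix onto its top-k principal components."""
    pca = PCA(n_components=k)
    reduced = pca.fit_transform(matrix)   # shape (num_states, k)
    return reduced, pca

def pca_reconstruct(reduced: np.ndarray, pca: PCA) -> np.ndarray:
    """Lossy reconstruction back to the original dimensionality."""
    return pca.inverse_transform(reduced)

X = np.random.default_rng(1100).normal(size=(64, 128))
Z, model = pca_compress(X, k=8)
X_hat = pca_reconstruct(Z, model)
print(Z.shape)  # compressed representation
```

SVD and NMF (the two runners-up) slot into the same fit/transform interface in scikit-learn, which is presumably why they are easy to swap in the config space.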

2. Model: 70B-class models win on quality

  • Llama 3.1 70B: 5/11 templates, Qwen 2.5 72B: 3/11
  • 70B+ models account for 8/11 template bests
  • However, 8B achieves the best cost-efficiency on the Pareto frontier ($0.01 runs)

3. High frequency penalty is universal

  • Mean frequency_penalty across bests: 0.665 (range 0.025-0.907)
  • 9/11 template bests use freq_penalty > 0.5
  • This is the strongest universal signal: higher frequency penalty improves quality across all templates
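In OpenAI-style APIs both penalties must lie in [-2.0, 2.0]. A minimal sketch of how the finding maps onto request parameters (the builder function and its defaults are illustrative, not the project's actual request path):

```python
def sampling_params(frequency_penalty: float = 0.70,
                    presence_penalty: float = 0.45) -> dict:
    """Build OpenAI-style sampling parameters per the sweep findings."""
    if not (-2.0 <= frequency_penalty <= 2.0 and -2.0 <= presence_penalty <= 2.0):
        raise ValueError("OpenAI-style penalties must lie in [-2.0, 2.0]")
    return {
        "frequency_penalty": frequency_penalty,  # >0.5 in 9/11 template bests
        "presence_penalty": presence_penalty,    # bests clustered in 0.3-0.8
    }

print(sampling_params())
```

The dict can be splatted straight into a chat-completions call (`client.chat.completions.create(..., **sampling_params())`) on any OpenAI-compatible endpoint.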

4. Moderate presence penalty

  • Mean presence_penalty: 0.492 (range 0.256-0.813)
  • Less extreme than frequency penalty but consistently in the 0.3-0.8 range

Template-Specific Optimizations

5. Temporal mode is template-dependent (NOT universal)

  • directorial: 4/11 templates (hound_shadow, jefferson_dinner, kami_shrine, castaway_colony)
  • portal: 4/11 templates (hospital_crisis, mars_mission, vc_pitch, board_meeting)
  • forward: 3/11 templates (detective_prospection, sec_investigation, agent4_elk)
  • No single temporal mode dominates. This is the primary template-specific dimension.

6. Temperature is bimodal

  • Some templates prefer low temp (0.15-0.37), others high (0.80-0.98)
  • Mean 0.580 with high std (0.323); NOT a universal setting
  • Suggests narrative-heavy templates prefer higher temperature, analytical templates prefer lower

7. Max tokens varies widely

  • Range: 1389-7869, mean 3807
  • Token budget should be tuned per template complexity

Recommended Universal Config

Based on cross-template analysis, these settings generalize well:

```
autoresearch.compression_method=pca
llm_service.defaults.model=meta-llama/llama-3.1-70b-instruct
llm_service.defaults.frequency_penalty=0.70
llm_service.defaults.presence_penalty=0.45
```

Template-specific tuning should focus on:

  • temporal_mode.active_mode (directorial vs portal vs forward)
  • llm_service.defaults.temperature (0.2 vs 0.8 depending on template type)
  • llm_service.defaults.max_tokens (scale with template complexity)
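One way the preset idea from Next Steps could look: universal defaults layered under per-template overrides. A minimal sketch; the key strings mirror the override names above, but the dict structure and the two example entries (temporal modes from the per-template bests, temperatures from the bimodal split) are assumptions:

```python
UNIVERSAL = {
    "autoresearch.compression_method": "pca",
    "llm_service.defaults.model": "meta-llama/llama-3.1-70b-instruct",
    "llm_service.defaults.frequency_penalty": 0.70,
    "llm_service.defaults.presence_penalty": 0.45,
}

TEMPLATE_OVERRIDES = {
    # narrative-heavy: portal temporal mode, high temperature
    "hospital_crisis": {"temporal_mode.active_mode": "portal",
                        "llm_service.defaults.temperature": 0.8},
    # analytical: forward temporal mode, low temperature
    "sec_investigation": {"temporal_mode.active_mode": "forward",
                          "llm_service.defaults.temperature": 0.2},
}

def preset(template: str) -> dict:
    """Merge universal defaults with any template-specific overrides."""
    return {**UNIVERSAL, **TEMPLATE_OVERRIDES.get(template, {})}

print(preset("sec_investigation")["temporal_mode.active_mode"])
```

Templates without an entry simply fall back to the universal config, which keeps the preset table small as new templates are added.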

Next Steps

  • Run live (non-dry-run) validation of top 3 Pareto configs
  • Test if temporal mode correlates with template narrative structure (branching templates prefer portal/directorial?)
  • Verify frequency_penalty finding holds with real API calls
  • Create per-template config presets based on these findings

Metadata

Labels: autoresearch (Autoresearch optimization loop), finding (Autoresearch discovery)