Labels: autoresearch (Autoresearch optimization loop), finding (Autoresearch discovery)
Summary
Pro-7 ran a full cross-template generalization sweep: 50 iterations × 11 templates (550 configs in total), in dry-run mode with seed 1100, over a 27-dimensional config space.
Branch: autoresearch/pro/generalize
Results: autoresearch/results/dry_run_20260316_083804.jsonl, autoresearch/results/pareto_20260316_083804.json
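The frontier below can be recomputed from the results JSONL with a standard dominance check on (quality, cost). A minimal sketch, assuming each record carries `run_id`, `quality`, and `cost` fields (the field names are guesses, not confirmed against the file):

```python
def pareto_frontier(runs):
    """Keep runs not dominated on (quality: maximize, cost: minimize).

    A run is dominated if some other run has quality >= and cost <=,
    with at least one strict inequality.
    """
    frontier = []
    for r in runs:
        dominated = any(
            o["quality"] >= r["quality"] and o["cost"] <= r["cost"]
            and (o["quality"] > r["quality"] or o["cost"] < r["cost"])
            for o in runs
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r["cost"])

# Toy data, not from the sweep: run "c" is dominated by "b".
runs = [
    {"run_id": "a", "quality": 0.68, "cost": 0.01},
    {"run_id": "b", "quality": 0.90, "cost": 0.06},
    {"run_id": "c", "quality": 0.85, "cost": 0.20},
]
print([r["run_id"] for r in pareto_frontier(runs)])  # ['a', 'b']
```

With two objectives the O(n²) pairwise check is more than fast enough for 550 configs.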
Global Pareto Frontier (9 configs)
| Run ID | Quality | Cost | Causal Resolution |
|---|---|---|---|
| dry_36149509754d | 0.6790 | $0.01 | 0.3968 |
| dry_623dac4cc243 | 0.7417 | $0.01 | 0.5113 |
| dry_ee6fb24eac5a | 0.8950 | $0.0148 | 0.7875 |
| dry_f06c229e5d91 | 0.8958 | $0.0604 | 0.7913 |
| dry_ff30f2622b09 | 0.9046 | $0.0635 | 0.8136 |
| dry_ec8d19ac2717 | 0.9067 | $0.0641 | 0.8138 |
| dry_fcf33023bb8c | 0.9096 | $0.0677 | 0.8277 |
| dry_f5c97eceaa3f | 0.9109 | $0.1610 | 0.8290 |
| dry_fcc07e5633c7 | 0.9169 | $0.2126 | 0.8543 |
Best quality: q=0.9169 (hospital_crisis template)
Best efficiency: eff=74.17 (8B-class model at $0.01)
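The efficiency figure is consistent with quality divided by dollar cost (0.7417 / 0.01 ≈ 74.17, matching the dry_623dac4cc243 frontier row). A one-line sketch, assuming that definition:

```python
def efficiency(quality, cost_usd):
    # Assumed definition: quality points per dollar spent.
    return quality / cost_usd

# Matches the reported best-efficiency figure for the $0.01 run.
print(round(efficiency(0.7417, 0.01), 2))  # 74.17
```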
Per-Template Performance
| Template | Best Q | Avg Q | Best CR |
|---|---|---|---|
| hospital_crisis | 0.9169 | 0.7751 | 0.8543 |
| mars_mission_portal | 0.9109 | 0.7489 | 0.8290 |
| hound_shadow_directorial | 0.9067 | 0.7507 | 0.8138 |
| board_meeting | 0.9058 | 0.7559 | 0.8150 |
| vc_pitch_branching | 0.9020 | 0.7582 | 0.8241 |
| detective_prospection | 0.9017 | 0.7657 | 0.8100 |
| kami_shrine | 0.9012 | 0.7487 | 0.8125 |
| agent4_elk_migration | 0.8964 | 0.7731 | 0.7974 |
| jefferson_dinner | 0.8958 | 0.7718 | 0.7952 |
| sec_investigation | 0.8917 | 0.7678 | 0.7949 |
| castaway_colony_branching | 0.8891 | 0.7587 | 0.7796 |
Quality ceiling is remarkably consistent across templates (0.889-0.917 range), suggesting the config space has strong universal optima.
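That consistency can be checked directly from the best-quality column (values copied from the table above):

```python
# Per-template best quality, copied from the table above.
best_q = {
    "hospital_crisis": 0.9169, "mars_mission_portal": 0.9109,
    "hound_shadow_directorial": 0.9067, "board_meeting": 0.9058,
    "vc_pitch_branching": 0.9020, "detective_prospection": 0.9017,
    "kami_shrine": 0.9012, "agent4_elk_migration": 0.8964,
    "jefferson_dinner": 0.8958, "sec_investigation": 0.8917,
    "castaway_colony_branching": 0.8891,
}
spread = max(best_q.values()) - min(best_q.values())
print(f"spread = {spread:.4f}")  # spread = 0.0278
```

A spread under 0.03 in best quality across 11 templates is what supports the "strong universal optima" reading.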
Universal Optimizations (work across ALL templates)
1. Compression: PCA dominates
- PCA chosen in 8/11 per-template bests (73%), SVD in 2, NMF in 1
- PCA is the safe universal default for tensor compression
2. Model: 70B-class models win on quality
- Llama 3.1 70B: 5/11 templates, Qwen 2.5 72B: 3/11
- 70B+ models account for 8/11 template bests
- However, 8B achieves the best cost-efficiency on the Pareto frontier ($0.01 runs)
3. High frequency penalty is universal
- Mean frequency_penalty across bests: 0.665 (range 0.025-0.907)
- 9/11 template bests use freq_penalty > 0.5
- This is the strongest universal signal: higher frequency penalty improves quality across all templates
4. Moderate presence penalty
- Mean presence_penalty: 0.492 (range 0.256-0.813)
- Less extreme than frequency penalty but consistently in the 0.3-0.8 range
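Both penalties correspond to the standard `frequency_penalty` / `presence_penalty` parameters of OpenAI-style chat APIs, which accept values in [-2.0, 2.0]. A small sketch of building a request payload with the universal values; the payload shape and clamp helper are illustrative, not the project's actual request code:

```python
def clamp_penalty(value, lo=-2.0, hi=2.0):
    """Clamp a penalty to the range accepted by OpenAI-style APIs."""
    return max(lo, min(hi, value))

payload = {
    "model": "meta-llama/llama-3.1-70b-instruct",
    "frequency_penalty": clamp_penalty(0.70),  # 9/11 template bests used > 0.5
    "presence_penalty": clamp_penalty(0.45),   # within the observed 0.3-0.8 band
}
print(payload["frequency_penalty"], payload["presence_penalty"])  # 0.7 0.45
```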
Template-Specific Optimizations
5. Temporal mode is template-dependent (NOT universal)
- directorial: 4/11 templates (hound_shadow, jefferson_dinner, kami_shrine, castaway_colony)
- portal: 4/11 templates (hospital_crisis, mars_mission, vc_pitch, board_meeting)
- forward: 3/11 templates (detective_prospection, sec_investigation, agent4_elk)
- No single temporal mode dominates. This is the primary template-specific dimension.
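The assignments above can be captured as a preset map (template-to-mode pairs copied from the lists above; the dict name is an illustration, not a project API):

```python
from collections import Counter

# Temporal-mode presets derived from the per-template bests listed above.
TEMPORAL_MODE_PRESETS = {
    "hound_shadow_directorial": "directorial",
    "jefferson_dinner": "directorial",
    "kami_shrine": "directorial",
    "castaway_colony_branching": "directorial",
    "hospital_crisis": "portal",
    "mars_mission_portal": "portal",
    "vc_pitch_branching": "portal",
    "board_meeting": "portal",
    "detective_prospection": "forward",
    "sec_investigation": "forward",
    "agent4_elk_migration": "forward",
}
print(Counter(TEMPORAL_MODE_PRESETS.values()))
# Counter({'directorial': 4, 'portal': 4, 'forward': 3})
```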
6. Temperature is bimodal
- Some templates prefer low temp (0.15-0.37), others high (0.80-0.98)
- Mean 0.580 with a high standard deviation (0.323); not a universal setting
- Suggests narrative-heavy templates prefer higher temperature, analytical templates prefer lower
7. Max tokens varies widely
- Range: 1389-7869, mean 3807
- Token budget should be tuned per template complexity
Recommended Universal Config
Based on cross-template analysis, these settings generalize well:
```
autoresearch.compression_method=pca
llm_service.defaults.model=meta-llama/llama-3.1-70b-instruct
llm_service.defaults.frequency_penalty=0.70
llm_service.defaults.presence_penalty=0.45
```
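If these settings are dotted-key overrides into a nested config (as the `a.b.c=value` syntax suggests), applying them could look like the sketch below; the merge helper is an assumption, not the project's actual loader:

```python
def apply_overrides(config, overrides):
    """Set dotted-key overrides into a nested config dict (assumed semantics)."""
    for dotted, value in overrides.items():
        node = config
        *parents, leaf = dotted.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return config

universal = {
    "autoresearch.compression_method": "pca",
    "llm_service.defaults.model": "meta-llama/llama-3.1-70b-instruct",
    "llm_service.defaults.frequency_penalty": 0.70,
    "llm_service.defaults.presence_penalty": 0.45,
}
cfg = apply_overrides({}, universal)
print(cfg["llm_service"]["defaults"]["frequency_penalty"])  # 0.7
```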
Template-specific tuning should focus on:
- `temporal_mode.active_mode` (directorial vs portal vs forward)
- `llm_service.defaults.temperature` (0.2 vs 0.8 depending on template type)
- `llm_service.defaults.max_tokens` (scale with template complexity)
Next Steps
- Run live (non-dry-run) validation of top 3 Pareto configs
- Test if temporal mode correlates with template narrative structure (branching templates prefer portal/directorial?)
- Verify frequency_penalty finding holds with real API calls
- Create per-template config presets based on these findings