Autoresearch Pro-7: Cross-template generalization findings #25

@realityinspector

Description

Summary

Pro-7 ran a full cross-template generalization sweep: 50 iterations × 11 templates (550 total configs), dry-run mode, seed 1100, 27-dimensional config space.

Branch: autoresearch/pro/generalize
Results: autoresearch/results/dry_run_20260316_083804.jsonl, autoresearch/results/pareto_20260316_083804.json

Global Pareto Frontier (9 configs)

| Run ID | Quality | Cost | Causal Resolution |
| --- | --- | --- | --- |
| dry_36149509754d | 0.6790 | $0.01 | 0.3968 |
| dry_623dac4cc243 | 0.7417 | $0.01 | 0.5113 |
| dry_ee6fb24eac5a | 0.8950 | $0.0148 | 0.7875 |
| dry_f06c229e5d91 | 0.8958 | $0.0604 | 0.7913 |
| dry_ff30f2622b09 | 0.9046 | $0.0635 | 0.8136 |
| dry_ec8d19ac2717 | 0.9067 | $0.0641 | 0.8138 |
| dry_fcf33023bb8c | 0.9096 | $0.0677 | 0.8277 |
| dry_f5c97eceaa3f | 0.9109 | $0.1610 | 0.8290 |
| dry_fcc07e5633c7 | 0.9169 | $0.2126 | 0.8543 |

Best quality: q=0.9169 (hospital_crisis template context)
Best efficiency: eff=74.17, i.e. quality/cost = 0.7417 / $0.01 (8B model)
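The frontier above can be recomputed from the JSONL results with a straightforward dominance check. A minimal sketch, assuming each record carries `run_id`, `quality`, and `cost` fields (the actual schema may differ):

```python
import json

def pareto_frontier(runs):
    """Keep runs not dominated on (higher quality, lower cost)."""
    frontier = []
    for r in runs:
        dominated = any(
            o["quality"] >= r["quality"] and o["cost"] <= r["cost"]
            and (o["quality"] > r["quality"] or o["cost"] < r["cost"])
            for o in runs
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r["cost"])

# Hypothetical usage against the results file:
# with open("autoresearch/results/dry_run_20260316_083804.jsonl") as f:
#     runs = [json.loads(line) for line in f]
# for r in pareto_frontier(runs):
#     print(r["run_id"], r["quality"], r["cost"])
```

The O(n²) scan is fine at 550 configs; a sort-then-sweep would be the next step at larger scales.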

Per-Template Performance

| Template | Best Q | Avg Q | Best CR |
| --- | --- | --- | --- |
| hospital_crisis | 0.9169 | 0.7751 | 0.8543 |
| mars_mission_portal | 0.9109 | 0.7489 | 0.8290 |
| hound_shadow_directorial | 0.9067 | 0.7507 | 0.8138 |
| board_meeting | 0.9058 | 0.7559 | 0.8150 |
| vc_pitch_branching | 0.9020 | 0.7582 | 0.8241 |
| detective_prospection | 0.9017 | 0.7657 | 0.8100 |
| kami_shrine | 0.9012 | 0.7487 | 0.8125 |
| agent4_elk_migration | 0.8964 | 0.7731 | 0.7974 |
| jefferson_dinner | 0.8958 | 0.7718 | 0.7952 |
| sec_investigation | 0.8917 | 0.7678 | 0.7949 |
| castaway_colony_branching | 0.8891 | 0.7587 | 0.7796 |

Quality ceiling is remarkably consistent across templates (0.889-0.917 range), suggesting the config space has strong universal optima.

Universal Optimizations (work across ALL templates)

1. Compression: PCA dominates

  • PCA chosen in 8/11 per-template bests (73%), SVD in 2, NMF in 1
  • PCA is the safe universal default for tensor compression
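The project's `compression_method=pca` hook is internal, but the underlying technique can be illustrated with scikit-learn. A minimal sketch, assuming the tensor being compressed can be viewed as a `(num_states, dim)` matrix (shapes and function names here are illustrative, not the project's API):

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_compress(matrix: np.ndarray, k: int = 8):
    """Project a (num_states, dim) matrix onto its top-k principal components."""
    pca = PCA(n_components=k)
    reduced = pca.fit_transform(matrix)   # shape (num_states, k)
    return reduced, pca

def pca_reconstruct(reduced: np.ndarray, pca: PCA) -> np.ndarray:
    """Lossy reconstruction back to the original dimensionality."""
    return pca.inverse_transform(reduced)

X = np.random.default_rng(1100).normal(size=(64, 128))
Z, model = pca_compress(X, k=8)
X_hat = pca_reconstruct(Z, model)
print(Z.shape)  # compressed representation
```

SVD and NMF (the two runners-up) slot into the same fit/transform interface in scikit-learn, which is presumably why they are easy to swap in the config space.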

2. Model: 70B-class models win on quality

  • Llama 3.1 70B: 5/11 templates, Qwen 2.5 72B: 3/11
  • 70B+ models account for 8/11 template bests
  • However, 8B achieves the best cost-efficiency on the Pareto frontier ($0.01 runs)

3. High frequency penalty is universal

  • Mean frequency_penalty across bests: 0.665 (range 0.025-0.907)
  • 9/11 template bests use freq_penalty > 0.5
  • This is the strongest universal signal: higher frequency penalty improves quality across all templates
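In OpenAI-style APIs both penalties must lie in [-2.0, 2.0]. A minimal sketch of how the finding maps onto request parameters (the builder function and its defaults are illustrative, not the project's actual request path):

```python
def sampling_params(frequency_penalty: float = 0.70,
                    presence_penalty: float = 0.45) -> dict:
    """Build OpenAI-style sampling parameters per the sweep findings."""
    if not (-2.0 <= frequency_penalty <= 2.0 and -2.0 <= presence_penalty <= 2.0):
        raise ValueError("OpenAI-style penalties must lie in [-2.0, 2.0]")
    return {
        "frequency_penalty": frequency_penalty,  # >0.5 in 9/11 template bests
        "presence_penalty": presence_penalty,    # bests clustered in 0.3-0.8
    }

print(sampling_params())
```

The dict can be splatted straight into a chat-completions call (`client.chat.completions.create(..., **sampling_params())`) on any OpenAI-compatible endpoint.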

4. Moderate presence penalty

  • Mean presence_penalty: 0.492 (range 0.256-0.813)
  • Less extreme than frequency penalty but consistently in the 0.3-0.8 range

Template-Specific Optimizations

5. Temporal mode is template-dependent (NOT universal)

  • directorial: 4/11 templates (hound_shadow, jefferson_dinner, kami_shrine, castaway_colony)
  • portal: 4/11 templates (hospital_crisis, mars_mission, vc_pitch, board_meeting)
  • forward: 3/11 templates (detective_prospection, sec_investigation, agent4_elk)
  • No single temporal mode dominates. This is the primary template-specific dimension.

6. Temperature is bimodal

  • Some templates prefer low temp (0.15-0.37), others high (0.80-0.98)
  • Mean 0.580 with high std (0.323); NOT a universal setting
  • Suggests narrative-heavy templates prefer higher temperature, analytical templates prefer lower

7. Max tokens varies widely

  • Range: 1389-7869, mean 3807
  • Token budget should be tuned per template complexity

Recommended Universal Config

Based on cross-template analysis, these settings generalize well:

```
autoresearch.compression_method=pca
llm_service.defaults.model=meta-llama/llama-3.1-70b-instruct
llm_service.defaults.frequency_penalty=0.70
llm_service.defaults.presence_penalty=0.45
```

Template-specific tuning should focus on:

  • temporal_mode.active_mode (directorial vs portal vs forward)
  • llm_service.defaults.temperature (0.2 vs 0.8 depending on template type)
  • llm_service.defaults.max_tokens (scale with template complexity)
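One way the preset idea from Next Steps could look: universal defaults layered under per-template overrides. A minimal sketch; the key strings mirror the override names above, but the dict structure and the two example entries (temporal modes from the per-template bests, temperatures from the bimodal split) are assumptions:

```python
UNIVERSAL = {
    "autoresearch.compression_method": "pca",
    "llm_service.defaults.model": "meta-llama/llama-3.1-70b-instruct",
    "llm_service.defaults.frequency_penalty": 0.70,
    "llm_service.defaults.presence_penalty": 0.45,
}

TEMPLATE_OVERRIDES = {
    # narrative-heavy: portal temporal mode, high temperature
    "hospital_crisis": {"temporal_mode.active_mode": "portal",
                        "llm_service.defaults.temperature": 0.8},
    # analytical: forward temporal mode, low temperature
    "sec_investigation": {"temporal_mode.active_mode": "forward",
                          "llm_service.defaults.temperature": 0.2},
}

def preset(template: str) -> dict:
    """Merge universal defaults with any template-specific overrides."""
    return {**UNIVERSAL, **TEMPLATE_OVERRIDES.get(template, {})}

print(preset("sec_investigation")["temporal_mode.active_mode"])
```

Templates without an entry simply fall back to the universal config, which keeps the preset table small as new templates are added.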

Next Steps

  • Run live (non-dry-run) validation of top 3 Pareto configs
  • Test if temporal mode correlates with template narrative structure (branching templates prefer portal/directorial?)
  • Verify frequency_penalty finding holds with real API calls
  • Create per-template config presets based on these findings

Metadata

Labels: autoresearch (Autoresearch optimization loop), finding (Autoresearch discovery)