
Autoresearch Pro-8: TDF + training data quality findings #24

@realityinspector


Summary

Pro-8 TDF + training data quality sweep: 30 iterations x 11 templates (330 total configs), dry-run mode, seed 1200, 27-dimensional config space across 6 mechanism clusters (fidelity, temporal, knowledge, entity, model, dialog).
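As a rough illustration of how such a sweep can be enumerated (all names here are hypothetical, not the actual autoresearch API), a seeded loop of 30 iterations over 11 templates yields the 330 configs:

```python
import random

SEED = 1200        # the sweep's seed
ITERATIONS = 30    # iterations per template
TEMPLATES = [f"template_{i:02d}" for i in range(11)]  # hypothetical template IDs

def sample_config(rng):
    """Sample one point from the 27-dimensional config space.
    Sketch: only a few representative dimensions are shown."""
    return {
        "temperature": rng.uniform(0.2, 1.2),
        "n_components": rng.randint(2, 10),
        "animism_level": rng.randint(1, 6),
    }

rng = random.Random(SEED)
configs = [
    {"template": t, "iteration": i, **sample_config(rng)}
    for t in TEMPLATES
    for i in range(ITERATIONS)
]
# 11 templates x 30 iterations = 330 dry-run configs
```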

Pareto Frontier (11 globally optimal configs)

| Run ID | quality_composite | cost_usd | causal_resolution |
| --- | --- | --- | --- |
| dry_c6b8017a0e9a | 0.8474 | $0.01 | 0.7003 |
| dry_e6decce10b69 | 0.8680 | $0.02 | 0.7439 |
| dry_e0bb43f090ba | 0.8723 | $0.03 | 0.7351 |
| dry_f0ee92879a71 | 0.8794 | $0.03 | 0.7792 |
| dry_e0ab3a5b4da2 | 0.8801 | $0.04 | 0.7696 |
| dry_eaa9ce2a1863 | 0.8898 | $0.05 | 0.7625 |
| dry_f552c3ba6ea3 | 0.9016 | $0.05 | 0.8109 |
| dry_fa024f540806 | 0.9032 | $0.12 | 0.8191 |
| dry_fab762e388f0 | 0.9034 | $0.14 | 0.8081 |
| dry_fa3fd5bea489 | 0.9133 | $0.14 | 0.8327 |
| dry_fe43c441925c | 0.9176 | $0.19 | 0.8478 |

Best quality: dry_fe43c441925c (q=0.9176)
Best efficiency: dry_c6b8017a0e9a (eff=84.74 quality/$)
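The frontier itself can be recovered with a simple dominance check over (quality, cost) pairs. This is a minimal sketch, not the code the sweep actually used:

```python
def pareto_frontier(runs):
    """A run is globally optimal if no other run dominates it, i.e. no other
    run has quality >= AND cost <= with at least one strict inequality."""
    frontier = [
        r for r in runs
        if not any(
            o["quality"] >= r["quality"] and o["cost"] <= r["cost"]
            and (o["quality"] > r["quality"] or o["cost"] < r["cost"])
            for o in runs
        )
    ]
    return sorted(frontier, key=lambda r: r["cost"])

def efficiency(run):
    """Quality per dollar, e.g. 0.8474 / $0.01 = 84.74 quality/$."""
    return run["quality"] / run["cost"]
```

Applied to the 330 dry-run results, this kind of check is what reduces them to the 11 frontier configs above.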

Highest Quality Config (Best for TDF Export / Training Data)

The config that produced the highest quality_composite (0.9176) across all 11 templates, and is therefore the best source of training data for downstream fine-tuning:

Model: qwen/qwen-2.5-72b-instruct (temperature=0.5348, top_p=0.7288, max_tokens=4183)
Compression: NMF with 2 components
Temporal: directorial mode, dramatic_tension=0.4698, low foreshadowing (0.066), coincidence_boost=1.065
Knowledge: forecast_horizon=50d, max_expectations=8, anxiety_conservatism=0.35
Entity: animism_level=6, night_penalty=1.07, fatigue_accumulation=0.33
Dialog: very low frequency_penalty (0.097), low presence_penalty (0.174)
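Expressed as a config object (field names are illustrative; the actual TDF config schema may differ), the top config looks roughly like:

```python
# Hypothetical representation of run dry_fe43c441925c's config
best_config = {
    "run_id": "dry_fe43c441925c",
    "model": {
        "name": "qwen/qwen-2.5-72b-instruct",
        "temperature": 0.5348,
        "top_p": 0.7288,
        "max_tokens": 4183,
    },
    "compression": {"method": "nmf", "n_components": 2},
    "temporal": {
        "mode": "directorial",
        "dramatic_tension": 0.4698,
        "foreshadowing": 0.066,
        "coincidence_boost": 1.065,
    },
    "knowledge": {
        "forecast_horizon_days": 50,
        "max_expectations": 8,
        "anxiety_conservatism": 0.35,
    },
    "entity": {
        "animism_level": 6,
        "night_penalty": 1.07,
        "fatigue_accumulation": 0.33,
    },
    "dialog": {"frequency_penalty": 0.097, "presence_penalty": 0.174},
}
```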

Key Findings for Training Data Quality

1. Model selection dominates quality

  • Top-tier quality (>0.90) consistently uses either qwen/qwen-2.5-72b-instruct or mistralai/mistral-large-latest
  • Budget tier (q=0.84-0.89) achievable with meta-llama/llama-3.1-8b-instruct at ~5-10x lower cost
  • deepseek/deepseek-chat lands in the middle at moderate cost

2. Temperature sweet spot: 0.5-0.95

  • Best quality config uses temperature=0.5348 (moderate, coherent)
  • Highest efficiency configs tend toward 0.9-1.05 (more creative but noisier)
  • Very low (<0.3) or very high (>1.1) temperatures hurt quality_composite

3. NMF compression outperforms PCA/SVD for quality

  • 5 of 7 configs with q>0.90 use NMF
  • Fewer components (2-7) beat many (10) for quality
  • SVD appears in one high-quality config (cyclical mode, q=0.8794)
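For reference, a 2-component NMF (shown here via scikit-learn, which may or may not be what the pipeline uses) factors a nonnegative feature matrix as X ≈ W·H:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1200)  # the sweep's seed, reused for illustration
X = rng.random((330, 27))          # e.g. 330 runs x 27 config dimensions

# n_components=2 matches the top config; nndsvda is a deterministic init
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(X)  # (330, 2) low-dimensional representation
H = model.components_       # (2, 27) nonnegative basis vectors

X_hat = W @ H  # rank-2 reconstruction of X
```

Unlike PCA/SVD, both factors are constrained nonnegative, which tends to produce parts-based, more interpretable components.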

4. Directorial and cyclical temporal modes produce highest quality

  • The top config uses directorial mode with moderate dramatic tension (~0.47)
  • Cyclical mode appears in 3 of the top 7 Pareto configs
  • Branching mode is mid-tier; portal and forward modes are not represented on the frontier

5. Dialog penalties should be low for quality

  • Best quality config: freq_penalty=0.097, presence_penalty=0.174
  • High penalties (>0.5) appear only in budget-tier configs
  • This makes sense: low repetition penalties allow richer, more detailed narrative outputs

6. High animism level correlates with quality

  • Top config uses animism_level=6 (max)
  • Budget configs use level 2-4
  • Higher animism = more entity variety = richer training data

Cost Implications

| Tier | quality_composite | cost_usd | Recommended Use |
| --- | --- | --- | --- |
| Premium (qwen-72b / mistral-large) | 0.90-0.92 | $0.12-0.19 | Production TDF export, fine-tuning dataset |
| Mid-range (8b + high tokens) | 0.88-0.90 | $0.04-0.05 | Validation runs, secondary training data |
| Budget (8b, low tokens) | 0.84-0.87 | $0.01-0.03 | Rapid prototyping, parameter sweeps |

For a 1000-run training dataset, extrapolating per-run costs linearly: Premium = ~$120-190, Mid-range = ~$40-50, Budget = ~$10-30.
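The dataset-cost figures are linear extrapolations of per-run cost; a quick sketch, with per-run ranges taken from the tier table above:

```python
def dataset_cost(n_runs, per_run_range):
    """Linear extrapolation: (low, high) total cost in USD for n_runs runs."""
    lo, hi = per_run_range
    return n_runs * lo, n_runs * hi

TIERS = {                      # per-run cost_usd ranges from the tier table
    "premium": (0.12, 0.19),
    "mid_range": (0.04, 0.05),
    "budget": (0.01, 0.03),
}

for tier, per_run in TIERS.items():
    lo, hi = dataset_cost(1000, per_run)
    print(f"{tier}: ${lo:.0f}-{hi:.0f} per 1000 runs")
```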

Recommendation

For TDF export quality targeting downstream fine-tuning:

  1. Use the Premium config (qwen-72b, NMF-2, directorial, low dialog penalties) for the core training set
  2. Augment with Mid-range configs across different temporal modes for diversity
  3. The 19x cost difference between budget and premium is justified by the 0.08 absolute quality gain (0.84 vs 0.92, roughly 9-10% relative), which compounds through fine-tuning loss curves

Artifacts

  • Results JSONL: autoresearch/results/dry_run_20260316_083717.jsonl (330 runs)
  • Pareto frontier: autoresearch/results/pareto_20260316_083717.json (11 optimal configs)
  • Branch: autoresearch/pro/tdf-training
  • Seed: 1200, iterations: 30/template, templates: 11 (all)
