## Summary
Pro-8 TDF + training data quality sweep: 30 iterations × 11 templates (330 total configs), dry-run mode, seed 1200, 27-dimensional config space across 6 mechanism clusters (fidelity, temporal, knowledge, entity, model, dialog).
## Pareto Frontier (11 globally optimal configs)
| Run ID | quality_composite | cost_usd | causal_resolution |
|---|---|---|---|
| dry_c6b8017a0e9a | 0.8474 | $0.01 | 0.7003 |
| dry_e6decce10b69 | 0.8680 | $0.02 | 0.7439 |
| dry_e0bb43f090ba | 0.8723 | $0.03 | 0.7351 |
| dry_f0ee92879a71 | 0.8794 | $0.03 | 0.7792 |
| dry_e0ab3a5b4da2 | 0.8801 | $0.04 | 0.7696 |
| dry_eaa9ce2a1863 | 0.8898 | $0.05 | 0.7625 |
| dry_f552c3ba6ea3 | 0.9016 | $0.05 | 0.8109 |
| dry_fa024f540806 | 0.9032 | $0.12 | 0.8191 |
| dry_fab762e388f0 | 0.9034 | $0.14 | 0.8081 |
| dry_fa3fd5bea489 | 0.9133 | $0.14 | 0.8327 |
| dry_fe43c441925c | 0.9176 | $0.19 | 0.8478 |
- Best quality: `dry_fe43c441925c` (q=0.9176)
- Best efficiency: `dry_c6b8017a0e9a` (eff=84.74 quality/$)
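The frontier above is the set of runs not dominated on (quality, cost). A minimal sketch of that dominance check, using the table's field names `quality_composite` and `cost_usd` (the sample runs are illustrative, not from the sweep):

```python
def pareto_frontier(runs):
    """Keep runs that no other run dominates.

    A run is dominated if some other run has quality >= its quality AND
    cost <= its cost, with at least one inequality strict.
    """
    frontier = []
    for r in runs:
        dominated = any(
            o["quality_composite"] >= r["quality_composite"]
            and o["cost_usd"] <= r["cost_usd"]
            and (o["quality_composite"] > r["quality_composite"]
                 or o["cost_usd"] < r["cost_usd"])
            for o in runs
        )
        if not dominated:
            frontier.append(r)
    # Sort by cost, as in the table above
    return sorted(frontier, key=lambda r: r["cost_usd"])

runs = [
    {"run_id": "a", "quality_composite": 0.90, "cost_usd": 0.10},
    {"run_id": "b", "quality_composite": 0.80, "cost_usd": 0.05},
    {"run_id": "c", "quality_composite": 0.85, "cost_usd": 0.20},  # dominated by "a"
]
print([r["run_id"] for r in pareto_frontier(runs)])  # → ['b', 'a']
```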
## Highest Quality Config (Best for TDF Export / Training Data)
The config that produces the highest quality_composite (0.9176) across all 11 templates, and therefore the best training data for downstream fine-tuning:
- Model: `qwen/qwen-2.5-72b-instruct` (temperature=0.5348, top_p=0.7288, max_tokens=4183)
- Compression: NMF with 2 components
- Temporal: directorial mode, dramatic_tension=0.4698, low foreshadowing (0.066), coincidence_boost=1.065
- Knowledge: forecast_horizon=50d, max_expectations=8, anxiety_conservatism=0.35
- Entity: animism_level=6, night_penalty=1.07, fatigue_accumulation=0.33
- Dialog: very low frequency_penalty (0.097), low presence_penalty (0.174)
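Collected into a single structure for reference, the grouping and key names below are my own (the sweep's actual config schema is not shown in this report); only the values come from the run above:

```python
# Hypothetical layout; key names are illustrative, values are from the report.
best_config = {
    "run_id": "dry_fe43c441925c",
    "model": {
        "name": "qwen/qwen-2.5-72b-instruct",
        "temperature": 0.5348,
        "top_p": 0.7288,
        "max_tokens": 4183,
    },
    "compression": {"method": "nmf", "n_components": 2},
    "temporal": {
        "mode": "directorial",
        "dramatic_tension": 0.4698,
        "foreshadowing": 0.066,
        "coincidence_boost": 1.065,
    },
    "knowledge": {
        "forecast_horizon_days": 50,
        "max_expectations": 8,
        "anxiety_conservatism": 0.35,
    },
    "entity": {"animism_level": 6, "night_penalty": 1.07, "fatigue_accumulation": 0.33},
    "dialog": {"frequency_penalty": 0.097, "presence_penalty": 0.174},
}
```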
## Key Findings for Training Data Quality
1. Model selection dominates quality
- Top-tier quality (>0.90) consistently uses either `qwen/qwen-2.5-72b-instruct` or `mistralai/mistral-large-latest`
- Budget tier (q=0.84-0.89) is achievable with `meta-llama/llama-3.1-8b-instruct` at ~5-10x lower cost
- `deepseek/deepseek-chat` lands in the middle at moderate cost
2. Temperature sweet spot: 0.5-0.95
- Best quality config uses temperature=0.5348 (moderate, coherent)
- Highest efficiency configs tend toward 0.9-1.05 (more creative but noisier)
- Very low (<0.3) or very high (>1.1) temperatures hurt quality_composite
3. NMF compression outperforms PCA/SVD for quality
- 5 of 7 configs with q>0.90 use NMF
- Fewer components (2-7) work better for quality than many (10)
- SVD appears in one high-quality config (cyclical mode, q=0.8794)
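As a rough illustration of what NMF-2 compression does (the data below is synthetic; the sweep's actual feature matrix is not shown in this report):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(1200)  # same seed as the sweep, purely for flavor
X = rng.rand(20, 10)               # nonnegative feature matrix (synthetic)

# NMF factorizes X ≈ W @ H with W, H >= 0; n_components=2 matches the top config
model = NMF(n_components=2, init="nndsvda", random_state=1200, max_iter=500)
W = model.fit_transform(X)         # (20, 2) compressed representation
H = model.components_              # (2, 10) nonnegative basis
X_hat = W @ H                      # rank-2 reconstruction of X
print(W.shape, H.shape)            # (20, 2) (2, 10)
```

Unlike PCA/SVD, both factors are constrained to be nonnegative, which tends to produce parts-based, more interpretable components; that is one plausible reason NMF fares well here, though this report doesn't establish the mechanism.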
4. Directorial and cyclical temporal modes produce highest quality
- The top config uses directorial mode with moderate dramatic tension (~0.47)
- Cyclical mode appears in 3 of the top 7 Pareto configs
- Branching mode is mid-tier; portal/forward not represented on the frontier
5. Dialog penalties should be low for quality
- Best quality config: freq_penalty=0.097, presence_penalty=0.174
- High penalties (>0.5) appear only in budget-tier configs
- This makes sense: low repetition penalties allow richer, more detailed narrative outputs
6. High animism level correlates with quality
- Top config uses animism_level=6 (max)
- Budget configs use level 2-4
- Higher animism = more entity variety = richer training data
## Cost Implications
| Tier | quality_composite | cost_usd | Recommended Use |
|---|---|---|---|
| Premium (qwen-72b/mistral-large) | 0.90-0.92 | $0.12-0.19 | Production TDF export, fine-tuning dataset |
| Mid-range (8b + high tokens) | 0.88-0.90 | $0.04-0.05 | Validation runs, secondary training data |
| Budget (8b, low tokens) | 0.84-0.87 | $0.01-0.03 | Rapid prototyping, parameter sweeps |
For a 1000-run training dataset: Premium = ~$150-190, Mid-range = ~$40-50, Budget = ~$10-30.
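The dataset estimates are back-of-envelope per-run cost × run count; a quick check (note the premium low end computes to ~$120, so the ~$150 figure in the text presumably anchors toward the upper half of the $0.12-0.19 range):

```python
# Rough dataset-cost estimate: per-run cost range x number of runs.
tiers = {
    "premium":  (0.12, 0.19),
    "midrange": (0.04, 0.05),
    "budget":   (0.01, 0.03),
}
runs = 1000
for name, (lo, hi) in tiers.items():
    print(f"{name}: ${lo * runs:.0f}-${hi * runs:.0f}")
```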
## Recommendation
For TDF export quality targeting downstream fine-tuning:
- Use the Premium config (qwen-72b, NMF-2, directorial, low dialog penalties) for the core training set
- Augment with Mid-range configs across different temporal modes for diversity
- The 19x cost difference between budget and premium is justified by the ~0.08-point quality improvement (0.84 vs 0.92), which compounds through fine-tuning loss curves
## Artifacts
- Results JSONL: `autoresearch/results/dry_run_20260316_083717.jsonl` (330 runs)
- Pareto frontier: `autoresearch/results/pareto_20260316_083717.json` (11 optimal configs)
- Branch: `autoresearch/pro/tdf-training`
- Seed: 1200, iterations: 30/template, templates: 11 (all)