Evo compares experiments using a single run each. For noisy benchmarks (LLM calls, sampling, anything network-dependent), one lucky run can land as the new best score and bias every comparison that comes after it.
Evo should run noisy benchmarks several times and compare aggregates (median or mean) before deciding; deterministic benchmarks don't need the repeats.
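A minimal sketch of what that could look like, assuming a hypothetical `run_benchmark` callable and using the median as the aggregate (the median is my assumption here; it resists a single lucky outlier better than the mean):

```python
import statistics

def score_experiment(run_benchmark, noisy: bool, n_runs: int = 5) -> float:
    """Return a score suitable for cross-experiment comparison.

    For noisy benchmarks, run several times and take the median so one
    lucky run can't become the new best score. Deterministic benchmarks
    get a single run.
    """
    if not noisy:
        return run_benchmark()
    scores = [run_benchmark() for _ in range(n_runs)]
    return statistics.median(scores)
```

The choice of `n_runs` and the aggregate would likely need tuning per benchmark; the point is only that the comparison happens on the aggregate, not on whichever single score came back.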
Hit this while thinking about what the optimize skill should actually do with non-deterministic benchmarks — right now it trusts whatever single score comes back, which is wrong for anything LLM-driven.