Evo compares experiments using a single run each. For noisy benchmarks (LLM calls, sampling, anything network-dependent), one lucky run can land as the new best score and bias every comparison that comes after it.
Evo should run noisy benchmarks several times and compare aggregates (median or mean) before deciding; deterministic benchmarks don't need the repeats.
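A minimal sketch of what that could look like, assuming a hypothetical `run_benchmark` callable and using the median as the aggregate (the median is my assumption here; it resists a single lucky outlier better than the mean):

```python
import statistics

def score_experiment(run_benchmark, noisy: bool, n_runs: int = 5) -> float:
    """Return a score suitable for cross-experiment comparison.

    For noisy benchmarks, run several times and take the median so one
    lucky run can't become the new best score. Deterministic benchmarks
    get a single run.
    """
    if not noisy:
        return run_benchmark()
    scores = [run_benchmark() for _ in range(n_runs)]
    return statistics.median(scores)
```

The choice of `n_runs` and the aggregate would likely need tuning per benchmark; the point is only that the comparison happens on the aggregate, not on whichever single score came back.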
Hit this while thinking about what the optimize skill should actually do with non-deterministic benchmarks — right now it trusts whatever single score comes back, which is wrong for anything LLM-driven.