Variance-aware scoring for noisy benchmarks #4

@alokwhitewolf

Description
Evo compares experiments using a single run each. For noisy benchmarks (LLM calls, sampling, anything network-dependent), one lucky run can get recorded as the new best score and bias every comparison that comes after it.

Evo should run noisy benchmarks a few times and compare aggregates (e.g. mean and spread) before deciding whether a candidate is actually better. Deterministic benchmarks don't need repeated runs.
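A minimal sketch of what "compare aggregates" could mean here, assuming a hypothetical `run_benchmark` callable that returns one noisy score per call (the function names, `n_runs`, and the `margin` heuristic are illustrative, not Evo's actual API):

```python
import statistics

def aggregate_score(run_benchmark, n_runs=5):
    """Run a noisy benchmark several times; return (mean, standard error)."""
    scores = [run_benchmark() for _ in range(n_runs)]
    mean = statistics.mean(scores)
    # Standard error of the mean: sample stdev shrunk by sqrt(n).
    stderr = statistics.stdev(scores) / n_runs ** 0.5 if n_runs > 1 else 0.0
    return mean, stderr

def is_improvement(candidate, best, margin=1.0):
    """Accept a candidate only if its mean beats the current best by
    `margin` combined standard errors, so a single lucky run can't win."""
    c_mean, c_err = candidate
    b_mean, b_err = best
    combined_err = (c_err ** 2 + b_err ** 2) ** 0.5
    return c_mean - b_mean > margin * combined_err
```

With `margin=1.0` this is a crude one-sigma gate; a proper Welch's t-test would be stricter, but even this rough check stops a 0.905-vs-0.900 "win" inside the noise band from replacing the best score.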

Hit this while thinking about what the optimize skill should actually do with non-deterministic benchmarks: right now it just trusts whatever single score came back, which is wrong for anything LLM-driven.

Labels: enhancement (New feature or request)