@AlanPonnachan (Contributor)

Add support for repeated evaluations in Evaluation Report

This PR closes #2053

This enables repeated execution of test cases to measure model stochasticity and performance stability.

Previously, Dataset.evaluate only supported running each test case once. To gauge variance in LLM responses (e.g., how often a prompt succeeds), users had to manually loop and aggregate reports. This PR adds native support for repeated evaluations, automatically aggregating results into a unified report.
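
For context, the manual pattern this replaces looked roughly like the sketch below. Here my_agent and Dataset(...) are placeholders (as in the usage example further down), and the aggregation itself is left as a comment because it previously had to be written by hand:

from pydantic_evals import Dataset

dataset = Dataset(...)  # placeholder construction

# Before this change: execute the whole dataset several times yourself...
reports = [dataset.evaluate_sync(my_agent) for _ in range(5)]
# ...then aggregate pass rates / score variance across the reports by hand.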

Key Changes:

  • Dataset API: Updated evaluate and evaluate_sync to accept a runs parameter (default 1). When runs > 1, each test case is executed multiple times.
  • New Reporting Entity: Introduced ReportCaseMultiRun to represent a case that has been executed multiple times. It stores the individual ReportCase objects for each run and a ReportCaseAggregate for summary statistics (see the sketch after this list).
  • Rendering: Enhanced EvaluationRenderer to handle ReportCaseMultiRun.
    • Console tables now display aggregated scores/metrics for multi-run cases (marked, e.g., as case_id (3 runs)).
    • Diff tables support comparing a single-run baseline against a multi-run experiment (and vice-versa) by comparing values against aggregates.
  • Aggregation Logic: Refactored ReportCaseAggregate.average to robustly calculate statistics from a mix of single ReportCase and ReportCaseMultiRun objects.
  • Tests: Added tests/evals/test_repeated_runs.py covering execution loops, mixed failure/success scenarios, and rendering verification.
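
Conceptually, a multi-run case pairs the per-run results with a summary, and the pass-rate arithmetic behind the aggregate is simple. The field and class names below are illustrative assumptions, a sketch of the shape rather than the actual attributes introduced by this PR:

from dataclasses import dataclass

@dataclass
class MultiRunSketch:
    """Illustrative stand-in for ReportCaseMultiRun: one logical case executed N times."""
    case_id: str
    runs: list        # per-run results (ReportCase objects in the actual PR)
    aggregate: dict   # summary statistics across runs (a ReportCaseAggregate in the actual PR)

def pass_rate(run_passed: list[bool]) -> float:
    """E.g. [True, True, True, True, False] -> 0.8, matching the "4/5 passed" example below."""
    return sum(run_passed) / len(run_passed)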

Example Usage:

from pydantic_evals import Dataset

dataset = Dataset(...)

# Run each test case 5 times to smooth out variance
report = await dataset.evaluate(my_agent, runs=5)

# The report will show aggregated scores (e.g., accuracy 80% if 4/5 passed)
report.print()
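
Because evaluate is awaited above, that snippet has to run inside an async context. Per the Dataset API change listed earlier, the synchronous entry point accepts the same parameter, so an equivalent sketch without an event loop would be:

# Same behaviour via the synchronous API: evaluate_sync also accepts runs.
report = dataset.evaluate_sync(my_agent, runs=5)
report.print()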

[Generated snapshot excerpt: rendered report tables (Averages row and header fragment) in which a wrapped column header appears as "Assertio"]
@AlanPonnachan (Contributor, Author) commented:

The tests are failing because codespell flags this line (Assertio ==> Assertion).

However, these lines are automatically generated by the snapshot update, so I’m unable to modify them directly.
I’m looking for guidance on how to handle this situation.
