@AlanPonnachan (Contributor)

Add support for repeated evaluations in Evaluation Report

This PR closes #2053

This enables repeated execution of test cases to measure model stochasticity and performance stability.

Previously, Dataset.evaluate only supported running each test case once. To gauge variance in LLM responses (e.g., how often a prompt succeeds), users had to manually loop and aggregate reports. This PR adds native support for repeated evaluations, automatically aggregating results into a unified report.
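
For context, the manual pattern this replaces looked roughly like the sketch below. Here my_agent and Dataset(...) are placeholders (as in the usage example further down), and the aggregation itself is left as a comment because it previously had to be written by hand:

from pydantic_evals import Dataset

dataset = Dataset(...)  # placeholder construction

# Before this change: execute the whole dataset several times yourself...
reports = [dataset.evaluate_sync(my_agent) for _ in range(5)]
# ...then aggregate pass rates / score variance across the reports by hand.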

Key Changes:

  • Dataset API: Updated evaluate and evaluate_sync to accept a runs parameter (default 1). When runs > 1, each test case is executed multiple times.
  • New Reporting Entity: Introduced ReportCaseMultiRun to represent a case that has been executed multiple times. It stores the individual ReportCase objects for each run and a ReportCaseAggregate for summary statistics (see the sketch after this list).
  • Rendering: Enhanced EvaluationRenderer to handle ReportCaseMultiRun.
    • Console tables now display aggregated scores/metrics for multi-run cases (marked, e.g., as case_id (3 runs)).
    • Diff tables support comparing a single-run baseline against a multi-run experiment (and vice-versa) by comparing values against aggregates.
  • Aggregation Logic: Refactored ReportCaseAggregate.average to robustly calculate statistics from a mix of single ReportCase and ReportCaseMultiRun objects.
  • Tests: Added tests/evals/test_repeated_runs.py covering execution loops, mixed failure/success scenarios, and rendering verification.
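
Conceptually, a multi-run case pairs the per-run results with a summary, and the pass-rate arithmetic behind the aggregate is simple. The field and class names below are illustrative assumptions, a sketch of the shape rather than the actual attributes introduced by this PR:

from dataclasses import dataclass

@dataclass
class MultiRunSketch:
    """Illustrative stand-in for ReportCaseMultiRun: one logical case executed N times."""
    case_id: str
    runs: list        # per-run results (ReportCase objects in the actual PR)
    aggregate: dict   # summary statistics across runs (a ReportCaseAggregate in the actual PR)

def pass_rate(run_passed: list[bool]) -> float:
    """E.g. [True, True, True, True, False] -> 0.8, matching the "4/5 passed" example below."""
    return sum(run_passed) / len(run_passed)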

Example Usage:

from pydantic_evals import Dataset

dataset = Dataset(...)

# Run each test case 5 times to smooth out variance
report = await dataset.evaluate(my_agent, runs=5)

# The report will show aggregated scores (e.g., accuracy 80% if 4/5 passed)
report.print()
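
Because evaluate is awaited above, that snippet has to run inside an async context. Per the Dataset API change listed earlier, the synchronous entry point accepts the same parameter, so an equivalent sketch without an event loop would be:

# Same behaviour via the synchronous API: evaluate_sync also accepts runs.
report = dataset.evaluate_sync(my_agent, runs=5)
report.print()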

[Generated snapshot excerpt: rendered report tables (Averages row and header fragment) in which a wrapped column header appears as "Assertio"]
@AlanPonnachan (Contributor, Author) commented:

The tests are failing because codespell flags this line (Assertio ==> Assertion).

However, these lines are automatically generated by the snapshot update, so I’m unable to modify them directly.
I’m looking for guidance on how to handle this situation.
