## Summary

A Pi extension that provides a formal evaluation framework for measuring agent session quality, task completion accuracy, and regression detection. Inspired by ECC's `eval-harness` skill and eval-driven development (EDD) methodology. Targeted at extension authors and teams who need to verify that agent behavior is improving (or at least not degrading) over time.
## Motivation

Extension authors (including ourselves with `pi-continuous-learning`) need to answer: "Did this change make the agent better or worse?" Without structured evals, quality is measured by vibes. An eval harness provides reproducible, scored benchmarks that can run against any agent configuration, catching regressions before they ship.
## Proposed Features

### 1. Eval Definition Format
- YAML/JSON eval specs (sketched below) defining:
  - Input: user prompt, initial file state, repo context
  - Expected: files created/modified, test results, specific code patterns
  - Scoring: binary (pass/fail), rubric (0-10), or a custom scorer function
- Eval suites: groups of related evals (e.g., "TypeScript refactoring", "Python testing")
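To make the format concrete, here is a minimal sketch of the spec shape as a TypeScript type. Every field name here (`input`, `expected`, `scoring`, and so on) is an illustrative assumption, not a settled schema; the same structure could equally be authored as YAML or JSON.

```typescript
// Hypothetical shape of an eval spec -- field names are illustrative only.
interface EvalSpec {
  name: string;
  suite: string;                      // e.g. "typescript-refactoring"
  input: {
    prompt: string;                   // the user prompt fed to the agent
    files?: Record<string, string>;   // initial file state: path -> contents
    repo?: string;                    // optional repo context
  };
  expected: {
    filesModified?: string[];         // paths the agent should touch
    testsPass?: boolean;              // whether the project's tests must pass
    patterns?: string[];              // regexes that must appear in the result
  };
  scoring: "binary" | "rubric" | { scorer: string }; // custom scorer module path
}

// An example instance of the assumed schema:
const example: EvalSpec = {
  name: "extract-function",
  suite: "typescript-refactoring",
  input: {
    prompt: "Extract the duplicated validation logic into a helper.",
    files: { "src/form.ts": "/* fixture contents */" },
  },
  expected: { filesModified: ["src/form.ts"], testsPass: true },
  scoring: "binary",
};
```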
### 2. Eval Runner (`/eval`)

- `/eval run <suite>` — run all evals in a suite
- `/eval run <suite> --eval <name>` — run a specific eval
- `/eval compare <baseline> <candidate>` — compare two runs
- Runs each eval in an isolated environment (temp directory, clean git state); see the sketch after this list
- Parallel execution where evals are independent
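A minimal sketch of the isolation step, assuming a Node.js runtime; `runAgent` below is a hypothetical stand-in for however Pi actually invokes an agent session.

```typescript
import { mkdtemp, rm, cp } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const exec = promisify(execFile);

// Run one eval in a throwaway directory with a clean git state,
// so evals cannot interfere with each other or with the host checkout.
async function runIsolated(spec: { name: string; fixtureDir: string }) {
  const dir = await mkdtemp(join(tmpdir(), `eval-${spec.name}-`));
  try {
    await cp(spec.fixtureDir, dir, { recursive: true }); // seed initial file state
    await exec("git", ["init"], { cwd: dir });
    await exec("git", ["add", "-A"], { cwd: dir });
    await exec(
      "git",
      ["-c", "user.email=eval@local", "-c", "user.name=eval", "commit", "-m", "eval fixture"],
      { cwd: dir },
    );
    // Hypothetical: invoke the agent against the prepared workspace.
    // return await runAgent(spec, dir);
  } finally {
    await rm(dir, { recursive: true, force: true }); // always clean up
  }
}
```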
### 3. Scoring and Reporting
- Per-eval scores with pass/fail/score breakdown
- Suite-level aggregate scores
- Diff reports: "3 regressions, 2 improvements, 15 unchanged" (computed as in the sketch below)
- Historical trend tracking across runs
- Export as JSON for CI integration
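One way the diff report could be computed, as a sketch; the `RunResult` shape is an assumption, not a defined format.

```typescript
// Hypothetical per-eval result as stored in a run's results file.
interface RunResult {
  eval: string;
  score: number; // 1/0 for binary evals, 0-10 for rubric evals
}

// Compare two runs eval-by-eval and summarize the movement.
function diffRuns(baseline: RunResult[], candidate: RunResult[]): string {
  const base = new Map(baseline.map((r) => [r.eval, r.score]));
  let regressions = 0, improvements = 0, unchanged = 0;
  for (const r of candidate) {
    const prev = base.get(r.eval);
    if (prev === undefined) continue; // new eval, not comparable
    if (r.score < prev) regressions++;
    else if (r.score > prev) improvements++;
    else unchanged++;
  }
  return `${regressions} regressions, ${improvements} improvements, ${unchanged} unchanged`;
}
```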
### 4. Regression Detection

- Baseline recording: `/eval baseline <suite>` saves current scores
- CI integration: fail the build if any eval regresses below baseline (sketched below)
- Alert on score degradation trends (even if above baseline)
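A sketch of the CI gate under the same assumed `RunResult` shape: load the saved baseline, compare the candidate run against it, and exit nonzero so the build fails on any regression.

```typescript
import { readFileSync } from "node:fs";

interface RunResult { eval: string; score: number; }

// Exit nonzero if any eval scores below its recorded baseline --
// enough to fail a CI job that runs this as a script.
function ciGate(baselinePath: string, candidatePath: string): void {
  const baseline: RunResult[] = JSON.parse(readFileSync(baselinePath, "utf8"));
  const candidate: RunResult[] = JSON.parse(readFileSync(candidatePath, "utf8"));
  const base = new Map(baseline.map((r) => [r.eval, r.score]));
  const regressed = candidate.filter((r) => {
    const prev = base.get(r.eval);
    return prev !== undefined && r.score < prev;
  });
  if (regressed.length > 0) {
    console.error(`Regressions: ${regressed.map((r) => r.eval).join(", ")}`);
    process.exit(1);
  }
}
```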
### 5. Extension Author Tools

- Eval generators: scaffold eval specs from existing test cases
- `pi.registerTool()`: `eval_run`, `eval_compare` for LLM-driven eval analysis (see the registration sketch below)
- Hooks for running evals on extension changes
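A sketch of what registering `eval_run` might look like. The proposal only names `pi.registerTool()` as the API surface; the signature, options object, and `runSuite` helper below are all assumptions.

```typescript
// Assumed host API -- the real pi.registerTool() signature may differ.
declare const pi: { registerTool(tool: unknown): void };
// Hypothetical runner from section 2.
declare function runSuite(suite: string, opts?: { only?: string }): Promise<unknown>;

pi.registerTool({
  name: "eval_run",
  description: "Run an eval suite and return per-eval scores",
  parameters: {
    suite: { type: "string", description: "Suite name, e.g. typescript-refactoring" },
    eval: { type: "string", optional: true, description: "Single eval to run" },
  },
  async handler(args: { suite: string; eval?: string }) {
    return await runSuite(args.suite, args.eval ? { only: args.eval } : {});
  },
});
```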
## Pi Extension API Integration

| API Surface | Usage |
| --- | --- |
| `pi.registerCommand()` | `/eval` with subcommands |
| `pi.registerTool()` | `eval_run`, `eval_compare`, `eval_report` |
| File system | Eval specs in `.pi/evals/`, results in `~/.pi/eval-harness/` |
| Bash execution | Run evals in an isolated subprocess |
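On the command side, a sketch of `/eval` subcommand dispatch, again with an assumed `pi.registerCommand()` signature; the helpers named in the comments are hypothetical.

```typescript
// Assumed host API -- the real pi.registerCommand() signature may differ.
declare const pi: {
  registerCommand(name: string, handler: (args: string[]) => Promise<void>): void;
};

// Route /eval subcommands to their (hypothetical) handlers.
pi.registerCommand("eval", async (args) => {
  const [sub, ...rest] = args;
  switch (sub) {
    case "run":      break; // await runSuite(rest[0], parseFlags(rest))
    case "compare":  break; // await compareRuns(rest[0], rest[1])
    case "baseline": break; // await saveBaseline(rest[0])
    default:
      console.log("usage: /eval <run|compare|baseline> ...");
  }
});
```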
## Implementation Notes
- Eval isolation: each eval runs in a fresh temp directory with controlled state
- Scoring: start with binary pass/fail, add rubric scoring later
- Storage: results in `~/.pi/eval-harness/runs/<run-id>/results.json`
- Consider: can evals use Pi's own session API to replay prompts? Would make evals more realistic.
- Audience is primarily extension authors, not end users
## Prior Art

- ECC `eval-harness`: formal evaluation framework with eval-driven development methodology
- ECC `ai-regression-testing`: regression testing for AI-assisted development
- Anthropic's own eval frameworks for Claude
- No existing Pi extension provides structured evaluation
## Effort Estimate
Medium to high. The eval runner and isolation logic are the bulk of the work. Scoring and reporting are straightforward. CI integration adds value but can be phased.