feat: pi-eval-harness — formal evaluation framework for measuring agent quality #85

@MattDevy

Description

Summary

A Pi extension that provides a formal evaluation framework for measuring agent session quality, task completion accuracy, and regression detection. Inspired by ECC's eval-harness skill and eval-driven development (EDD) methodology. Targeted at extension authors and teams who need to verify that agent behavior is improving (or at least not degrading) over time.

Motivation

Extension authors (including ourselves with pi-continuous-learning) need to answer: "Did this change make the agent better or worse?" Without structured evals, quality is measured by vibes. An eval harness provides reproducible, scored benchmarks that can run against any agent configuration, catching regressions before they ship.

Proposed Features

1. Eval Definition Format

  • YAML/JSON eval specs defining:
    • Input: user prompt, initial file state, repo context
    • Expected: files created/modified, test results, specific code patterns
    • Scoring: binary (pass/fail), rubric (0-10), or custom scorer function
  • Eval suites: groups of related evals (e.g., "TypeScript refactoring", "Python testing")
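As a sketch of what such a spec could look like, here is a hypothetical YAML eval; every field name is illustrative, since the schema is not yet fixed:

```yaml
# Hypothetical eval spec; field names are illustrative, not a fixed schema.
name: rename-exported-function
suite: typescript-refactoring
input:
  prompt: "Rename the exported function fetchUser to getUser and update all call sites."
  files:
    src/api.ts: |
      export function fetchUser(id: string) { /* ... */ }
expected:
  files_modified:
    - src/api.ts
  patterns:
    - file: src/api.ts
      must_contain: "export function getUser"
      must_not_contain: "fetchUser"
scoring: binary   # binary | rubric | custom
```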

2. Eval Runner (/eval)

  • /eval run <suite> — run all evals in a suite
  • /eval run <suite> --eval <name> — run a specific eval
  • /eval compare <baseline> <candidate> — compare two runs
  • Runs evals in isolated environment (temp directory, clean git state)
  • Parallel execution where evals are independent
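The isolation requirement above can be sketched as follows; this is a minimal Node.js sketch assuming a sequential runner, where `EvalSpec` and `runSuite` are hypothetical names, not part of any existing Pi API:

```typescript
import { mkdtempSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

type EvalResult = { name: string; passed: boolean };

// Hypothetical shape of a single eval: given a fresh working directory,
// set up state, exercise the agent, and return pass/fail.
interface EvalSpec {
  name: string;
  run: (workDir: string) => boolean;
}

// Run each eval in its own temp directory so no state leaks between evals,
// and clean up even when an eval throws.
function runSuite(evals: EvalSpec[]): EvalResult[] {
  return evals.map((spec) => {
    const workDir = mkdtempSync(join(tmpdir(), "pi-eval-"));
    try {
      return { name: spec.name, passed: spec.run(workDir) };
    } finally {
      rmSync(workDir, { recursive: true, force: true });
    }
  });
}
```

Parallel execution would replace the `map` with a worker pool, but only for evals declared independent.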

3. Scoring and Reporting

  • Per-eval scores with pass/fail/score breakdown
  • Suite-level aggregate scores
  • Diff reports: "3 regressions, 2 improvements, 15 unchanged"
  • Historical trend tracking across runs
  • Export as JSON for CI integration
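The diff report above reduces to a per-eval score comparison. A minimal sketch, assuming scores are keyed by eval name (`compareRuns` is a hypothetical helper, not an existing API):

```typescript
type Scores = Record<string, number>;

interface DiffReport {
  regressions: string[];
  improvements: string[];
  unchanged: string[];
}

// Classify each eval present in both runs by its score delta.
// Evals added or removed between runs are not classified here.
function compareRuns(baseline: Scores, candidate: Scores): DiffReport {
  const report: DiffReport = { regressions: [], improvements: [], unchanged: [] };
  for (const name of Object.keys(baseline)) {
    if (!(name in candidate)) continue;
    const delta = candidate[name] - baseline[name];
    if (delta < 0) report.regressions.push(name);
    else if (delta > 0) report.improvements.push(name);
    else report.unchanged.push(name);
  }
  return report;
}
```

The counts in the report ("3 regressions, 2 improvements, 15 unchanged") fall directly out of the three array lengths, and the same structure serializes to JSON for CI.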

4. Regression Detection

  • Baseline recording: /eval baseline <suite> saves current scores
  • CI integration: fail the build if any eval regresses below baseline
  • Alert on score degradation trends (even if above baseline)
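The CI gate is a simple predicate over recorded baseline scores; a sketch, assuming a missing eval in the current run counts as a regression (`passesBaseline` is an illustrative name):

```typescript
// Returns false (fail the build) if any eval scores below its baseline.
// An eval missing from the current run is treated as scoring 0.
function passesBaseline(
  baseline: Record<string, number>,
  current: Record<string, number>
): boolean {
  return Object.entries(baseline).every(
    ([name, score]) => (current[name] ?? 0) >= score
  );
}
```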

5. Extension Author Tools

  • Eval generators: scaffold eval specs from existing test cases
  • pi.registerTool(): eval_run, eval_compare for LLM-driven eval analysis
  • Hooks for running evals on extension changes
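The `pi.registerTool()` wiring could look roughly like the following. The real Pi extension API surface is not shown in this issue, so the `Pi` interface below is a stub invented for illustration, and the handlers are placeholders standing in for the actual runner:

```typescript
// Stub of the extension API for illustration only; the real pi object
// and registerTool() signature may differ.
interface Pi {
  registerTool(
    name: string,
    handler: (args: Record<string, string>) => string
  ): void;
}

function activate(pi: Pi): void {
  // eval_run / eval_compare are the tool names proposed above.
  pi.registerTool("eval_run", (args) => `ran suite ${args.suite}`);
  pi.registerTool(
    "eval_compare",
    (args) => `compared ${args.baseline} vs ${args.candidate}`
  );
}
```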

Pi Extension API Integration

| API Surface | Usage |
| --- | --- |
| pi.registerCommand() | /eval with subcommands |
| pi.registerTool() | eval_run, eval_compare, eval_report |
| File system | Eval specs in .pi/evals/, results in ~/.pi/eval-harness/ |
| Bash execution | Run evals in isolated subprocess |

Implementation Notes

  • Eval isolation: each eval runs in a fresh temp directory with controlled state
  • Scoring: start with binary pass/fail, add rubric scoring later
  • Storage: results in ~/.pi/eval-harness/runs/<run-id>/results.json
  • Consider: can evals use Pi's own session API to replay prompts? Would make evals more realistic.
  • Audience is primarily extension authors, not end users

Prior Art

  • ECC eval-harness: formal evaluation framework with eval-driven development methodology
  • ECC ai-regression-testing: regression testing for AI-assisted development
  • Anthropic's own eval frameworks for Claude
  • No existing Pi extension provides structured evaluation

Effort Estimate

Medium to high. The eval runner and isolation logic are the bulk of the work. Scoring and reporting are straightforward. CI integration adds value but can be phased.

Metadata

Labels

extension-idea (New extension package idea for the monorepo)
impact: low (Low impact potential)
