feat: pi-eval-harness — formal evaluation framework for measuring agent quality #85

@MattDevy

Description

Summary

A Pi extension that provides a formal evaluation framework for measuring agent session quality, task completion accuracy, and regression detection. Inspired by ECC's eval-harness skill and eval-driven development (EDD) methodology. Targeted at extension authors and teams who need to verify that agent behavior is improving (or at least not degrading) over time.

Motivation

Extension authors (including ourselves with pi-continuous-learning) need to answer: "Did this change make the agent better or worse?" Without structured evals, quality is measured by vibes. An eval harness provides reproducible, scored benchmarks that can run against any agent configuration, catching regressions before they ship.

Proposed Features

1. Eval Definition Format

  • YAML/JSON eval specs defining:
    • Input: user prompt, initial file state, repo context
    • Expected: files created/modified, test results, specific code patterns
    • Scoring: binary (pass/fail), rubric (0-10), or custom scorer function
  • Eval suites: groups of related evals (e.g., "TypeScript refactoring", "Python testing")
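As a sketch of what such a spec could look like, here is a hypothetical YAML eval; every field name is illustrative, since the schema is not yet fixed:

```yaml
# Hypothetical eval spec; field names are illustrative, not a fixed schema.
name: rename-exported-function
suite: typescript-refactoring
input:
  prompt: "Rename the exported function fetchUser to getUser and update all call sites."
  files:
    src/api.ts: |
      export function fetchUser(id: string) { /* ... */ }
expected:
  files_modified:
    - src/api.ts
  patterns:
    - file: src/api.ts
      must_contain: "export function getUser"
      must_not_contain: "fetchUser"
scoring: binary   # binary | rubric | custom
```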

2. Eval Runner (/eval)

  • /eval run <suite> — run all evals in a suite
  • /eval run <suite> --eval <name> — run a specific eval
  • /eval compare <baseline> <candidate> — compare two runs
  • Runs evals in isolated environment (temp directory, clean git state)
  • Parallel execution where evals are independent
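The isolation requirement above can be sketched as follows; this is a minimal Node.js sketch assuming a sequential runner, where `EvalSpec` and `runSuite` are hypothetical names, not part of any existing Pi API:

```typescript
import { mkdtempSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

type EvalResult = { name: string; passed: boolean };

// Hypothetical shape of a single eval: given a fresh working directory,
// set up state, exercise the agent, and return pass/fail.
interface EvalSpec {
  name: string;
  run: (workDir: string) => boolean;
}

// Run each eval in its own temp directory so no state leaks between evals,
// and clean up even when an eval throws.
function runSuite(evals: EvalSpec[]): EvalResult[] {
  return evals.map((spec) => {
    const workDir = mkdtempSync(join(tmpdir(), "pi-eval-"));
    try {
      return { name: spec.name, passed: spec.run(workDir) };
    } finally {
      rmSync(workDir, { recursive: true, force: true });
    }
  });
}
```

Parallel execution would replace the `map` with a worker pool, but only for evals declared independent.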

3. Scoring and Reporting

  • Per-eval scores with pass/fail/score breakdown
  • Suite-level aggregate scores
  • Diff reports: "3 regressions, 2 improvements, 15 unchanged"
  • Historical trend tracking across runs
  • Export as JSON for CI integration
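The diff report above reduces to a per-eval score comparison. A minimal sketch, assuming scores are keyed by eval name (`compareRuns` is a hypothetical helper, not an existing API):

```typescript
type Scores = Record<string, number>;

interface DiffReport {
  regressions: string[];
  improvements: string[];
  unchanged: string[];
}

// Classify each eval present in both runs by its score delta.
// Evals added or removed between runs are not classified here.
function compareRuns(baseline: Scores, candidate: Scores): DiffReport {
  const report: DiffReport = { regressions: [], improvements: [], unchanged: [] };
  for (const name of Object.keys(baseline)) {
    if (!(name in candidate)) continue;
    const delta = candidate[name] - baseline[name];
    if (delta < 0) report.regressions.push(name);
    else if (delta > 0) report.improvements.push(name);
    else report.unchanged.push(name);
  }
  return report;
}
```

The counts in the report ("3 regressions, 2 improvements, 15 unchanged") fall directly out of the three array lengths, and the same structure serializes to JSON for CI.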

4. Regression Detection

  • Baseline recording: /eval baseline <suite> saves current scores
  • CI integration: fail the build if any eval regresses below baseline
  • Alert on score degradation trends (even if above baseline)
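The CI gate is a simple predicate over recorded baseline scores; a sketch, assuming a missing eval in the current run counts as a regression (`passesBaseline` is an illustrative name):

```typescript
// Returns false (fail the build) if any eval scores below its baseline.
// An eval missing from the current run is treated as scoring 0.
function passesBaseline(
  baseline: Record<string, number>,
  current: Record<string, number>
): boolean {
  return Object.entries(baseline).every(
    ([name, score]) => (current[name] ?? 0) >= score
  );
}
```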

5. Extension Author Tools

  • Eval generators: scaffold eval specs from existing test cases
  • pi.registerTool(): eval_run, eval_compare for LLM-driven eval analysis
  • Hooks for running evals on extension changes
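The `pi.registerTool()` wiring could look roughly like the following. The real Pi extension API surface is not shown in this issue, so the `Pi` interface below is a stub invented for illustration, and the handlers are placeholders standing in for the actual runner:

```typescript
// Stub of the extension API for illustration only; the real pi object
// and registerTool() signature may differ.
interface Pi {
  registerTool(
    name: string,
    handler: (args: Record<string, string>) => string
  ): void;
}

function activate(pi: Pi): void {
  // eval_run / eval_compare are the tool names proposed above.
  pi.registerTool("eval_run", (args) => `ran suite ${args.suite}`);
  pi.registerTool(
    "eval_compare",
    (args) => `compared ${args.baseline} vs ${args.candidate}`
  );
}
```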

Pi Extension API Integration

| API Surface | Usage |
| --- | --- |
| pi.registerCommand() | /eval with subcommands |
| pi.registerTool() | eval_run, eval_compare, eval_report |
| File system | Eval specs in .pi/evals/, results in ~/.pi/eval-harness/ |
| Bash execution | Run evals in isolated subprocess |

Implementation Notes

  • Eval isolation: each eval runs in a fresh temp directory with controlled state
  • Scoring: start with binary pass/fail, add rubric scoring later
  • Storage: results in ~/.pi/eval-harness/runs/<run-id>/results.json
  • Consider: can evals use Pi's own session API to replay prompts? Would make evals more realistic.
  • Audience is primarily extension authors, not end users

Prior Art

  • ECC eval-harness: formal evaluation framework with eval-driven development methodology
  • ECC ai-regression-testing: regression testing for AI-assisted development
  • Anthropic's own eval frameworks for Claude
  • No existing Pi extension provides structured evaluation

Effort Estimate

Medium to high. The eval runner and isolation logic are the bulk of the work. Scoring and reporting are straightforward. CI integration adds value but can be phased.

Metadata

Labels

extension-idea (New extension package idea for the monorepo)
impact: low (Low impact potential)
