feat: add eval dataset and test runner infrastructure #43

dtsong · 2026-01-24T19:13:29Z

Closes #33

Summary

Complete evaluation framework for PR review agent
CLI runner with precision/recall/F1 metrics
6 synthetic test cases covering security, reliability, and quality issues
Integration with the existing CLI via a new --eval flag and a pr-review-eval script entry point

Changes

New: evals/runner.py - CLI evaluation runner
New: evals/scoring.py - Metrics calculation engine
New: evals/cases/*.yaml - 6 synthetic test case definitions
New: evals/diffs/*.patch - Sample diff files for testing
New: tests/test_evals.py - Comprehensive test coverage
Modified: src/pr_review_agent/main.py - Added --eval flag
Modified: pyproject.toml - Added pr-review-eval script
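
For reviewers less familiar with the metrics that evals/scoring.py reports, here is a minimal sketch of how precision, recall, and F1 can be computed from a set of expected issues and a set of reported issues. The names `Metrics` and `score_findings`, and the choice to represent an issue as a `(file, category)` tuple, are assumptions made for illustration; they are not taken from the code in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    precision: float
    recall: float
    f1: float


def score_findings(expected: set, reported: set) -> Metrics:
    """Hypothetical helper: compare reported issues against expected ones.

    Both arguments are sets of hashable issue identifiers, for example
    (file, category) tuples. This representation is an assumption.
    """
    true_positives = len(expected & reported)

    # Precision: share of reported issues that were actually expected.
    precision = true_positives / len(reported) if reported else 0.0
    # Recall: share of expected issues that the agent reported.
    recall = true_positives / len(expected) if expected else 0.0

    # F1: harmonic mean of precision and recall, defined as 0 when both are 0.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Metrics(precision, recall, f1)
```

Returning 0.0 for empty inputs is one possible convention; the actual scoring module may treat empty sets differently.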

Test Plan

All existing tests pass (308 tests, 96.5% coverage)
New evaluation tests pass with full coverage
CLI integration works: uv run pr-review-eval --help
Linting passes (ruff clean)
Can load test suites and calculate metrics

Usage Example

# Run evaluation on synthetic test cases
uv run pr-review-eval --suite evals/cases/

# Save detailed results
uv run pr-review-eval --suite evals/cases/ --output results.json --verbose
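
The flags used above (`--suite`, `--output`, `--verbose`) could be wired up roughly as in the sketch below. This is an assumption about the shape of the parser, not the code in evals/runner.py: the function name `build_parser`, the help strings, and the decision to make `--suite` required are all hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for the pr-review-eval entry point."""
    parser = argparse.ArgumentParser(
        prog="pr-review-eval",
        description="Run the PR review agent against an evaluation suite.",
    )
    # Directory containing the YAML case definitions (assumed to be required).
    parser.add_argument("--suite", required=True, help="path to the test case directory")
    # Optional file for detailed results; None means no file is written.
    parser.add_argument("--output", default=None, help="write detailed results to this JSON file")
    # Boolean switch, False unless the flag is present.
    parser.add_argument("--verbose", action="store_true", help="print per-case details")
    return parser
```

With this sketch, `build_parser().parse_args(["--suite", "evals/cases/"])` yields a namespace whose `output` is `None` and whose `verbose` is `False`.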

Test Cases Included

SQL injection - High confidence security issue
XSS vulnerability - Template escaping removal
Missing error handling - Payment processing reliability
Unused import - Code quality issue
Large function - Maintainability concern
Hardcoded secrets - Security configuration issue
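
Since the suite spans several cases, a suite-level score has to combine per-case results somehow. One common option is micro-averaging: sum the true positives, false positives, and false negatives over all cases, then compute the metrics once. The sketch below assumes that approach; `CaseResult`, its field names, and `micro_average` are hypothetical, and the PR does not state which averaging method the runner uses.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Hypothetical per-case outcome; field names are assumptions."""
    name: str
    true_positives: int
    false_positives: int
    false_negatives: int


def micro_average(results: list) -> dict:
    """Pool counts across all cases, then compute precision, recall, and F1."""
    tp = sum(r.true_positives for r in results)
    fp = sum(r.false_positives for r in results)
    fn = sum(r.false_negatives for r in results)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Micro-averaging weights every issue equally, so a case with many expected issues influences the score more than a case with one. Averaging the per-case metrics instead (macro-averaging) would weight every case equally.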

🤖 Generated with Claude Code

Implements #33 with comprehensive evaluation framework:

## Core Features
- CLI eval runner with `pr-review-eval` script
- Precision, recall, F1, confidence scoring metrics
- YAML fixture format for test case definitions
- Synthetic test diffs covering common issues

## Files Added
- `evals/runner.py` - CLI evaluation runner
- `evals/scoring.py` - Metrics calculation
- `evals/cases/*.yaml` - 6 synthetic test cases
- `evals/diffs/*.patch` - Sample diff files
- `tests/test_evals.py` - Full test coverage

## Integration
- Added `--eval` flag to main CLI
- New `pr-review-eval` script entry point
- Package includes evals module

## Test Cases
Synthetic fixtures for: SQL injection, XSS, missing error handling, unused imports, large functions, hardcoded secrets

## Acceptance Criteria Met
✓ `uv run pr-review-eval --suite evals/cases/` produces scoring report
✓ 6+ synthetic test fixtures covering different issue types
✓ Precision/recall/F1 scoring implemented
✓ CLI integration with `--eval` flag
✓ 96.5% test coverage maintained

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

dtsong added the evals (Evaluation and testing infrastructure) and backlog (Backlog items for future implementation) labels Jan 24, 2026

dtsong merged commit b177fb4 into main Jan 24, 2026
2 checks passed

dtsong deleted the feat/33-eval-dataset-test-runner branch January 24, 2026 19:38


2 participants
