@dtsong commented on Jan 24, 2026

Closes #33

Summary

  • Evaluation framework for the PR review agent
  • CLI runner that reports precision, recall, and F1 metrics
  • 6 synthetic test cases covering security, reliability, and quality issues
  • Integration with the existing CLI through a new --eval flag and a pr-review-eval script

Changes

  • New: evals/runner.py - CLI evaluation runner
  • New: evals/scoring.py - Metrics calculation engine
  • New: evals/cases/*.yaml - 6 synthetic test case definitions
  • New: evals/diffs/*.patch - Sample diff files for testing
  • New: tests/test_evals.py - Comprehensive test coverage
  • Modified: src/pr_review_agent/main.py - Added --eval flag
  • Modified: pyproject.toml - Added pr-review-eval script
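
For reviewers unfamiliar with the metrics, here is a minimal sketch of the kind of calculation evals/scoring.py is described as performing. The `Metrics` class and `compute_metrics` function are hypothetical names chosen for illustration, not taken from the PR, and returning 0.0 for an empty denominator is an assumed convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """Hypothetical container for the three reported scores."""
    precision: float
    recall: float
    f1: float


def compute_metrics(true_positives: int, false_positives: int, false_negatives: int) -> Metrics:
    """Compute precision, recall, and F1 from raw match counts.

    Assumption: an empty denominator yields 0.0 instead of raising.
    """
    predicted = true_positives + false_positives  # everything the agent flagged
    actual = true_positives + false_negatives     # everything it should have flagged
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Metrics(precision, recall, f1)
```

With 3 correct findings, 1 spurious finding, and 2 missed issues, `compute_metrics(3, 1, 2)` gives a precision of 0.75 and a recall of 0.6.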

Test Plan

  • All existing tests pass (308 tests, 96.5% coverage)
  • New evaluation tests pass with full coverage
  • CLI integration works: uv run pr-review-eval --help
  • Linting passes (ruff clean)
  • Can load test suites and calculate metrics

Usage Example

# Run evaluation on synthetic test cases
uv run pr-review-eval --suite evals/cases/

# Save detailed results
uv run pr-review-eval --suite evals/cases/ --output results.json --verbose
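
As a rough illustration of how a runner could accept the flags used above, the sketch below wires --suite, --output, and --verbose into argparse. Only the flag names come from this PR. The function names, the choice to make --suite required, the *.yaml discovery, and the shape of the JSON output are assumptions, and the step that actually runs and scores the agent is left as a placeholder.

```python
import argparse
import json
from pathlib import Path
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    # Flag names match the usage example; everything else is assumed.
    parser = argparse.ArgumentParser(prog="pr-review-eval")
    parser.add_argument("--suite", type=Path, required=True,
                        help="Directory containing YAML test cases")
    parser.add_argument("--output", type=Path, default=None,
                        help="Optional path for a JSON results file")
    parser.add_argument("--verbose", action="store_true",
                        help="Print per-case details")
    return parser


def discover_cases(suite_dir: Path) -> list:
    """Return the YAML case files in the suite directory, sorted by name."""
    return sorted(suite_dir.glob("*.yaml"))


def main(argv: Optional[Sequence[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    cases = discover_cases(args.suite)
    # Placeholder: a real runner would invoke the review agent on each
    # case and score its findings here.
    results = {"suite": str(args.suite), "cases": [p.name for p in cases]}
    if args.verbose:
        for name in results["cases"]:
            print(f"found case: {name}")
    if args.output is not None:
        args.output.write_text(json.dumps(results, indent=2))
    return 0
```

Accepting `argv` as a parameter keeps `main` callable from tests without touching `sys.argv`.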

Test Cases Included

  1. SQL injection - High confidence security issue
  2. XSS vulnerability - Template escaping removal
  3. Missing error handling - Payment processing reliability
  4. Unused import - Code quality issue
  5. Large function - Maintainability concern
  6. Hardcoded secrets - Security configuration issue
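
To make the scoring of a single case concrete, here is one possible way to compare the findings a case expects against the findings the agent reports. This is a sketch under stated assumptions: the `Finding` fields, the `match_findings` helper, and the two-line tolerance are all hypothetical and do not describe the fixture schema in evals/cases/.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical finding record; the real fixture schema may differ."""
    category: str  # e.g. "security", "reliability", "quality"
    file: str
    line: int


def match_findings(expected, reported, line_tolerance=2):
    """Greedily match reported findings to expected ones.

    Returns (true_positives, false_positives, false_negatives).
    Each expected finding can be matched at most once.
    """
    unmatched = list(expected)
    true_positives = 0
    false_positives = 0
    for r in reported:
        hit = next(
            (e for e in unmatched
             if e.category == r.category
             and e.file == r.file
             and abs(e.line - r.line) <= line_tolerance),
            None,
        )
        if hit is None:
            false_positives += 1
        else:
            unmatched.remove(hit)
            true_positives += 1
    # Whatever was expected but never matched counts as missed.
    return true_positives, false_positives, len(unmatched)
```

The resulting counts are exactly the inputs that precision, recall, and F1 are computed from.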

🤖 Generated with Claude Code

Implements #33 with comprehensive evaluation framework:

## Core Features
- CLI eval runner with `pr-review-eval` script
- Precision, recall, F1, confidence scoring metrics
- YAML fixture format for test case definitions
- Synthetic test diffs covering common issues

## Files Added
- `evals/runner.py` - CLI evaluation runner
- `evals/scoring.py` - Metrics calculation
- `evals/cases/*.yaml` - 6 synthetic test cases
- `evals/diffs/*.patch` - Sample diff files
- `tests/test_evals.py` - Full test coverage

## Integration
- Added `--eval` flag to main CLI
- New `pr-review-eval` script entry point
- Package includes evals module

## Test Cases
Synthetic fixtures for: SQL injection, XSS, missing error handling,
unused imports, large functions, hardcoded secrets

## Acceptance Criteria Met
✓ `uv run pr-review-eval --suite evals/cases/` produces scoring report
✓ 6+ synthetic test fixtures covering different issue types
✓ Precision/recall/F1 scoring implemented
✓ CLI integration with `--eval` flag
✓ 96.5% test coverage maintained

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@dtsong added the evals (Evaluation and testing infrastructure) and backlog (Backlog items for future implementation) labels on Jan 24, 2026
@dtsong merged commit b177fb4 into main on Jan 24, 2026
2 checks passed
@dtsong deleted the feat/33-eval-dataset-test-runner branch on January 24, 2026 at 19:38

Development

Successfully merging this pull request may close these issues.

Eval dataset & test runner