Verdict accuracy benchmarking and feedback dashboard for repo-gardener.
Live dashboard: https://gardener-report.pages.dev
Gardener posts verdicts (ALIGNED, NEEDS_REVIEW, CONFLICT, etc.) on PRs and issues. But how accurate are those verdicts? This tool measures that by comparing gardener's calls against what actually happened.
One number: Verdict Accuracy % — the percentage of gardener verdicts that match the maintainer's actual decision.
Each PR gets a verdict (what gardener said) and an outcome (what actually happened):
| Gardener said | Merged cleanly | Merged after revision | Maintainer rejected |
|---|---|---|---|
| ALIGNED | Correct | Partial | Wrong |
| NEEDS_REVIEW | Wrong (false alarm) | Correct | Correct |
| CONFLICT | Wrong (false alarm) | Correct | Correct |
Excluded from scoring:
- Author-withdrawn PRs — the author closed their own PR. Gardener can't predict humans leaving.
- Pending PRs — still open, no outcome yet. Scored when re-run later.
Formula: (correct + 0.5 × partial) / scorable × 100
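The scoring matrix and formula above can be sketched in a few lines of Python. This is a minimal illustration, not the actual `src/score.py` API; the verdict and outcome names mirror the table, and the function name is hypothetical.

```python
# Maps (verdict, outcome) -> score category, mirroring the table above.
SCORE_MATRIX = {
    ("ALIGNED", "merged_clean"): "correct",
    ("ALIGNED", "merged_revised"): "partial",
    ("ALIGNED", "rejected"): "wrong",
    ("NEEDS_REVIEW", "merged_clean"): "wrong",
    ("NEEDS_REVIEW", "merged_revised"): "correct",
    ("NEEDS_REVIEW", "rejected"): "correct",
    ("CONFLICT", "merged_clean"): "wrong",
    ("CONFLICT", "merged_revised"): "correct",
    ("CONFLICT", "rejected"): "correct",
}

def accuracy(pairs):
    """pairs: (verdict, outcome) tuples; withdrawn/pending PRs already excluded."""
    results = [SCORE_MATRIX[p] for p in pairs]
    correct = results.count("correct")
    partial = results.count("partial")
    return (correct + 0.5 * partial) / len(results) * 100

# One correct, one partial, one wrong: (1 + 0.5) / 3 * 100 = 50.0
```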
```bash
# Score a repo where gardener has commented
python3 src/score.py --repo owner/repo

# Build the HTML dashboard
python3 src/build_dashboard.py --repo owner/repo

# Run both (fetch + score + build + deploy)
./bench.sh owner/repo
```

Each scored repo gets a directory under `reports/`:
```
reports/
  paperclipai-paperclip/
    2026-04-12/
      accuracy.json    # Structured accuracy data
      pr_scores.json   # Per-PR scoring detail
      dashboard.html   # Interactive HTML report
```
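Because each run is a dated directory, the newest report for a repo can be found by sorting the ISO-dated names. A hedged sketch, assuming the slug format shown above (`owner/repo` flattened to `owner-repo`); the schema inside `accuracy.json` is not specified here:

```python
import json
from pathlib import Path

def latest_report(repo_slug, root="reports"):
    """Return the parsed accuracy.json from the newest dated run, or None."""
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    runs = sorted(Path(root, repo_slug).glob("*-*-*"))
    if not runs:
        return None
    return json.loads((runs[-1] / "accuracy.json").read_text())
```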
Live dashboards are deployed to Cloudflare Pages:
- paperclipai/paperclip: https://gardener-report.pages.dev
The most valuable output is the wrong calls list. Each wrong call becomes a case study for improving gardener. The flow:
1. `gardener-bench` scores a repo → finds wrong/partial calls
2. Wrong calls are filed as issues on repo-gardener with the `accuracy-report` label
3. Humans review and decide what to change in gardener's prompts or tree-reading
4. Re-run scoring after changes → watch accuracy % go up
This is a human-reviewed feedback loop, not auto-applied.
Beyond accuracy, the dashboard tracks how humans interact with gardener:
- Engaging replies — comments that reference, quote, or respond to gardener's review
- Signal detection — mentions-gardener, references-verdict, quotes-gardener, addresses-review
- Engagement rate — what % of threads have human engagement with gardener
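The signal categories above could be detected with simple pattern matching. An illustrative sketch only; the patterns here are assumptions, and the real detector in this repo may use different heuristics:

```python
import re

# Hypothetical patterns, one per signal category from the list above.
SIGNALS = {
    "mentions-gardener": re.compile(r"@?repo-gardener|\bgardener\b", re.I),
    "references-verdict": re.compile(r"\b(ALIGNED|NEEDS_REVIEW|CONFLICT)\b"),
    "quotes-gardener": re.compile(r"^>\s", re.M),  # markdown blockquote
    "addresses-review": re.compile(r"\b(addressed|fixed|updated|done)\b", re.I),
}

def detect_signals(comment):
    """Return the list of signal names whose pattern matches the comment."""
    return [name for name, pat in SIGNALS.items() if pat.search(comment)]
```

Engagement rate would then be the fraction of threads where at least one human comment yields a non-empty signal list.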
- repo-gardener — the review bot itself
- Feature proposal: `repo-gardener score` — integrating scoring into the gardener CLI
- First accuracy report: paperclipai/paperclip