Verdict accuracy benchmarking and feedback dashboard for repo-gardener.
Live dashboard: https://gardener-report.pages.dev
Gardener posts verdicts (ALIGNED, NEEDS_REVIEW, CONFLICT, etc.) on PRs and issues. But how accurate are those verdicts? This tool measures that by comparing gardener's calls against what actually happened.
One number: Verdict Accuracy % — the percentage of gardener verdicts that match the maintainer's actual decision.
Each PR gets a verdict (what gardener said) and an outcome (what actually happened):
| Gardener said | Merged cleanly | Merged after revision | Maintainer rejected |
|---|---|---|---|
| ALIGNED | Correct | Partial | Wrong |
| NEEDS_REVIEW | Wrong (false alarm) | Correct | Correct |
| CONFLICT | Wrong (false alarm) | Correct | Correct |
Excluded from scoring:
- Author-withdrawn PRs — the author closed their own PR. Gardener can't predict humans leaving.
- Pending PRs — still open, no outcome yet. Scored when re-run later.
Formula: (correct + 0.5 × partial) / scorable × 100
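The scoring matrix and formula above can be sketched in a few lines of Python. This is a minimal illustration, not the actual `src/score.py` API; the verdict and outcome names mirror the table, and the function name is hypothetical.

```python
# Maps (verdict, outcome) -> score category, mirroring the table above.
SCORE_MATRIX = {
    ("ALIGNED", "merged_clean"): "correct",
    ("ALIGNED", "merged_revised"): "partial",
    ("ALIGNED", "rejected"): "wrong",
    ("NEEDS_REVIEW", "merged_clean"): "wrong",
    ("NEEDS_REVIEW", "merged_revised"): "correct",
    ("NEEDS_REVIEW", "rejected"): "correct",
    ("CONFLICT", "merged_clean"): "wrong",
    ("CONFLICT", "merged_revised"): "correct",
    ("CONFLICT", "rejected"): "correct",
}

def accuracy(pairs):
    """pairs: (verdict, outcome) tuples; withdrawn/pending PRs already excluded."""
    results = [SCORE_MATRIX[p] for p in pairs]
    correct = results.count("correct")
    partial = results.count("partial")
    return (correct + 0.5 * partial) / len(results) * 100

# One correct, one partial, one wrong: (1 + 0.5) / 3 * 100 = 50.0
```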
```bash
# Score a repo where gardener has commented
python3 src/score.py --repo owner/repo

# Build the HTML dashboard
python3 src/build_dashboard.py --repo owner/repo

# Run both (fetch + score + build + deploy)
./bench.sh owner/repo
```

Each scored repo gets a directory under `reports/`:
```
reports/
  paperclipai-paperclip/
    2026-04-12/
      accuracy.json    # Structured accuracy data
      pr_scores.json   # Per-PR scoring detail
      dashboard.html   # Interactive HTML report
```
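Because each run is a dated directory, the newest report for a repo can be found by sorting the ISO-dated names. A hedged sketch, assuming the slug format shown above (`owner/repo` flattened to `owner-repo`); the schema inside `accuracy.json` is not specified here:

```python
import json
from pathlib import Path

def latest_report(repo_slug, root="reports"):
    """Return the parsed accuracy.json from the newest dated run, or None."""
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    runs = sorted(Path(root, repo_slug).glob("*-*-*"))
    if not runs:
        return None
    return json.loads((runs[-1] / "accuracy.json").read_text())
```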
Live dashboards are deployed to Cloudflare Pages:
- paperclipai/paperclip: https://gardener-report.pages.dev
The most valuable output is the wrong calls list. Each wrong call becomes a case study for improving gardener. The flow:
1. `gardener-bench` scores a repo → finds wrong/partial calls
2. Wrong calls are filed as issues on repo-gardener with the `accuracy-report` label
3. Humans review and decide what to change in gardener's prompts or tree-reading
4. Re-run scoring after changes → watch accuracy % go up
This is a human-reviewed feedback loop, not auto-applied.
Beyond accuracy, the dashboard tracks how humans interact with gardener:
- Engaging replies — comments that reference, quote, or respond to gardener's review
- Signal detection — mentions-gardener, references-verdict, quotes-gardener, addresses-review
- Engagement rate — what % of threads have human engagement with gardener
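The signal categories above could be detected with simple pattern matching. An illustrative sketch only; the patterns here are assumptions, and the real detector in this repo may use different heuristics:

```python
import re

# Hypothetical patterns, one per signal category from the list above.
SIGNALS = {
    "mentions-gardener": re.compile(r"@?repo-gardener|\bgardener\b", re.I),
    "references-verdict": re.compile(r"\b(ALIGNED|NEEDS_REVIEW|CONFLICT)\b"),
    "quotes-gardener": re.compile(r"^>\s", re.M),  # markdown blockquote
    "addresses-review": re.compile(r"\b(addressed|fixed|updated|done)\b", re.I),
}

def detect_signals(comment):
    """Return the list of signal names whose pattern matches the comment."""
    return [name for name, pat in SIGNALS.items() if pat.search(comment)]
```

Engagement rate would then be the fraction of threads where at least one human comment yields a non-empty signal list.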
- repo-gardener — the review bot itself
- Feature proposal: `repo-gardener score` — integrating scoring into the gardener CLI
- First accuracy report: paperclipai/paperclip