agent-team-foundation/gardener-bench
gardener-bench


Verdict accuracy benchmarking and feedback dashboard for repo-gardener.

Live dashboard: https://gardener-report.pages.dev

What this does

Gardener posts verdicts (ALIGNED, NEEDS_REVIEW, CONFLICT, etc.) on PRs and issues. But how accurate are those verdicts? This tool measures that by comparing gardener's calls against what actually happened.

One number: Verdict Accuracy % — the percentage of gardener verdicts that match the maintainer's actual decision.

How scoring works

Each PR gets a verdict (what gardener said) and an outcome (what actually happened):

| Gardener said | Merged cleanly      | Merged after revision | Maintainer rejected |
|---------------|---------------------|-----------------------|---------------------|
| ALIGNED       | Correct             | Partial               | Wrong               |
| NEEDS_REVIEW  | Wrong (false alarm) | Correct               | Correct             |
| CONFLICT      | Wrong (false alarm) | Correct               | Correct             |

Excluded from scoring:

  • Author-withdrawn PRs — the author closed their own PR. Gardener can't predict humans leaving.
  • Pending PRs — still open, no outcome yet. Scored when re-run later.

Formula: (correct + 0.5 × partial) / scorable × 100
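The matrix and formula above can be sketched in Python. This is an illustrative sketch, not the actual `src/score.py` implementation; the outcome labels (`merged_clean`, `merged_revised`, `rejected`, `author_withdrawn`, `pending`) are hypothetical names chosen for the example.

```python
# Illustrative sketch of the scoring rule; outcome labels are assumed,
# and src/score.py may implement this differently.
SCORE_MATRIX = {
    # (verdict, outcome) -> credit
    ("ALIGNED", "merged_clean"): 1.0,
    ("ALIGNED", "merged_revised"): 0.5,      # partial credit
    ("ALIGNED", "rejected"): 0.0,
    ("NEEDS_REVIEW", "merged_clean"): 0.0,   # false alarm
    ("NEEDS_REVIEW", "merged_revised"): 1.0,
    ("NEEDS_REVIEW", "rejected"): 1.0,
    ("CONFLICT", "merged_clean"): 0.0,       # false alarm
    ("CONFLICT", "merged_revised"): 1.0,
    ("CONFLICT", "rejected"): 1.0,
}

# Outcomes excluded from scoring entirely
EXCLUDED = {"author_withdrawn", "pending"}

def verdict_accuracy(calls):
    """calls: list of (verdict, outcome) pairs. Returns accuracy %."""
    scorable = [c for c in calls if c[1] not in EXCLUDED]
    if not scorable:
        return 0.0
    credit = sum(SCORE_MATRIX[c] for c in scorable)
    return credit / len(scorable) * 100
```

For example, one correct call, one partial, and one false alarm over three scorable PRs gives (1 + 0.5 + 0) / 3 × 100 = 50%.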

Usage

# Score a repo where gardener has commented
python3 src/score.py --repo owner/repo

# Build the HTML dashboard
python3 src/build_dashboard.py --repo owner/repo

# Run both (fetch + score + build + deploy)
./bench.sh owner/repo

Reports

Each scored repo gets a directory under reports/:

reports/
  paperclipai-paperclip/
    2026-04-12/
      accuracy.json      # Structured accuracy data
      pr_scores.json     # Per-PR scoring detail
      dashboard.html     # Interactive HTML report
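Because run directories are named `YYYY-MM-DD`, they sort lexicographically in date order, so the latest run can be found without parsing dates. A minimal sketch (the helper name `latest_report` is hypothetical, not part of the repo's code):

```python
from pathlib import Path

def latest_report(repo_slug, reports_dir="reports"):
    """Return the newest dated run directory for a scored repo.

    Relies on YYYY-MM-DD directory names sorting lexicographically
    in chronological order.
    """
    runs = sorted(p for p in Path(reports_dir, repo_slug).iterdir() if p.is_dir())
    if not runs:
        raise FileNotFoundError(f"no runs under {reports_dir}/{repo_slug}")
    return runs[-1]
```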

Dashboard

Live dashboards are deployed to Cloudflare Pages; the current report lives at https://gardener-report.pages.dev.

Feedback loop

The most valuable output is the wrong calls list. Each wrong call becomes a case study for improving gardener. The flow:

  1. gardener-bench scores a repo → finds wrong/partial calls
  2. Wrong calls are filed as issues on repo-gardener with the accuracy-report label
  3. Humans review and decide what to change in gardener's prompts or tree-reading
  4. Re-run scoring after changes → watch accuracy % go up

This is a human-reviewed feedback loop, not auto-applied.

Engagement tracking

Beyond accuracy, the dashboard tracks how humans interact with gardener:

  • Engaging replies — comments that reference, quote, or respond to gardener's review
  • Signal detection — mentions-gardener, references-verdict, quotes-gardener, addresses-review
  • Engagement rate — what % of threads have human engagement with gardener
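The signals above could be detected with simple pattern matching. This is a sketch under assumed heuristics; the regexes and thread representation are illustrative, not the dashboard's actual detection logic:

```python
import re

# Assumed signal patterns -- the dashboard's real heuristics are not
# shown in this README.
SIGNALS = {
    "mentions-gardener": re.compile(r"gardener", re.I),
    "references-verdict": re.compile(r"\b(ALIGNED|NEEDS_REVIEW|CONFLICT)\b"),
    "quotes-gardener": re.compile(r"^>", re.M),  # markdown quote line
}

def detect_signals(comment_body):
    """Return the list of signal names present in one comment."""
    return [name for name, pat in SIGNALS.items() if pat.search(comment_body)]

def engagement_rate(threads):
    """threads: list of threads, each a list of human comment bodies.
    A thread counts as engaged if any comment carries a signal."""
    if not threads:
        return 0.0
    engaged = sum(1 for t in threads if any(detect_signals(c) for c in t))
    return engaged / len(threads) * 100
```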

