Measures the latency cost of using GitHub's REST API as a state coordination mechanism in the Amber issue-handler workflow.
Using GitHub (issues, PRs, labels, checks) as the state machine for agentic workflows introduces measurable overhead. This benchmark quantifies:
- Per-step API latency — how long each GitHub REST call takes
- Polling tax — time wasted waiting for GitHub to compute state (e.g. PR mergeable)
- Rate limit pressure — how fast you burn through 5,000 req/hr under concurrent agents
- Concurrency degradation — how latency changes as parallel agents increase
- Event propagation delay — the gap between "state changed" and "state visible via API"
The goal is to put numbers behind two claims: that timer-based polling should be replaced with event-driven webhooks, and that GitHub's API overhead is a real bottleneck, not noise.
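The last measurement is the subtlest, so here is a minimal sketch of the idea (assuming the `requests` library; `measure_propagation` and the 250 ms retry cadence are illustrative, not part of the benchmark code): write a piece of state, then time how long until a read returns it.

```python
import os
import time

import requests

API = "https://api.github.com"
HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

def measure_propagation(org: str, repo: str, issue: int, label: str) -> float:
    """Apply a label, then poll until the read path reflects it.

    Returns seconds between the write completing and the new state
    becoming visible via GET: the event propagation delay.
    """
    # WRITE: add the label (5 points against the secondary limit).
    labels_url = f"{API}/repos/{org}/{repo}/issues/{issue}/labels"
    requests.post(labels_url, headers=HEADERS,
                  json={"labels": [label]}, timeout=30).raise_for_status()
    written_at = time.monotonic()

    # READ: re-fetch the issue until the label shows up.
    issue_url = f"{API}/repos/{org}/{repo}/issues/{issue}"
    while True:
        resp = requests.get(issue_url, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        if any(l["name"] == label for l in resp.json()["labels"]):
            return time.monotonic() - written_at
        time.sleep(0.25)  # short backoff; every retry costs another request
```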
| Mermaid step | GitHub API call | Type |
|---|---|---|
| A: Issue + label | POST /issues + POST /labels | WRITE |
| B: Workflow dispatch | POST /dispatches | WRITE |
| C: ambient-action | (simulated delay) | COMPUTE |
| D: ACP session | (simulated delay) | COMPUTE |
| E: Amber2 workflow | (simulated delay) | COMPUTE |
| F: Load context | GET /repos + GET /issues | READ |
| G: Spec Kit | (simulated delay) | COMPUTE |
| H: Reproduce | (simulated delay) | COMPUTE |
| I: Implement fix | (simulated delay) | COMPUTE |
| J: Lint/test/coverage | POST + PATCH /check-runs | WRITE |
| K: Emit report | POST /issues/comments | WRITE |
| L: Push branch | POST /git/refs + PUT /contents | WRITE |
| M: Create PR | POST /pulls | WRITE |
| M: Poll mergeable | GET /pulls (repeated) | POLL ← event gap |
| N: Review | POST /pulls/reviews | WRITE |
| O: Merge | PUT /pulls/merge | WRITE |
| P: Post-merge learning | POST /issues/comments | WRITE |
| Q: Close issue | PATCH /issues | WRITE |
| Z: Cleanup | DELETE /git/refs | WRITE |
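The one POLL row is where the event gap lives: after POST /pulls, GitHub computes `mergeable` in the background, and GET returns `null` for that field until it finishes. A sketch of timing that wait, reusing the `API`/`HEADERS` constants and imports from the sketch above (function name and retry cadence are illustrative):

```python
def time_mergeable_poll(org: str, repo: str, pr_number: int) -> float:
    """Poll GET /pulls/{n} until `mergeable` is computed.

    The field is null while GitHub's background job runs, then
    true/false. Returns seconds spent waiting: pure poll tax.
    """
    url = f"{API}/repos/{org}/{repo}/pulls/{pr_number}"
    start = time.monotonic()
    while True:
        pr = requests.get(url, headers=HEADERS, timeout=30).json()
        if pr.get("mergeable") is not None:
            return time.monotonic() - start
        time.sleep(1.0)  # back off between checks; each retry is another READ
```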
Each WRITE call costs 5 points against the secondary rate limit (900 pts/min). A single cycle makes ~14 write calls = 70 points. At 900 pts/min, you hit the secondary rate limit at ~12 concurrent cycles/minute.
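The same arithmetic as a tiny, self-checking sketch (the constants come from GitHub's published secondary rate limit rules; the function itself is illustrative):

```python
WRITE_POINTS = 5      # cost of each POST/PATCH/PUT/DELETE against the secondary limit
BUDGET_PER_MIN = 900  # secondary rate limit budget per minute

def max_cycles_per_minute(write_calls_per_cycle: int = 14) -> int:
    """How many full cycles fit in one minute of point budget."""
    points_per_cycle = write_calls_per_cycle * WRITE_POINTS  # 14 * 5 = 70
    return BUDGET_PER_MIN // points_per_cycle                # 900 // 70 = 12

print(max_cycles_per_minute())  # -> 12, matching the estimate above
```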
# Install
uv sync
# Set up test repo (creates repo, labels, workflow)
GITHUB_TOKEN=ghp_xxx GITHUB_ORG=your-test-org \
uv run python setup_repo.py
# Run benchmark (sequential then parallel ramp)
GITHUB_TOKEN=ghp_xxx GITHUB_ORG=your-test-org GITHUB_REPO=gh-api-benchmark \
uv run python benchmark.py
# Analyze results
uv run python analyze.py benchmark_results.db ./reports/

Environment variables:
| Variable | Default | Description |
|---|---|---|
| `GITHUB_TOKEN` | (required) | PAT with repo+workflow+checks |
| `GITHUB_ORG` | (required) | Test org or username |
| `GITHUB_REPO` | gh-api-benchmark | Test repo name |
| `SEQUENTIAL_CYCLES` | 5 | Baseline cycles (no concurrency) |
| `PARALLEL_CYCLES` | 20 | Total parallel cycles |
| `MAX_CONCURRENCY` | 5 | Max simultaneous cycles |
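For example, a heavier run with more parallelism (same invocation pattern as above; the values are arbitrary):

GITHUB_TOKEN=ghp_xxx GITHUB_ORG=your-test-org GITHUB_REPO=gh-api-benchmark \
SEQUENTIAL_CYCLES=10 PARALLEL_CYCLES=40 MAX_CONCURRENCY=8 \
uv run python benchmark.py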
The PAT needs these scopes:

- `repo` — full control of private repos
- `workflow` — update GitHub Actions workflows
- `write:checks` — create and update check runs
SQLite database (`benchmark_results.db`) with three tables:

- `step_results` — every individual API call or simulated step
- `cycle_summary` — per-cycle aggregates (wall time, API time, sim time)
- `rate_limit_snapshots` — periodic rate limit state
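To poke at the raw data directly, here is a sketch of a per-step p99 query over `step_results` (the `step_id` and `duration_ms` column names are assumptions about the schema; check the actual table definitions first):

```python
import sqlite3
import statistics

# Assumed shape: step_results(step_id TEXT, duration_ms REAL, ...).
conn = sqlite3.connect("benchmark_results.db")
rows = conn.execute("SELECT step_id, duration_ms FROM step_results").fetchall()

by_step: dict[str, list[float]] = {}
for step_id, duration_ms in rows:
    by_step.setdefault(step_id, []).append(duration_ms)

for step_id, samples in sorted(by_step.items()):
    # quantiles(n=100) returns 99 cut points; index 98 is the p99.
    p99 = statistics.quantiles(samples, n=100)[98] if len(samples) > 1 else samples[0]
    print(f"{step_id}: p99 = {p99:.0f} ms over {len(samples)} samples")
```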
Analysis script produces:
- `benchmark_report.txt` — human-readable summary
- `step_latency_stats.csv` — per-step p50/p90/p99
- `rate_limit_timeline.csv` — rate limit consumption over time
- `all_step_results.csv` — raw data for external tools
- Poll Tax % — what fraction of cycle time is spent polling for state
- GitHub Tax % — what fraction of non-compute time is API calls
- P99 latency per step — tail latency tells the real story
- Rate limit burn rate — projected req/hr extrapolated from the run
- Concurrency knee — at what parallelism level does latency spike
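In formula terms, the two tax metrics reduce to simple ratios (assuming `cycle_summary` exposes wall, API, poll, and simulated-compute time per cycle; the names below are illustrative):

```python
def poll_tax_pct(wall_s: float, poll_s: float) -> float:
    """Fraction of total cycle time spent polling for state."""
    return 100.0 * poll_s / wall_s

def github_tax_pct(wall_s: float, api_s: float, sim_s: float) -> float:
    """Fraction of non-compute time spent inside GitHub API calls."""
    return 100.0 * api_s / (wall_s - sim_s)

# A 120 s cycle with 90 s of simulated compute and 18 s of API time,
# 9 s of it polling: poll tax 7.5 %, GitHub tax 60 %.
print(poll_tax_pct(120, 9), github_tax_pct(120, 18, 90))
```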
A Next.js app for launching runs, monitoring progress, and comparing results.
cd dashboard
npm install
# Required: pass the same GitHub credentials
GITHUB_TOKEN=ghp_xxx GITHUB_ORG=your-test-org GITHUB_REPO=gh-api-benchmark \
npx next dev -p 3001

See `dashboard/.env.example` for required variables. The dashboard reads from the same `benchmark_results.db` in the project root.
Features:
- Launch benchmark runs with preset or custom configurations
- Live monitoring — polls every 3s while a run is active
- Run detail — per-step latency charts, percentile distributions, rate limit timeline
- Compare — side-by-side metrics across multiple runs
benchmark.py — orchestrator + instrumented GitHub client
├── Phase 1 — sequential baseline (N cycles, 1 at a time)
├── Phase 2 — parallel ramp (1→MAX_CONCURRENCY workers)
└── SQLite writer — WAL mode, commit-per-cycle for crash safety
analyze.py — reads SQLite, produces stats + CSVs
setup_repo.py — one-time test repo initialization
dashboard/ — Next.js app for visualization and run management
├── src/app/api/ — API routes (runs CRUD, benchmark launcher, compare)
└── src/app/page.tsx — single-page app with runs list, detail, and compare views
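"WAL mode, commit-per-cycle" in practice looks roughly like this (a sketch under the same assumed `step_results` schema as above, not the actual benchmark.py code):

```python
import sqlite3

conn = sqlite3.connect("benchmark_results.db")
# WAL lets readers (e.g. the dashboard) run while the benchmark writes,
# and confines damage from a crash to the write-ahead log.
conn.execute("PRAGMA journal_mode=WAL")

def record_cycle(cycle_id: str, steps: list[tuple[str, float]]) -> None:
    """Insert one cycle's step rows in a single transaction, so a
    crash loses at most the in-flight cycle."""
    with conn:  # sqlite3 connection as context manager == one transaction
        conn.executemany(
            "INSERT INTO step_results (cycle_id, step_id, duration_ms) "
            "VALUES (?, ?, ?)",
            [(cycle_id, step_id, dur) for step_id, dur in steps],
        )
```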