Skip to content

Promote dev to main: defer check-suite-success until aggregate complete#1247

Merged
zbigniewsobiecki merged 2 commits intomainfrom
dev
May 1, 2026
Merged

Promote dev to main: defer check-suite-success until aggregate complete#1247
zbigniewsobiecki merged 2 commits intomainfrom
dev

Conversation

@zbigniewsobiecki
Copy link
Copy Markdown
Member

Promotes 1 commit from `dev`:

What

Eliminates the worker-side CI polling layer in `check-suite-success`. The router now defers (skips) when aggregate check_runs aren't all complete; the LAST `check_suite.completed` event for a SHA makes the dispatch decision based on full aggregate state. Mirrors the existing `check-suite-failure.handle` shape — symmetric defer-on-incomplete, no asymmetric polling.

Closes the silent-skip bug where PR #1245 (CI fully green) never had a review run because the success handler dispatched on the FIRST event (CodeQL completing while CI's lint-and-test was still running), the worker bailed after a 2-min polling cap, and the dedup blocked all subsequent legitimate retries for 30 min.

Reviewed and approved on #1246 by @nhopeatall. CI green on dev (3/3 workflows expected — same shape as the prior dangling-image-cleanup promotion).

Risk

Low. Architectural shift, but scoped to a single trigger handler. Tests (955 unit-triggers + 1540 unit-api + 5174 unit-core) all pass. The new defer behavior is a strict subset of valid trigger outcomes: every PR that previously got a review still does (when state is allComplete), and a class of PRs that previously dropped their review now reliably get one on the LAST event.

Diff summary

  • `src/triggers/github/check-suite-success.ts` — three-way fork (defer / review / respond-to-ci); `waitForChecks` exported function + constants deleted; `waitForChecks: true` flag deleted from review TriggerResult.
  • `src/triggers/github/webhook-handler.ts` — worker-side polling block deleted.
  • `src/triggers/github/check-polling.ts` — deleted (no caller).
  • `src/types/index.ts` — `waitForChecks?: boolean` field deleted from TriggerResult.
  • `src/triggers/github/review-dispatch-dedup.ts` — TTL 30 min → 5 min.
  • Tests updated; `check-polling.test.ts` deleted.

🤖 Generated with Claude Code

aaight and others added 2 commits May 1, 2026 14:11
…ne (#1245)

Co-authored-by: Cascade Bot <bot@cascade.dev>
…plete (#1246)

PR #1245 (CI fully green) silently never had a review run. Root cause: the
success handler dispatched on the FIRST check_suite event (CodeQL completing
at 11:37:43) while CI's slower lint-and-test was still in-progress. The
worker spawned, polled 12×10s via `pollWaitForChecks` (`MAX_RETRIES=12 ×
RETRY_DELAY_MS=10s` = 2 min), bailed at 11:39:17 with `statusCode=0` and
`Not all checks passing after polling, skipping agent`. lint-and-test
completed cleanly 41 seconds later. Subsequent legitimate `check_suite-
completed` events for the SHA all hit the dedup at `check-suite-success.ts`
and were silently skipped — the worker's `onBlocked` cleanup ran in the
worker's own (empty) in-memory dedup map, never reaching the router process
that holds the actual claim. Net: zero review for a clean, mergeable PR;
30-minute window of stuck dedup before the entry aged out.

This is not a one-off — it bites every PR where multiple workflows emit
separate check_suites and the slower one takes >2 min after the faster
one completes. CodeQL completing first is the canonical case.

Architectural fix — eliminate worker-side polling. Move the dispatch
decision entirely to the router, gated on aggregate `allComplete`. The LAST
`check_suite.completed` event for a SHA — whichever polarity — makes the
dispatch decision based on full aggregate state. Mirrors the existing
`check-suite-failure.handle` defer-on-incomplete shape.

Three outcomes now in `check-suite-success.handle`:
  - !allComplete → skip with "Not all checks complete yet (X/Y still
                   running): <names>". The next check_suite event re-evaluates.
                   No worker spawned; no dedup claimed.
  - allComplete + anyFailed → dispatch respond-to-ci (existing #1243 fork).
  - allComplete + !anyFailed → dispatch review (no waitForChecks flag).

Eliminates both root causes simultaneously: nothing dispatches prematurely,
nothing has to "release on bail" because nothing bails. The asymmetry
between the success handler (polled) and failure handler (deferred) is gone.

Deleted:
- `src/triggers/github/check-polling.ts` and its test (no caller).
- `waitForChecks` exported function from check-suite-success (worker-side
  polling layer no longer needed; `MAX_RETRIES`/`RETRY_DELAY_MS` constants
  gone with it).
- `waitForChecks?: boolean` field from `TriggerResult` type — every caller
  is updated atomically by the type deletion.
- `if (result.waitForChecks)` block in webhook-handler.ts (the worker-side
  polling call site + orphan-ack-comment cleanup).
- 3 tests in `github-webhook-handler.test.ts` covering the deleted polling
  code path.

Tightened: review-dispatch-dedup TTL 30 min → 5 min. With the new defer
behavior, dispatches correlate with actually-running workers, so the longer
TTL had no defensive value and amplified the wedged-state incident.

Tests: existing `check-suite-success.test.ts` updated — 5+ assertions of
`waitForChecks: true` flipped to `.toBeUndefined()`; obsolete
`describe('waitForChecks')` block (4 tests) deleted; 2 NEW defer-on-
incomplete cases added (the regression pin for the PR #1245 incident +
a release-no-dedup-claim guard).

Verification:
- npx vitest run --project unit-triggers → 955/955
- npx vitest run --project unit-api      → 1540/1540
- npx vitest run --project unit-core     → 5174/5174
- npm run typecheck + npm run lint clean

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@zbigniewsobiecki zbigniewsobiecki merged commit d470d8b into main May 1, 2026
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants