Promote dev to main: defer check-suite-success until aggregate complete#1247
Merged
zbigniewsobiecki merged 2 commits intomainfrom May 1, 2026
Merged
Promote dev to main: defer check-suite-success until aggregate complete#1247zbigniewsobiecki merged 2 commits intomainfrom
zbigniewsobiecki merged 2 commits intomainfrom
Conversation
…ne (#1245) Co-authored-by: Cascade Bot <bot@cascade.dev>
…plete (#1246) PR #1245 (CI fully green) silently never had a review run. Root cause: the success handler dispatched on the FIRST check_suite event (CodeQL completing at 11:37:43) while CI's slower lint-and-test was still in-progress. The worker spawned, polled 12×10s via `pollWaitForChecks` (`MAX_RETRIES=12 × RETRY_DELAY_MS=10s` = 2 min), bailed at 11:39:17 with `statusCode=0` and `Not all checks passing after polling, skipping agent`. lint-and-test completed cleanly 41 seconds later. Subsequent legitimate `check_suite- completed` events for the SHA all hit the dedup at `check-suite-success.ts` and were silently skipped — the worker's `onBlocked` cleanup ran in the worker's own (empty) in-memory dedup map, never reaching the router process that holds the actual claim. Net: zero review for a clean, mergeable PR; 30-minute window of stuck dedup before the entry aged out. This is not a one-off — it bites every PR where multiple workflows emit separate check_suites and the slower one takes >2 min after the faster one completes. CodeQL completing first is the canonical case. Architectural fix — eliminate worker-side polling. Move the dispatch decision entirely to the router, gated on aggregate `allComplete`. The LAST `check_suite.completed` event for a SHA — whichever polarity — makes the dispatch decision based on full aggregate state. Mirrors the existing `check-suite-failure.handle` defer-on-incomplete shape. Three outcomes now in `check-suite-success.handle`: - !allComplete → skip with "Not all checks complete yet (X/Y still running): <names>". The next check_suite event re-evaluates. No worker spawned; no dedup claimed. - allComplete + anyFailed → dispatch respond-to-ci (existing #1243 fork). - allComplete + !anyFailed → dispatch review (no waitForChecks flag). Eliminates both root causes simultaneously: nothing dispatches prematurely, nothing has to "release on bail" because nothing bails. The asymmetry between the success handler (polled) and failure handler (deferred) is gone. Deleted: - `src/triggers/github/check-polling.ts` and its test (no caller). - `waitForChecks` exported function from check-suite-success (worker-side polling layer no longer needed; `MAX_RETRIES`/`RETRY_DELAY_MS` constants gone with it). - `waitForChecks?: boolean` field from `TriggerResult` type — every caller is updated atomically by the type deletion. - `if (result.waitForChecks)` block in webhook-handler.ts (the worker-side polling call site + orphan-ack-comment cleanup). - 3 tests in `github-webhook-handler.test.ts` covering the deleted polling code path. Tightened: review-dispatch-dedup TTL 30 min → 5 min. With the new defer behavior, dispatches correlate with actually-running workers, so the longer TTL had no defensive value and amplified the wedged-state incident. Tests: existing `check-suite-success.test.ts` updated — 5+ assertions of `waitForChecks: true` flipped to `.toBeUndefined()`; obsolete `describe('waitForChecks')` block (4 tests) deleted; 2 NEW defer-on- incomplete cases added (the regression pin for the PR #1245 incident + a release-no-dedup-claim guard). Verification: - npx vitest run --project unit-triggers → 955/955 - npx vitest run --project unit-api → 1540/1540 - npx vitest run --project unit-core → 5174/5174 - npm run typecheck + npm run lint clean Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Promotes 1 commit from `dev`:
What
Eliminates the worker-side CI polling layer in `check-suite-success`. The router now defers (skips) when aggregate check_runs aren't all complete; the LAST `check_suite.completed` event for a SHA makes the dispatch decision based on full aggregate state. Mirrors the existing `check-suite-failure.handle` shape — symmetric defer-on-incomplete, no asymmetric polling.
Closes the silent-skip bug where PR #1245 (CI fully green) never had a review run because the success handler dispatched on the FIRST event (CodeQL completing while CI's lint-and-test was still running), the worker bailed after a 2-min polling cap, and the dedup blocked all subsequent legitimate retries for 30 min.
Reviewed and approved on #1246 by @nhopeatall. CI green on dev (3/3 workflows expected — same shape as the prior dangling-image-cleanup promotion).
Risk
Low. Architectural shift, but scoped to a single trigger handler. Tests (955 unit-triggers + 1540 unit-api + 5174 unit-core) all pass. The new defer behavior is a strict subset of valid trigger outcomes: every PR that previously got a review still does (when state is allComplete), and a class of PRs that previously dropped their review now reliably get one on the LAST event.
Diff summary
🤖 Generated with Claude Code