fix(triggers): redis-back review-dispatch dedup to close cross-process gap #1248
Merged
zbigniewsobiecki merged 1 commit into dev on May 1, 2026
Conversation
…s gap

ucho/PR #194 (2026-05-01, CI fully green, single SHA) had two review runs dispatched 29 s apart — both `triggerType=ci-success`, same engine/model/workItemId; both completed normally, both burned LLM tokens and posted reviews.

Confirmed verbatim in Loki: the identical `reviewDispatchKey` (`zbigniewsobiecki/ucho:194:9ed484df...`) appeared in claim logs from TWO processes — one from `agent-execution.ts:279` (post-completion-hook, running in the IMPL worker container), one from `check-suite-success.ts:240` (running in the cascade-router). Both `claimReviewDispatch(key, ...)` calls returned `true` because the dedup `recentlyDispatched` Map at `review-dispatch-dedup.ts:5` was module-scoped, in-memory, per-process state. Two processes = two independent Maps = no dedup. Pure waste.

PR #1246 explicitly called this out as out-of-scope ("With the architectural shift above, the in-memory dedup is sufficient. Skip."). That was wrong: the architectural shift in #1246 closes the worker-bail-out path but does nothing about the post-completion-hook path running in a different process. The bug class is broader than ucho/PR #194 — anywhere two processes might independently call `claimReviewDispatch` for the same (project, PR, SHA) is exposed: post-completion-hook ↔ check-suite-success (live now), review-requested ↔ post-completion-hook (waiting to bite), and any future horizontally-scaled router replicas.

Architectural fix: move the dedup state to Redis so it is shared across all processes. Atomic primitive: `SET key value NX EX <ttl>` returns `'OK'` exactly once per key within the TTL, regardless of how many processes race it. No application-level locking, no race window.

`src/triggers/github/review-dispatch-dedup.ts` — rewritten:

- Drops the in-memory `Map<string, number>` and `cleanupExpiredEntries`.
- Lazy IORedis singleton (mirrors `src/queue/cancel.ts:28-37`).
- `claimReviewDispatch` / `releaseReviewDispatch` are now async.
- Keys namespaced under `cascade:review-dedup:`.
- Fail-closed on Redis errors: `claim` returns `false` and Sentry-captures under tag `review_dedup_redis_down`. Better to skip a legit dispatch than duplicate. Mirrors the spec-017 fail-closed pipeline-capacity-gate posture.
- 5-min TTL preserved from #1246. The pre-Redis "30-min TTL clears stale entries on router restart" CLAUDE.md note is now obsolete.
- Test-only `__resetForTests()` flushes the namespace.

Callers — minimal mechanical `await` additions:

- `check-suite-success.ts:240` — `await claimReviewDispatch(...)`.
- `check-suite-success.ts:274` — `onBlocked: () => { void releaseReviewDispatch(...) }` (callback signature is sync; fire-and-forget the release).
- `review-requested.ts:101-102` — both calls become `await`.
- `review-requested.ts:131` — same fire-and-forget shape.
- `agent-execution.ts:279` (post-completion-hook) — `await`.

Tests:

- `tests/unit/triggers/github/review-dispatch-dedup.test.ts` rewritten around a `vi.mock('ioredis', ...)` factory whose closure captures a single in-memory store. Every `new Redis(...)` instance shares that backend, so the cross-process invariant is trivially testable: instantiate two IORedis clients and verify the second `SET NX EX` for the same key returns `null`. **This is the regression pin for ucho/PR #194.**
- New "fails closed when Redis errors" test covers the Sentry-capture path under tag `review_dedup_redis_down`.
- `check-suite-success.test.ts` and `review-requested.test.ts`: replaced the per-test `recentlyDispatched.clear()` with a `vi.mock` of the dedup module. Tests that assert the dedup-skip path now use `mockClaimReviewDispatch.mockResolvedValueOnce(true).mockResolvedValueOnce(false)` to simulate the `SET NX EX` race.
- Direct-instance test demonstrates two IORedis clients against the shared backend correctly rejecting the second claim — the production scenario pinned in code.

Verification:

- `npx vitest run --project unit-triggers --project unit-api --project unit-core` → 7678/7678 pass.
- `npm run typecheck` and `npm run lint` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
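The atomic claim described above can be sketched in a few lines. This is a minimal illustration, not the shipped module: `FakeRedis` is a hypothetical in-memory stand-in for the slice of the ioredis API used here (`set` with `EX`/`NX`, `del`), so the sketch runs without a live Redis server; the real code holds a lazy IORedis singleton instead.

```typescript
// FakeRedis is a hypothetical stand-in for the slice of ioredis used here,
// so the sketch runs without a live Redis server.
class FakeRedis {
  private static store = new Map<string, number>(); // key -> expiry (ms epoch)

  async set(
    key: string,
    _value: string,
    _ex: "EX",
    ttlSeconds: number,
    _nx: "NX",
  ): Promise<"OK" | null> {
    const now = Date.now();
    const expiry = FakeRedis.store.get(key);
    if (expiry !== undefined && expiry > now) return null; // NX: key already set
    FakeRedis.store.set(key, now + ttlSeconds * 1000);
    return "OK"; // first writer within the TTL wins
  }

  async del(key: string): Promise<number> {
    return FakeRedis.store.delete(key) ? 1 : 0;
  }
}

const PREFIX = "cascade:review-dedup:";
const TTL_SECONDS = 5 * 60; // 5-min TTL preserved from #1246

let client: FakeRedis | undefined;
function getClient(): FakeRedis {
  if (!client) client = new FakeRedis(); // lazy singleton, per the PR
  return client;
}

// Returns true exactly once per key within the TTL, however many callers race.
async function claimReviewDispatch(key: string): Promise<boolean> {
  try {
    const result = await getClient().set(PREFIX + key, "1", "EX", TTL_SECONDS, "NX");
    return result === "OK";
  } catch {
    // Fail closed: treat the call as a duplicate rather than risk two reviews.
    // (The real module also Sentry-captures under review_dedup_redis_down.)
    return false;
  }
}

async function releaseReviewDispatch(key: string): Promise<void> {
  await getClient().del(PREFIX + key);
}
```

Because `SET ... NX` is a single Redis command, the check-and-set cannot interleave between processes; the dedup guarantee needs no application-level lock.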
nhopeatall (Collaborator) approved these changes on May 1, 2026 and left a comment:
Summary
Sound fix for a confirmed cross-process dedup gap. The Redis-backed SET NX EX approach is the right primitive, follows the existing src/queue/cancel.ts lazy-singleton pattern, and the fail-closed error handling aligns with the spec-017 posture established elsewhere. Tests directly pin the production incident scenario.
Should Fix
- `tests/unit/triggers/shared/agent-execution.test.ts:82,1061,1163` (not in diff) — this test file still uses `mockReturnValue(true)` / `mockReturnValueOnce(false)` for `mockClaimReviewDispatch`, which is now an async function. It works because `await <non-Promise>` wraps the value in `Promise.resolve()`, but it should be `mockResolvedValue(true)` / `mockResolvedValueOnce(false)` for consistency with the other two test files updated in this PR (`check-suite-success.test.ts` and `review-requested.test.ts`). Worth fixing as a drive-by in this PR to keep the mock patterns uniform.
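The `await <non-Promise>` point above is easy to verify in isolation. A minimal sketch (the function names below are illustrative analogues of the two mock styles, not code from the repo):

```typescript
// mockReturnValue(true) analogue: a sync function returning a bare boolean.
function mockReturnStyle(): boolean {
  return true;
}

// mockResolvedValue(true) analogue: explicitly returns a Promise<boolean>.
function mockResolvedStyle(): Promise<boolean> {
  return Promise.resolve(true);
}

async function callerAwaitsBoth(): Promise<[boolean, boolean]> {
  // `await` wraps a non-Promise in Promise.resolve(), so both shapes
  // normalize to the same resolved value at the call site.
  return [await mockReturnStyle(), await mockResolvedStyle()];
}
```

Both styles behave identically under `await`; `mockResolvedValue` is still preferable because it states the async contract explicitly.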
🕵️ claude-code · claude-opus-4-6 · run details
Codecov Report: ❌ Patch coverage is
This was referenced May 1, 2026
zbigniewsobiecki added a commit that referenced this pull request on May 1, 2026:
…crub (#1250)

PR #1248's Redis-backed dedup throws inside cascade-worker processes:

ERROR Review-dispatch dedup Redis call failed — failing closed error: 'Error: REDIS_URL is required for review-dispatch dedup'

Confirmed live at 2026-05-01T17:25:44 in worker `cascade-worker-coalesce_ucho_MNG-461_*`. Sentry tag: `review_dedup_redis_down`.

Root cause: `src/utils/envScrub.ts:13-18` lists REDIS_URL in `SENSITIVE_ENV_KEYS`. Worker-entry calls `scrubSensitiveEnv()` early in startup, which does `delete process.env.REDIS_URL` to keep infra secrets out of agent-spawned subprocesses. The dedup module reads `process.env.REDIS_URL` lazily at first dispatch — well after the scrub — so it sees `undefined` and throws. The fail-closed handler returns false, and the post-completion-hook silently logs `Skipping post-completion review: already dispatched`, which looks identical to the working dedup's "correctly rejected duplicate" log; hence the bug evaded the prod canary.

Why it didn't bite immediately: the post-completion-hook short-circuits when CI isn't all passing at impl exit (`agent-execution.ts:269-275`), so the bug only triggers in the narrow window where CI is fully green BEFORE impl finishes. Small fraction of runs.

Why the router is unaffected: cascade-router never calls `scrubSensitiveEnv`. PR #196's review (verified one run earlier) was dispatched from the router via check-suite-success — that path was never broken.

Fix: read the URL via `routerConfig.redisUrl` instead of `process.env.REDIS_URL`. routerConfig is captured at module-load time in `src/router/config.ts:137-138`, well before `scrubSensitiveEnv` runs. The cached value survives the scrub the same way the DB pool's cached connection string does — exactly the pattern envScrub.ts's docstring describes.

Two-line source change plus a new regression-pin test that claims, deletes `process.env.REDIS_URL`, and claims again — both calls must succeed (the second deduped) without throwing `REDIS_URL is required`.

Verification: vitest 2497/2497, typecheck clean, lint clean.

Out of scope (follow-up): `src/queue/cancel.ts` has the same `process.env.REDIS_URL` lazy pattern but is only called from cascade-router and cascade-dashboard (neither scrubs env), so it doesn't bite today. Apply the same routerConfig swap for symmetry in a separate PR.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
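The capture-at-load pattern that fix relies on can be demonstrated standalone. The names below (`routerConfig`, `scrubSensitiveEnv`, `lazyEnvRead`, `configRead`) are simplified stand-ins for the real modules, not their actual implementations:

```typescript
// Simulate the worker environment before any module loads.
process.env.REDIS_URL = "redis://localhost:6379";

// Captured at module-load time (like src/router/config.ts): the value is
// copied out of process.env into a plain object.
const routerConfig = { redisUrl: process.env.REDIS_URL };

// Simplified stand-in for the worker-entry scrub that deletes infra secrets.
function scrubSensitiveEnv(): void {
  delete process.env.REDIS_URL;
}

// The pre-fix, broken pattern: a lazy read of process.env at first dispatch,
// which happens well after the scrub has run.
function lazyEnvRead(): string {
  const url = process.env.REDIS_URL;
  if (!url) throw new Error("REDIS_URL is required for review-dispatch dedup");
  return url;
}

// The fixed pattern: read the cached copy, which the scrub cannot touch.
function configRead(): string {
  return routerConfig.redisUrl as string;
}

scrubSensitiveEnv(); // worker startup scrubs the env...
```

After the scrub, `lazyEnvRead()` throws while `configRead()` still returns the URL, which is exactly the asymmetry the incident exposed.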
What
Move the review-dispatch dedup state from a per-process in-memory `Map` to Redis (`SET key value NX EX <ttl>`), so the dedup holds across all processes that participate in dispatch — router, IMPL workers (post-completion-hook), and any future horizontally-scaled router replicas.

The incident this fixes — ucho/PR #194 (2026-05-01)

CI fully green, single SHA `9ed484dfcc9c95265d55ec9be449f47f813d26a7`. Two review runs dispatched 29 s apart, both `triggerType=ci-success`, both completed normally, both burned LLM tokens.

- Run `6e0fe8d0` (started 12:52:14, `cascade-worker-coalesce_ucho_MNG-453_*`): `review (post-completion) completed { reviewDispatchKey: '...:194:9ed4...', trigger: 'post-completion-hook' }`
- Run `2847d5c3` (started 12:52:43): `Claimed review dispatch for PR+SHA reviewDispatchKey: '...:194:9ed4...'`

Identical dedup key. Both `claimReviewDispatch(key, ...)` calls returned `true`. Why? The pre-Redis dedup at `src/triggers/github/review-dispatch-dedup.ts:5` was a module-scoped `recentlyDispatched: Map<string, number>` — per-process state. The IMPL worker process saw an empty Map and claimed; the router process saw an empty Map and also claimed. No cross-process synchronization at all.

Why PR #1246 didn't catch this
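The failure mode reduces to a few lines. A sketch (`makeProcessLocalDedup` is an illustrative name; the real module held the Map at module scope, which amounts to the same per-process state):

```typescript
// The pre-Redis dedup, reduced to its essence: state held in a Map that
// each OS process instantiates independently.
function makeProcessLocalDedup() {
  const recentlyDispatched = new Map<string, number>(); // per-process!
  return (key: string, ttlMs: number): boolean => {
    const now = Date.now();
    const seen = recentlyDispatched.get(key);
    if (seen !== undefined && now - seen < ttlMs) return false; // duplicate
    recentlyDispatched.set(key, now);
    return true; // claimed
  };
}

// Two processes = two independent Maps. Each factory call below stands in
// for a separate process loading the module.
const implWorkerClaim = makeProcessLocalDedup(); // IMPL worker container
const routerClaim = makeProcessLocalDedup();     // cascade-router
```

Both claims succeed because neither Map can see the other's entry, which is exactly what the Loki logs showed.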
#1246's plan explicitly noted this gap as out-of-scope:

> With the architectural shift above, the in-memory dedup is sufficient. Skip.

That was wrong. The architectural shift in #1246 closes one path (worker-bail-out wedging the dedup); it does nothing about the post-completion-hook path running in a separate process. The bug class is broader than post-completion-hook ↔ check-suite-success — anywhere two processes might independently call `claimReviewDispatch` for the same (project, PR, SHA) is exposed:

- post-completion-hook ↔ check-suite-success (live now)
- review-requested ↔ post-completion-hook (waiting to bite)
- any future horizontally-scaled router replicas

Why also not the work-item lock?
CLAUDE.md notes a per-(projectId, workItemId, agentType) work-item lock at the BullMQ enqueue layer. It IS held in-memory in the router process (`src/router/active-workers.ts`). It should have rejected the second enqueue.

It didn't, because the post-completion-hook enqueues from the IMPL worker process, bypassing the router's lock state entirely. The router's `recentlyDispatched` Map AND the router's work-item lock are both router-process-local; a worker-process enqueue is invisible to both. That's a follow-up for a separate PR (the work-item lock has more semantics — counters per agent-type, etc. — and lives in `active-workers.ts`).

The architectural fix
`src/triggers/github/review-dispatch-dedup.ts` — rewritten around Redis.

- Drops the in-memory `Map<string, number>` and `cleanupExpiredEntries`.
- Lazy IORedis singleton (mirrors `src/queue/cancel.ts:28-37`); the first `claim` call connects.
- `claimReviewDispatch` / `releaseReviewDispatch` are now async.
- Keys namespaced under `cascade:review-dedup:` so they don't collide with BullMQ or any other Redis user.
- Fail-closed on Redis errors: `claim` returns `false` (treats the call as a duplicate) and Sentry-captures under tag `review_dedup_redis_down`. Better to skip a legit dispatch than duplicate. Mirrors the spec-017 fail-closed pipeline-capacity-gate posture.
- Test-only `__resetForTests()` flushes the namespace.

Callers — minimal mechanical await additions:
- `check-suite-success.ts:240` — `await claimReviewDispatch(...)`.
- `check-suite-success.ts:274` — `onBlocked: () => { void releaseReviewDispatch(...) }` (callback signature is sync; fire-and-forget the release).
- `review-requested.ts:101-102` — both calls become `await`.
- `review-requested.ts:131` — same fire-and-forget shape.
- `agent-execution.ts:279` (post-completion-hook) — `await`.

Tests
- `tests/unit/triggers/github/review-dispatch-dedup.test.ts` rewritten around a `vi.mock('ioredis', ...)` factory whose closure captures a single in-memory store. Every `new Redis(...)` instance shares that backend, so the cross-process invariant is trivially testable: instantiate two IORedis clients and verify the second `SET NX EX` for the same key returns `null`. This is the regression pin for ucho/PR #194.
- A "fails closed when Redis errors, returning false and capturing to Sentry" test covers the Sentry-capture path under tag `review_dedup_redis_down`.
- `check-suite-success.test.ts` and `review-requested.test.ts`: replaced the per-test `recentlyDispatched.clear()` with a `vi.mock` of the dedup module, exposing `mockClaimReviewDispatch` / `mockReleaseReviewDispatch` spies. Tests that assert the dedup-skip path now use `mockClaimReviewDispatch.mockResolvedValueOnce(true).mockResolvedValueOnce(false)` to simulate the `SET NX EX` race.

Out of scope (intentionally — flagged for follow-up)
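The shared-backend trick in that test file can be shown without vitest: a mock class whose store lives outside the instances, so every client (each standing in for a separate process) hits the same state. `MockRedis` and `sharedStore` are illustrative names, not the PR's actual mock factory:

```typescript
// One backend for all clients, captured in module scope, exactly the role
// the closure plays in the vi.mock('ioredis', ...) factory.
const sharedStore = new Map<string, string>();

class MockRedis {
  // Models SET key value EX <ttl> NX: "OK" on first write, null afterward.
  // (TTL expiry is omitted; the dedup race happens within the TTL anyway.)
  async set(
    key: string,
    value: string,
    _ex: "EX",
    _ttlSeconds: number,
    _nx: "NX",
  ): Promise<"OK" | null> {
    if (sharedStore.has(key)) return null;
    sharedStore.set(key, value);
    return "OK";
  }
}

// Two clients stand in for two OS processes (IMPL worker and router).
const workerClient = new MockRedis();
const routerClient = new MockRedis();
```

Because both instances share `sharedStore`, the second `SET NX` loses, which is the cross-process invariant the unit tests pin.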
Test plan
- `npx vitest run --project unit-triggers --project unit-api --project unit-core` → 7678/7678 pass
- `npm run typecheck` clean
- `npm run lint` clean (13 pre-existing warnings, none on changed files)
- Logs `Review already dispatched for this PR+SHA, skipping` from whichever path arrives second.

🤖 Generated with Claude Code