
fix(triggers): redis-back review-dispatch dedup to close cross-process gap #1248

Merged

zbigniewsobiecki merged 1 commit into dev from fix/review-dispatch-dedup-cross-process on May 1, 2026

Conversation

@zbigniewsobiecki
Member

What

Move the review-dispatch dedup state from a per-process in-memory Map to Redis (SET key value NX EX <ttl>), so the dedup holds across all processes that participate in dispatch — router, IMPL workers (post-completion-hook), and any future horizontally-scaled router replicas.
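
For reference, the atomic primitive carrying the whole guarantee is a single Redis command; a minimal ioredis sketch (the cascade:review-dedup: namespace and 5-min TTL are from this PR; the helper name and stored value are illustrative — the real module is described under "The architectural fix" below):

```ts
import Redis from "ioredis";

// SET key value NX EX <ttl> returns "OK" for exactly one caller per key within
// the TTL and null for every other caller — across processes, because the
// state lives in Redis rather than in a per-process Map.
async function demoClaim(redis: Redis, dedupKey: string): Promise<boolean> {
  const result = await redis.set(
    `cascade:review-dedup:${dedupKey}`,
    Date.now().toString(),
    "EX",
    300, // 5-min TTL
    "NX",
  );
  return result === "OK";
}
```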

The incident this fixes — ucho/PR #194 (2026-05-01)

CI fully green, single SHA 9ed484dfcc9c95265d55ec9be449f47f813d26a7. Two review runs dispatched 29 s apart, both triggerType=ci-success, both completed normally, both burned LLM tokens.

| Run | Container (process) | Trigger path | Loki signal |
| --- | --- | --- | --- |
| 6e0fe8d0 (started 12:52:14) | IMPL worker cascade-worker-coalesce_ucho_MNG-453_* | post-completion-hook | review (post-completion) completed { reviewDispatchKey: '...:194:9ed4...', trigger: 'post-completion-hook' } |
| 2847d5c3 (started 12:52:43) | cascade-router | check-suite-success | Claimed review dispatch for PR+SHA reviewDispatchKey: '...:194:9ed4...' |

Identical dedup key. Both claimReviewDispatch(key, ...) calls returned true. Why? The pre-Redis dedup at src/triggers/github/review-dispatch-dedup.ts:5 was a module-scoped recentlyDispatched: Map<string, number> — per-process state. The IMPL worker process saw an empty Map and claimed; the router process saw an empty Map and also claimed. No cross-process synchronization at all.

Why PR #1246 didn't catch this

#1246's plan explicitly noted this gap as out-of-scope:

Cross-process dedup via Redis: would also fix Root cause B in isolation but doesn't help Root cause A. With the architectural shift above, the in-memory dedup is sufficient. Skip.

That was wrong. The architectural shift in #1246 closes one path (worker-bail-out wedging the dedup); it does nothing about the post-completion-hook path running in a separate process. The bug class is broader than post-completion-hook ↔ check-suite-success — any path where two processes might independently call claimReviewDispatch for the same (project, PR, SHA) is exposed:

  • post-completion-hook (worker) ↔ check-suite-success (router) — confirmed live on ucho/PR #194.
  • review-requested (router) ↔ post-completion-hook (worker) — would also break.
  • Multi-replica router deployments — when prod scales horizontally, the design assumption "single router process" is undocumented and fragile.

Why also not the work-item lock?

CLAUDE.md notes a per-(projectId, workItemId, agentType) work-item lock at the BullMQ enqueue layer. It IS held in-memory in the router process (src/router/active-workers.ts). It should have rejected the second enqueue.

It didn't because the post-completion-hook enqueues from the IMPL worker process, bypassing the router's lock state entirely. The router's recentlyDispatched Map AND the router's work-item lock are both router-process-local. A worker-process enqueue is invisible to both. That's a follow-up for a separate PR (the work-item lock has more semantics — counters per agent-type, etc. — and lives in active-workers.ts).

The architectural fix

src/triggers/github/review-dispatch-dedup.ts — rewritten around Redis.

  • Drops the in-memory Map<string, number> and cleanupExpiredEntries.
  • Lazy IORedis singleton (mirrors src/queue/cancel.ts:28-37); first claim call connects.
  • claimReviewDispatch / releaseReviewDispatch are now async.
  • Keys namespaced under cascade:review-dedup: so they don't collide with BullMQ or any other Redis user.
  • Fail-closed on Redis errors: claim returns false (treats the call as a duplicate) and Sentry-captures under tag review_dedup_redis_down; see the sketch after this list. Better to skip a legit dispatch than duplicate. Mirrors the spec-017 fail-closed pipeline-capacity-gate posture.
  • 5-min TTL preserved from #1246.
  • Test-only __resetForTests() flushes the namespace.
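
Putting the bullets above together, the module shape looks roughly like this. A sketch only, not the merged code: the real functions take more arguments, and the stored value and exact Sentry tag shape are assumptions. The lazy process.env.REDIS_URL read shown here is the same read that the follow-up #1250 (further down this page) later moved to routerConfig.redisUrl.

```ts
import Redis from "ioredis";
import * as Sentry from "@sentry/node";

const NAMESPACE = "cascade:review-dedup:";
const TTL_SECONDS = 5 * 60; // preserved from #1246

let client: Redis | null = null;

// Lazy singleton, mirroring src/queue/cancel.ts: the first claim/release call
// opens the connection.
function getRedis(): Redis {
  if (!client) {
    const url = process.env.REDIS_URL;
    if (!url) throw new Error("REDIS_URL is required for review-dispatch dedup");
    client = new Redis(url);
  }
  return client;
}

export async function claimReviewDispatch(key: string): Promise<boolean> {
  try {
    const result = await getRedis().set(NAMESPACE + key, "1", "EX", TTL_SECONDS, "NX");
    return result === "OK";
  } catch (error) {
    // Fail closed: treat the dispatch as a duplicate rather than risk doubling it.
    Sentry.captureException(error, { tags: { review_dedup_redis_down: "true" } });
    return false;
  }
}

export async function releaseReviewDispatch(key: string): Promise<void> {
  try {
    await getRedis().del(NAMESPACE + key);
  } catch (error) {
    Sentry.captureException(error, { tags: { review_dedup_redis_down: "true" } });
  }
}
```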

Callers — minimal mechanical await additions:

  • check-suite-success.ts:240 — await claimReviewDispatch(...).
  • check-suite-success.ts:274 — onBlocked: () => { void releaseReviewDispatch(...) } (callback signature is sync; fire-and-forget the release — see the sketch after this list).
  • review-requested.ts:101-102 — both calls become await.
  • review-requested.ts:131 — same fire-and-forget shape.
  • agent-execution.ts:279 (post-completion-hook) — await.
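
The fire-and-forget shape those bullets describe, in isolation (makeOnBlocked is a hypothetical wrapper for illustration; releaseReviewDispatch and the sync-callback constraint are from this PR):

```ts
import { releaseReviewDispatch } from "./review-dispatch-dedup";

// onBlocked's signature stays synchronous, so the async release is floated
// with `void` rather than awaited; if it ever fails, the 5-min TTL clears the
// stale claim on its own.
function makeOnBlocked(reviewDispatchKey: string): () => void {
  return () => {
    void releaseReviewDispatch(reviewDispatchKey);
  };
}
```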

Tests

  • tests/unit/triggers/github/review-dispatch-dedup.test.ts rewritten around a vi.mock('ioredis', ...) factory whose closure captures a single in-memory store. Every new Redis(...) instance shares that backend, so the cross-process invariant is trivially testable: instantiate two IORedis clients and verify the second SET NX EX for the same key returns null. This is the regression pin for ucho/PR #194; a minimal version of the mock is sketched after this list.
  • New fails closed when Redis errors, returning false and capturing to Sentry test covers the Sentry-capture path under tag review_dedup_redis_down.
  • check-suite-success.test.ts and review-requested.test.ts: replaced the per-test recentlyDispatched.clear() with vi.mock of the dedup module, exposing mockClaimReviewDispatch / mockReleaseReviewDispatch spies. Tests that assert the dedup-skip path now use mockClaimReviewDispatch.mockResolvedValueOnce(true).mockResolvedValueOnce(false) to simulate the SET NX EX race.
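
A minimal version of that mock setup, assuming vitest's vi.hoisted/vi.mock (key format and TTL come from this PR; the test names and client URLs are illustrative):

```ts
import { beforeEach, describe, expect, it, vi } from "vitest";
import Redis from "ioredis";

// One in-memory store shared by every `new Redis(...)` the mock hands out.
// vi.hoisted makes it visible inside the hoisted vi.mock factory.
const { store } = vi.hoisted(() => ({ store: new Map<string, string>() }));

vi.mock("ioredis", () => {
  class MockRedis {
    // Approximates `SET key value EX <ttl> NX`: succeed only if the key is absent.
    async set(key: string, value: string, ..._opts: unknown[]): Promise<"OK" | null> {
      if (store.has(key)) return null;
      store.set(key, value);
      return "OK";
    }
    async del(key: string): Promise<number> {
      return store.delete(key) ? 1 : 0;
    }
  }
  return { default: MockRedis };
});

describe("cross-process review-dispatch dedup", () => {
  beforeEach(() => store.clear());

  it("lets only one of two clients claim the same PR+SHA key", async () => {
    // Two clients stand in for the IMPL worker and the router processes.
    const workerClient = new Redis("redis://unused");
    const routerClient = new Redis("redis://unused");
    const key = "cascade:review-dedup:zbigniewsobiecki/ucho:194:9ed484df";

    expect(await workerClient.set(key, "1", "EX", 300, "NX")).toBe("OK");
    expect(await routerClient.set(key, "1", "EX", 300, "NX")).toBeNull();
  });
});
```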

Out of scope (intentionally — flagged for follow-up)

  • Move work-item lock to Redis. Same class of cross-process gap, larger surface (counters, TTL semantics, cross-cutting with orphan cleanup). Defer; revisit if the same bug class recurs.
  • Backfill dedup state on router restart. Not needed — the 5-min TTL means any stale claim self-clears within minutes.
  • Cross-region Redis. Single-region deployment today.

Test plan

  • npx vitest run --project unit-triggers --project unit-api --project unit-core — 7678/7678 pass
  • npm run typecheck clean
  • npm run lint clean (13 pre-existing warnings, none on changed files)
  • CI green
  • Reviewer approval (@nhopeatall)
  • After deploy: replay the ucho/PR #194 scenario by triggering an impl agent via Linear status-change AND letting CI fire. Confirm only ONE review run is dispatched; Loki should show Review already dispatched for this PR+SHA, skipping from whichever path arrives second.

🤖 Generated with Claude Code

…s gap

ucho/PR #194 (2026-05-01, CI fully green, single SHA) had two review runs
dispatched 29 s apart — both `triggerType=ci-success`, same engine/model/
workItemId, both completed normally, both burned LLM tokens and posted
reviews. Confirmed verbatim in Loki: identical `reviewDispatchKey`
(`zbigniewsobiecki/ucho:194:9ed484df...`) appeared in claim logs from TWO
processes — one from `agent-execution.ts:279` post-completion-hook running
in the IMPL worker container, one from `check-suite-success.ts:240` running
in the cascade-router. Both `claimReviewDispatch(key, ...)` returned `true`
because the dedup `recentlyDispatched` Map at `review-dispatch-dedup.ts:5`
was module-scoped, in-memory, per-process state. Two processes = two
independent Maps = no dedup. Pure waste.

PR #1246 explicitly called this out as out-of-scope ("With the architectural
shift above, the in-memory dedup is sufficient. Skip."). That was wrong:
the architectural shift in #1246 closes the worker-bail-out path but does
nothing about the post-completion-hook path running in a different process.
The bug class is broader than ucho/PR #194 — anywhere two processes might
independently call `claimReviewDispatch` for the same (project, PR, SHA) is
exposed: post-completion-hook ↔ check-suite-success (live now),
review-requested ↔ post-completion-hook (waiting to bite), and any future
horizontally-scaled router replicas.

Architectural fix — move the dedup state to Redis so it's shared across all
processes. Atomic primitive: `SET key value NX EX <ttl>` returns `'OK'`
exactly once per key within the TTL, regardless of how many processes race
it. No application-level locking, no race window.

`src/triggers/github/review-dispatch-dedup.ts` — rewritten:
- Drops the in-memory `Map<string, number>` and `cleanupExpiredEntries`.
- Lazy IORedis singleton (mirrors `src/queue/cancel.ts:28-37`).
- `claimReviewDispatch` / `releaseReviewDispatch` are now async.
- Keys namespaced under `cascade:review-dedup:`.
- Fail-closed on Redis errors: `claim` returns `false` and Sentry-captures
  under tag `review_dedup_redis_down`. Better to skip a legit dispatch than
  duplicate. Mirrors spec-017 fail-closed pipeline-capacity-gate posture.
- 5-min TTL preserved from #1246. The pre-Redis "30-min TTL clears stale
  entries on router restart" CLAUDE.md note is now obsolete.
- Test-only `__resetForTests()` flushes the namespace.

Callers — minimal mechanical await additions:
- `check-suite-success.ts:240` — `await claimReviewDispatch(...)`.
- `check-suite-success.ts:274` — `onBlocked: () => { void releaseReviewDispatch(...) }`
  (callback signature is sync; fire-and-forget the release).
- `review-requested.ts:101-102` — both calls become `await`.
- `review-requested.ts:131` — same fire-and-forget shape.
- `agent-execution.ts:279` (post-completion-hook) — `await`.

Tests:
- `tests/unit/triggers/github/review-dispatch-dedup.test.ts` rewritten
  around a vi.mock('ioredis', ...) factory whose closure captures a single
  in-memory store. Every `new Redis(...)` instance shares that backend, so
  the cross-process invariant is trivially testable: instantiate two
  IORedis clients and verify the second `SET NX EX` for the same key
  returns `null`. **This is the regression pin for ucho/PR #194.**
- New "fails closed when Redis errors" test covers the Sentry-capture
  path under tag `review_dedup_redis_down`.
- `check-suite-success.test.ts` and `review-requested.test.ts`: replaced
  the per-test `recentlyDispatched.clear()` with `vi.mock` of the dedup
  module. Tests that assert the dedup-skip path now use
  `mockClaimReviewDispatch.mockResolvedValueOnce(true).mockResolvedValueOnce(false)`
  to simulate the SET NX EX race.

Verification:
- npx vitest run --project unit-triggers --project unit-api --project unit-core
  → 7678/7678 pass.
- npm run typecheck + npm run lint clean.
- Direct-instance test demonstrates two IORedis clients against the shared
  backend correctly reject the second claim — the production scenario
  pinned in code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator

@nhopeatall nhopeatall left a comment


Summary

Sound fix for a confirmed cross-process dedup gap. The Redis-backed SET NX EX approach is the right primitive, follows the existing src/queue/cancel.ts lazy-singleton pattern, and the fail-closed error handling aligns with the spec-017 posture established elsewhere. Tests directly pin the production incident scenario.

Should Fix

  • tests/unit/triggers/shared/agent-execution.test.ts:82,1061,1163 (not in diff) — this test file still uses mockReturnValue(true) / mockReturnValueOnce(false) for mockClaimReviewDispatch, which is now an async function. It works because await <non-Promise> wraps the value in Promise.resolve(), but it should be mockResolvedValue(true) / mockResolvedValueOnce(false) for consistency with the other two test files updated in this PR (check-suite-success.test.ts and review-requested.test.ts). Worth fixing as a drive-by in this PR to keep the mock patterns uniform.
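
In vitest terms the distinction is small (mockClaimReviewDispatch here is a hypothetical stand-in for the spy the mocked dedup module exposes):

```ts
import { vi } from "vitest";

// Stand-in for the spy exposed by the dedup-module mock.
const mockClaimReviewDispatch = vi.fn(async (_key: string) => true);

// Preferred: states the async intent and matches check-suite-success.test.ts
// and review-requested.test.ts. The older mockReturnValueOnce(false) only
// keeps passing because `await` wraps non-promises in Promise.resolve().
mockClaimReviewDispatch.mockResolvedValueOnce(false);
```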

🕵️ claude-code · claude-opus-4-6 · run details

@codecov

codecov Bot commented May 1, 2026

Codecov Report

❌ Patch coverage is 82.35294% with 12 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/triggers/github/review-dispatch-dedup.ts | 84.48% | 9 Missing ⚠️ |
| src/triggers/github/review-requested.ts | 40.00% | 3 Missing ⚠️ |


@zbigniewsobiecki zbigniewsobiecki merged commit aa33d99 into dev May 1, 2026
9 checks passed
@zbigniewsobiecki zbigniewsobiecki deleted the fix/review-dispatch-dedup-cross-process branch May 1, 2026 13:32
zbigniewsobiecki added a commit that referenced this pull request May 1, 2026
…crub (#1250)

PR #1248's Redis-backed dedup throws inside cascade-worker processes:

  ERROR Review-dispatch dedup Redis call failed — failing closed
    error: 'Error: REDIS_URL is required for review-dispatch dedup'

Confirmed live at 2026-05-01T17:25:44 in worker
`cascade-worker-coalesce_ucho_MNG-461_*`. Sentry tag: `review_dedup_redis_down`.

Root cause: `src/utils/envScrub.ts:13-18` lists REDIS_URL in
`SENSITIVE_ENV_KEYS`. Worker-entry calls `scrubSensitiveEnv()` early in
startup, which does `delete process.env.REDIS_URL` to keep infra secrets
out of agent-spawned subprocesses. The dedup module reads
`process.env.REDIS_URL` lazily at first dispatch — well after the scrub —
so it sees `undefined` and throws. The fail-closed handler returns false,
and the post-completion-hook silently logs `Skipping post-completion
review: already dispatched` (which looks identical to the working dedup's
"correctly rejected duplicate" log, hence the bug evaded the prod canary).

Why it didn't bite immediately: the post-completion-hook short-circuits
when CI isn't all passing at impl exit
(`agent-execution.ts:269-275`), so the bug only triggers on the narrow
timing where CI is fully green BEFORE impl finishes. Small fraction of
runs.

Why router is unaffected: cascade-router never calls scrubSensitiveEnv.
PR #196's review (verified 1-run earlier) was dispatched from the router
via check-suite-success — that path was never broken.

Fix: read the URL via `routerConfig.redisUrl` instead of
`process.env.REDIS_URL`. routerConfig is captured at module-load time in
`src/router/config.ts:137-138`, well before scrubSensitiveEnv runs. The
cached value survives the scrub the same way the DB pool's cached
connection string does — exactly the pattern envScrub.ts's docstring
describes.
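
The shape of that swap, roughly (the import path and getRedisUrl helper are illustrative; routerConfig.redisUrl and the module-load-time capture are from the commit):

```ts
// Before (#1248): lazy read at first dispatch — in a worker, scrubSensitiveEnv()
// has already deleted it by then.
//   const url = process.env.REDIS_URL;

// After (#1250): routerConfig is built when src/router/config.ts loads, before
// the scrub runs, so the cached value survives it.
import { routerConfig } from "../../router/config";

function getRedisUrl(): string {
  const url = routerConfig.redisUrl;
  if (!url) {
    throw new Error("REDIS_URL is required for review-dispatch dedup");
  }
  return url;
}
```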

Two-line source change + new regression-pin test that claims, deletes
process.env.REDIS_URL, and claims again — both must succeed (second
deduped) without throwing `REDIS_URL is required`.

Verification: vitest 2497/2497, typecheck clean, lint clean.

Out of scope (follow-up): `src/queue/cancel.ts` has the same
`process.env.REDIS_URL` lazy pattern but is only called from cascade-router
and cascade-dashboard (neither scrubs env), so it doesn't bite today.
Apply the same routerConfig swap for symmetry in a separate PR.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
