
fix(triggers): redis-back review-dispatch dedup to close cross-process gap #1248

Merged

zbigniewsobiecki merged 1 commit into dev from fix/review-dispatch-dedup-cross-process on May 1, 2026

Conversation

@zbigniewsobiecki
Member

What

Move the review-dispatch dedup state from a per-process in-memory Map to Redis (SET key value NX EX <ttl>), so the dedup holds across all processes that participate in dispatch — router, IMPL workers (post-completion-hook), and any future horizontally-scaled router replicas.
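
For reference, the atomic primitive carrying the whole guarantee is a single Redis command; a minimal ioredis sketch (the cascade:review-dedup: namespace and 5-min TTL are from this PR; the helper name and stored value are illustrative — the real module is described under "The architectural fix" below):

```ts
import Redis from "ioredis";

// SET key value NX EX <ttl> returns "OK" for exactly one caller per key within
// the TTL and null for every other caller — across processes, because the
// state lives in Redis rather than in a per-process Map.
async function demoClaim(redis: Redis, dedupKey: string): Promise<boolean> {
  const result = await redis.set(
    `cascade:review-dedup:${dedupKey}`,
    Date.now().toString(),
    "EX",
    300, // 5-min TTL
    "NX",
  );
  return result === "OK";
}
```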

The incident this fixes — ucho/PR #194 (2026-05-01)

CI fully green, single SHA 9ed484dfcc9c95265d55ec9be449f47f813d26a7. Two review runs dispatched 29 s apart, both triggerType=ci-success, both completed normally, both burned LLM tokens.

| Run | Container (process) | Trigger path | Loki signal |
| --- | --- | --- | --- |
| 6e0fe8d0 (started 12:52:14) | IMPL worker cascade-worker-coalesce_ucho_MNG-453_* | post-completion-hook | review (post-completion) completed { reviewDispatchKey: '...:194:9ed4...', trigger: 'post-completion-hook' } |
| 2847d5c3 (started 12:52:43) | cascade-router | check-suite-success | Claimed review dispatch for PR+SHA reviewDispatchKey: '...:194:9ed4...' |

Identical dedup key. Both claimReviewDispatch(key, ...) calls returned true. Why? The pre-Redis dedup at src/triggers/github/review-dispatch-dedup.ts:5 was a module-scoped recentlyDispatched: Map<string, number> — per-process state. The IMPL worker process saw an empty Map and claimed; the router process saw an empty Map and also claimed. No cross-process synchronization at all.

Why PR #1246 didn't catch this

#1246's plan explicitly noted this gap as out-of-scope:

Cross-process dedup via Redis: would also fix Root cause B in isolation but doesn't help Root cause A. With the architectural shift above, the in-memory dedup is sufficient. Skip.

That was wrong. The architectural shift in #1246 closes one path (worker-bail-out wedging the dedup); it does nothing about the post-completion-hook path running in a separate process. The bug class is broader than post-completion-hook ↔ check-suite-success — any path where two processes might independently call claimReviewDispatch for the same (project, PR, SHA) is exposed:

  • post-completion-hook (worker) ↔ check-suite-success (router) — confirmed live on ucho/PR #194.
  • review-requested (router) ↔ post-completion-hook (worker) — would also break.
  • Multi-replica router deployments — when prod scales horizontally, the design assumption "single router process" is undocumented and fragile.

Why also not the work-item lock?

CLAUDE.md notes a per-(projectId, workItemId, agentType) work-item lock at the BullMQ enqueue layer. It IS held in-memory in the router process (src/router/active-workers.ts). It should have rejected the second enqueue.

It didn't because the post-completion-hook enqueues from the IMPL worker process, bypassing the router's lock state entirely. The router's recentlyDispatched Map AND the router's work-item lock are both router-process-local. A worker-process enqueue is invisible to both. That's a follow-up for a separate PR (the work-item lock has more semantics — counters per agent-type, etc. — and lives in active-workers.ts).

The architectural fix

src/triggers/github/review-dispatch-dedup.ts — rewritten around Redis.

  • Drops the in-memory Map<string, number> and cleanupExpiredEntries.
  • Lazy IORedis singleton (mirrors src/queue/cancel.ts:28-37); first claim call connects.
  • claimReviewDispatch / releaseReviewDispatch are now async.
  • Keys namespaced under cascade:review-dedup: so they don't collide with BullMQ or any other Redis user.
  • Fail-closed on Redis errors: claim returns false (treats the call as a duplicate) and Sentry-captures under tag review_dedup_redis_down; see the sketch after this list. Better to skip a legit dispatch than duplicate. Mirrors the spec-017 fail-closed pipeline-capacity-gate posture.
  • 5-min TTL preserved from #1246.
  • Test-only __resetForTests() flushes the namespace.
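
Putting the bullets above together, the module shape looks roughly like this. A sketch only, not the merged code: the real functions take more arguments, and the stored value and exact Sentry tag shape are assumptions. The lazy process.env.REDIS_URL read shown here is the same read that the follow-up #1250 (further down this page) later moved to routerConfig.redisUrl.

```ts
import Redis from "ioredis";
import * as Sentry from "@sentry/node";

const NAMESPACE = "cascade:review-dedup:";
const TTL_SECONDS = 5 * 60; // preserved from #1246

let client: Redis | null = null;

// Lazy singleton, mirroring src/queue/cancel.ts: the first claim/release call
// opens the connection.
function getRedis(): Redis {
  if (!client) {
    const url = process.env.REDIS_URL;
    if (!url) throw new Error("REDIS_URL is required for review-dispatch dedup");
    client = new Redis(url);
  }
  return client;
}

export async function claimReviewDispatch(key: string): Promise<boolean> {
  try {
    const result = await getRedis().set(NAMESPACE + key, "1", "EX", TTL_SECONDS, "NX");
    return result === "OK";
  } catch (error) {
    // Fail closed: treat the dispatch as a duplicate rather than risk doubling it.
    Sentry.captureException(error, { tags: { review_dedup_redis_down: "true" } });
    return false;
  }
}

export async function releaseReviewDispatch(key: string): Promise<void> {
  try {
    await getRedis().del(NAMESPACE + key);
  } catch (error) {
    Sentry.captureException(error, { tags: { review_dedup_redis_down: "true" } });
  }
}
```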

Callers — minimal mechanical await additions:

  • check-suite-success.ts:240 — await claimReviewDispatch(...).
  • check-suite-success.ts:274 — onBlocked: () => { void releaseReviewDispatch(...) } (callback signature is sync; fire-and-forget the release — see the sketch after this list).
  • review-requested.ts:101-102 — both calls become await.
  • review-requested.ts:131 — same fire-and-forget shape.
  • agent-execution.ts:279 (post-completion-hook) — await.
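
The fire-and-forget shape those bullets describe, in isolation (makeOnBlocked is a hypothetical wrapper for illustration; releaseReviewDispatch and the sync-callback constraint are from this PR):

```ts
import { releaseReviewDispatch } from "./review-dispatch-dedup";

// onBlocked's signature stays synchronous, so the async release is floated
// with `void` rather than awaited; if it ever fails, the 5-min TTL clears the
// stale claim on its own.
function makeOnBlocked(reviewDispatchKey: string): () => void {
  return () => {
    void releaseReviewDispatch(reviewDispatchKey);
  };
}
```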

Tests

  • tests/unit/triggers/github/review-dispatch-dedup.test.ts rewritten around a vi.mock('ioredis', ...) factory whose closure captures a single in-memory store. Every new Redis(...) instance shares that backend, so the cross-process invariant is trivially testable: instantiate two IORedis clients and verify the second SET NX EX for the same key returns null. This is the regression pin for ucho/PR #194; a minimal version of the mock is sketched after this list.
  • New fails closed when Redis errors, returning false and capturing to Sentry test covers the Sentry-capture path under tag review_dedup_redis_down.
  • check-suite-success.test.ts and review-requested.test.ts: replaced the per-test recentlyDispatched.clear() with vi.mock of the dedup module, exposing mockClaimReviewDispatch / mockReleaseReviewDispatch spies. Tests that assert the dedup-skip path now use mockClaimReviewDispatch.mockResolvedValueOnce(true).mockResolvedValueOnce(false) to simulate the SET NX EX race.
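
A minimal version of that mock setup, assuming vitest's vi.hoisted/vi.mock (key format and TTL come from this PR; the test names and client URLs are illustrative):

```ts
import { beforeEach, describe, expect, it, vi } from "vitest";
import Redis from "ioredis";

// One in-memory store shared by every `new Redis(...)` the mock hands out.
// vi.hoisted makes it visible inside the hoisted vi.mock factory.
const { store } = vi.hoisted(() => ({ store: new Map<string, string>() }));

vi.mock("ioredis", () => {
  class MockRedis {
    // Approximates `SET key value EX <ttl> NX`: succeed only if the key is absent.
    async set(key: string, value: string, ..._opts: unknown[]): Promise<"OK" | null> {
      if (store.has(key)) return null;
      store.set(key, value);
      return "OK";
    }
    async del(key: string): Promise<number> {
      return store.delete(key) ? 1 : 0;
    }
  }
  return { default: MockRedis };
});

describe("cross-process review-dispatch dedup", () => {
  beforeEach(() => store.clear());

  it("lets only one of two clients claim the same PR+SHA key", async () => {
    // Two clients stand in for the IMPL worker and the router processes.
    const workerClient = new Redis("redis://unused");
    const routerClient = new Redis("redis://unused");
    const key = "cascade:review-dedup:zbigniewsobiecki/ucho:194:9ed484df";

    expect(await workerClient.set(key, "1", "EX", 300, "NX")).toBe("OK");
    expect(await routerClient.set(key, "1", "EX", 300, "NX")).toBeNull();
  });
});
```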

Out of scope (intentionally — flagged for follow-up)

  • Move work-item lock to Redis. Same class of cross-process gap, larger surface (counters, TTL semantics, cross-cutting with orphan cleanup). Defer; revisit if the same bug class recurs.
  • Backfill dedup state on router restart. Not needed — the 5-min TTL means any stale claim self-clears within minutes.
  • Cross-region Redis. Single-region deployment today.

Test plan

  • npx vitest run --project unit-triggers --project unit-api --project unit-core — 7678/7678 pass
  • npm run typecheck clean
  • npm run lint clean (13 pre-existing warnings, none on changed files)
  • CI green
  • Reviewer approval (@nhopeatall)
  • After deploy: replay the ucho/PR #194 scenario by triggering an impl agent via Linear status-change AND letting CI fire. Confirm only ONE review run is dispatched; Loki should show Review already dispatched for this PR+SHA, skipping from whichever path arrives second.

🤖 Generated with Claude Code

…s gap

ucho/PR #194 (2026-05-01, CI fully green, single SHA) had two review runs
dispatched 29 s apart — both `triggerType=ci-success`, same engine/model/
workItemId, both completed normally, both burned LLM tokens and posted
reviews. Confirmed verbatim in Loki: identical `reviewDispatchKey`
(`zbigniewsobiecki/ucho:194:9ed484df...`) appeared in claim logs from TWO
processes — one from `agent-execution.ts:279` post-completion-hook running
in the IMPL worker container, one from `check-suite-success.ts:240` running
in the cascade-router. Both `claimReviewDispatch(key, ...)` returned `true`
because the dedup `recentlyDispatched` Map at `review-dispatch-dedup.ts:5`
was module-scoped, in-memory, per-process state. Two processes = two
independent Maps = no dedup. Pure waste.

PR #1246 explicitly called this out as out-of-scope ("With the architectural
shift above, the in-memory dedup is sufficient. Skip."). That was wrong:
the architectural shift in #1246 closes the worker-bail-out path but does
nothing about the post-completion-hook path running in a different process.
The bug class is broader than ucho/PR #194 — anywhere two processes might
independently call `claimReviewDispatch` for the same (project, PR, SHA) is
exposed: post-completion-hook ↔ check-suite-success (live now),
review-requested ↔ post-completion-hook (waiting to bite), and any future
horizontally-scaled router replicas.

Architectural fix — move the dedup state to Redis so it's shared across all
processes. Atomic primitive: `SET key value NX EX <ttl>` returns `'OK'`
exactly once per key within the TTL, regardless of how many processes race
it. No application-level locking, no race window.

`src/triggers/github/review-dispatch-dedup.ts` — rewritten:
- Drops the in-memory `Map<string, number>` and `cleanupExpiredEntries`.
- Lazy IORedis singleton (mirrors `src/queue/cancel.ts:28-37`).
- `claimReviewDispatch` / `releaseReviewDispatch` are now async.
- Keys namespaced under `cascade:review-dedup:`.
- Fail-closed on Redis errors: `claim` returns `false` and Sentry-captures
  under tag `review_dedup_redis_down`. Better to skip a legit dispatch than
  duplicate. Mirrors spec-017 fail-closed pipeline-capacity-gate posture.
- 5-min TTL preserved from #1246. The pre-Redis "30-min TTL clears stale
  entries on router restart" CLAUDE.md note is now obsolete.
- Test-only `__resetForTests()` flushes the namespace.

Callers — minimal mechanical await additions:
- `check-suite-success.ts:240` — `await claimReviewDispatch(...)`.
- `check-suite-success.ts:274` — `onBlocked: () => { void releaseReviewDispatch(...) }`
  (callback signature is sync; fire-and-forget the release).
- `review-requested.ts:101-102` — both calls become `await`.
- `review-requested.ts:131` — same fire-and-forget shape.
- `agent-execution.ts:279` (post-completion-hook) — `await`.

Tests:
- `tests/unit/triggers/github/review-dispatch-dedup.test.ts` rewritten
  around a vi.mock('ioredis', ...) factory whose closure captures a single
  in-memory store. Every `new Redis(...)` instance shares that backend, so
  the cross-process invariant is trivially testable: instantiate two
  IORedis clients and verify the second `SET NX EX` for the same key
  returns `null`. **This is the regression pin for ucho/PR #194.**
- New "fails closed when Redis errors" test covers the Sentry-capture
  path under tag `review_dedup_redis_down`.
- `check-suite-success.test.ts` and `review-requested.test.ts`: replaced
  the per-test `recentlyDispatched.clear()` with `vi.mock` of the dedup
  module. Tests that assert the dedup-skip path now use
  `mockClaimReviewDispatch.mockResolvedValueOnce(true).mockResolvedValueOnce(false)`
  to simulate the SET NX EX race.

Verification:
- npx vitest run --project unit-triggers --project unit-api --project unit-core
  → 7678/7678 pass.
- npm run typecheck + npm run lint clean.
- Direct-instance test demonstrates two IORedis clients against the shared
  backend correctly reject the second claim — the production scenario
  pinned in code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator

@nhopeatall nhopeatall left a comment


Summary

Sound fix for a confirmed cross-process dedup gap. The Redis-backed SET NX EX approach is the right primitive, follows the existing src/queue/cancel.ts lazy-singleton pattern, and the fail-closed error handling aligns with the spec-017 posture established elsewhere. Tests directly pin the production incident scenario.

Should Fix

  • tests/unit/triggers/shared/agent-execution.test.ts:82,1061,1163 (not in diff) — this test file still uses mockReturnValue(true) / mockReturnValueOnce(false) for mockClaimReviewDispatch, which is now an async function. It works because await <non-Promise> wraps the value in Promise.resolve(), but it should be mockResolvedValue(true) / mockResolvedValueOnce(false) for consistency with the other two test files updated in this PR (check-suite-success.test.ts and review-requested.test.ts). Worth fixing as a drive-by in this PR to keep the mock patterns uniform.
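
In vitest terms the distinction is small (mockClaimReviewDispatch here is a hypothetical stand-in for the spy the mocked dedup module exposes):

```ts
import { vi } from "vitest";

// Stand-in for the spy exposed by the dedup-module mock.
const mockClaimReviewDispatch = vi.fn(async (_key: string) => true);

// Preferred: states the async intent and matches check-suite-success.test.ts
// and review-requested.test.ts. The older mockReturnValueOnce(false) only
// keeps passing because `await` wraps non-promises in Promise.resolve().
mockClaimReviewDispatch.mockResolvedValueOnce(false);
```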

🕵️ claude-code · claude-opus-4-6 · run details

@codecov

codecov Bot commented May 1, 2026

Codecov Report

❌ Patch coverage is 82.35294% with 12 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/triggers/github/review-dispatch-dedup.ts | 84.48% | 9 Missing ⚠️ |
| src/triggers/github/review-requested.ts | 40.00% | 3 Missing ⚠️ |


@zbigniewsobiecki zbigniewsobiecki merged commit aa33d99 into dev May 1, 2026
9 checks passed
@zbigniewsobiecki zbigniewsobiecki deleted the fix/review-dispatch-dedup-cross-process branch May 1, 2026 13:32
zbigniewsobiecki added a commit that referenced this pull request May 1, 2026
…crub (#1250)

PR #1248's Redis-backed dedup throws inside cascade-worker processes:

  ERROR Review-dispatch dedup Redis call failed — failing closed
    error: 'Error: REDIS_URL is required for review-dispatch dedup'

Confirmed live at 2026-05-01T17:25:44 in worker
`cascade-worker-coalesce_ucho_MNG-461_*`. Sentry tag: `review_dedup_redis_down`.

Root cause: `src/utils/envScrub.ts:13-18` lists REDIS_URL in
`SENSITIVE_ENV_KEYS`. Worker-entry calls `scrubSensitiveEnv()` early in
startup, which does `delete process.env.REDIS_URL` to keep infra secrets
out of agent-spawned subprocesses. The dedup module reads
`process.env.REDIS_URL` lazily at first dispatch — well after the scrub —
so it sees `undefined` and throws. The fail-closed handler returns false,
and the post-completion-hook silently logs `Skipping post-completion
review: already dispatched` (which looks identical to the working dedup's
"correctly rejected duplicate" log, hence the bug evaded the prod canary).

Why it didn't bite immediately: the post-completion-hook short-circuits
when CI isn't all passing at impl exit
(`agent-execution.ts:269-275`), so the bug only triggers on the narrow
timing where CI is fully green BEFORE impl finishes. Small fraction of
runs.

Why router is unaffected: cascade-router never calls scrubSensitiveEnv.
PR #196's review (verified 1-run earlier) was dispatched from the router
via check-suite-success — that path was never broken.

Fix: read the URL via `routerConfig.redisUrl` instead of
`process.env.REDIS_URL`. routerConfig is captured at module-load time in
`src/router/config.ts:137-138`, well before scrubSensitiveEnv runs. The
cached value survives the scrub the same way the DB pool's cached
connection string does — exactly the pattern envScrub.ts's docstring
describes.
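
The shape of that swap, roughly (the import path and getRedisUrl helper are illustrative; routerConfig.redisUrl and the module-load-time capture are from the commit):

```ts
// Before (#1248): lazy read at first dispatch — in a worker, scrubSensitiveEnv()
// has already deleted it by then.
//   const url = process.env.REDIS_URL;

// After (#1250): routerConfig is built when src/router/config.ts loads, before
// the scrub runs, so the cached value survives it.
import { routerConfig } from "../../router/config";

function getRedisUrl(): string {
  const url = routerConfig.redisUrl;
  if (!url) {
    throw new Error("REDIS_URL is required for review-dispatch dedup");
  }
  return url;
}
```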

Two-line source change + new regression-pin test that claims, deletes
process.env.REDIS_URL, and claims again — both must succeed (second
deduped) without throwing `REDIS_URL is required`.

Verification: vitest 2497/2497, typecheck clean, lint clean.

Out of scope (follow-up): `src/queue/cancel.ts` has the same
`process.env.REDIS_URL` lazy pattern but is only called from cascade-router
and cascade-dashboard (neither scrubs env), so it doesn't bite today.
Apply the same routerConfig swap for symmetry in a separate PR.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
