feat: self-healing monitor auto-retry after transient failures#302
Conversation
Schedule a single automatic re-check 35 minutes after a transient scrape failure so momentary site outages recover silently. Adds pendingRetryAt column to monitors table, scheduler pickup logic, and clears the retry on success, manual check, or after execution. https://claude.ai/code/session_013FQG661RB6D6wMie2REkL6
- Extract maybeScheduleAutoRetry helper to eliminate duplication
- Check handleMonitorFailure paused return to skip retry for paused monitors
- Fix stale lastChecked in imminence calculation (use current time)
- Move manual Check Now pendingRetryAt clear before checkMonitor call
- Await DB update in scheduler finally block instead of fire-and-forget
- Add test for auto-paused monitor skipping retry

https://claude.ai/code/session_013FQG661RB6D6wMie2REkL6
Caution: Review failed. Pull request was closed or merged during review.
📝 Walkthrough: Adds a nullable `pendingRetryAt` column to the monitors table.
Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as Client
    participant Scheduler as Scheduler
    participant Scraper as Scraper
    participant DB as Database
    participant Monitor as MonitorInstance
    rect rgba(255, 200, 100, 0.5)
        Note over Scraper,DB: Transient failure → schedule retry
        Scraper->>Monitor: perform check
        Monitor-->>Scraper: returns transient error
        Scraper->>Scraper: maybeScheduleAutoRetry()
        Scraper->>DB: update monitors set pendingRetryAt = now + 35m where id=...
        DB-->>Scraper: update result
    end
    rect rgba(150, 200, 255, 0.5)
        Note over Scheduler,DB: Scheduler triggers elapsed retry
        Scheduler->>DB: query monitors (includes pendingRetryAt)
        DB-->>Scheduler: returns monitor with pendingRetryAt <= now
        Scheduler->>Scraper: call checkMonitor(monitor)
        Scraper-->>Scheduler: check completes
    end
    rect rgba(200, 255, 150, 0.5)
        Note over Scheduler,DB: Cleanup (best-effort)
        alt any check completion
            Scheduler->>DB: update monitors set pendingRetryAt = null where id=...
            DB-->>Scheduler: update result (errors logged but non-blocking)
        end
    end
```
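The due-check decision described in the diagram can be sketched as a small pure predicate. This is an illustration only: the `MonitorRow` shape, the frequency table, and `isDue` are assumptions based on this PR's description, not the project's actual scheduler code.

```typescript
// Illustrative sketch of the scheduler's "is this monitor due?" decision.
// Field names and the frequency table are assumptions from the PR text.
interface MonitorRow {
  lastChecked: Date | null;
  frequency: "hourly" | "daily";
  pendingRetryAt: Date | null; // set to now + 35 min after a transient failure
}

function isDue(m: MonitorRow, now: Date): boolean {
  // An elapsed auto-retry makes the monitor due regardless of its cadence.
  if (m.pendingRetryAt && m.pendingRetryAt.getTime() <= now.getTime()) {
    return true;
  }
  // Otherwise fall back to the regular schedule.
  if (!m.lastChecked) return true; // never checked yet
  const intervalMs = (m.frequency === "hourly" ? 60 : 1440) * 60 * 1000;
  return now.getTime() - m.lastChecked.getTime() >= intervalMs;
}
```

Note the `<=` comparison: a retry stamped exactly "now" counts as due, which is the boundary one of the review comments below asks to cover with a test.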
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
Security note: DB update operations that clear or set `pendingRetryAt` …
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/services/scheduler.ts (1)
Lines 115-145: ⚠️ Potential issue | 🟠 Major: Re-read retry state before executing the jittered task.
`hadPendingRetry` is derived from the cron snapshot, not the current row. Because execution is delayed by up to 30 seconds, a user can hit Check Now after the scan, clear `pendingRetryAt`, and still get the stale auto-retry here. That reintroduces duplicate checks and duplicate notifications. Fetch the fresh monitor state before calling `checkMonitor()`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/services/scheduler.ts` around lines 115 - 145, The snapshot-derived hadPendingRetry can be stale; before calling checkMonitor(monitor) re-query the current monitor row from the database (using the monitors table and the monitor.id) and recompute hadPendingRetry from that fresh row's pendingRetryAt, then proceed to call checkMonitor() with the up-to-date monitor object; also ensure the finally block that clears pendingRetryAt operates against the same fresh monitor state so you don't clear or miss clearing based on stale data (use the fresh monitor's id/fields when updating).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/routes.ts`:
- Around line 589-597: The current route handler directly calls
db.update(...).set({ pendingRetryAt: null }) and catches errors, which fails
open; move this update behind the IStorage abstraction by adding a storage
method (e.g., IStorage.clearPendingRetry(monitorId) or
IStorage.setPendingRetryNull(id)) implemented in server/storage.ts that performs
the db.update on the monitors table and returns/throws on error, then call that
method from the route instead of the direct db.update; ensure the route
aborts/returns an error (does not continue with the manual check) if the storage
call fails so the armed retry remains cleared only on success.
In `@server/services/scheduler.test.ts`:
- Around line 418-446: Add a new unit test alongside the existing cases that
creates a monitor via makeMonitor with pendingRetryAt set to new Date() (exactly
"now"), mockGetAllActiveMonitors to return that monitor, then call
startScheduler(), runCron("* * * * *") and advance timers
(vi.advanceTimersByTimeAsync(31000)), and finally assert mockCheckMonitor was
called with that monitor to verify the <= now boundary; place this test next to
the existing "pendingRetryAt <= now" and "in the future" tests so it uses the
same helpers (makeMonitor, mockGetAllActiveMonitors, startScheduler, runCron,
mockCheckMonitor).
- Around line 65-72: The update mock returns a set that creates a where function
but doesn't record its args; change the mock inside the update: replace the
inline creation of whereFn with a vi.fn(...) assigned to a shared test-level
recorder (e.g., mockDbUpdateWhere or reuse mockDbUpdateSet to store whereArgs)
so every call to where(...) is captured (returning the same whereResult
Promise). Then update the tests to assert that mockDbUpdateWhere was called with
the expected predicates (in addition to existing assertions on mockDbUpdateSet)
to prevent a silent removal of the where(...) clause in scheduler.update.
- Around line 456-463: The tests call fire-and-forget work via runCheckWithLimit
(triggered by runCron/startScheduler) which clears pendingRetryAt in a finally
block, but timers advancement doesn't drain microtasks; before the assertions
that check mockCheckMonitor and mockDbUpdateSet({ pendingRetryAt: null }) add an
awaited microtask drain (e.g., await flushPromises()) so the detached async
finally block completes and the assertions observe the cleared pendingRetryAt.
In `@server/services/scraper.test.ts`:
- Around line 6450-6464: Add a complementary test to assert the inverse
boundary: create a monitor (using makeMonitor) with frequency "hourly" whose
lastChecked yields a next regular run inside the 45-minute cutoff (e.g.,
lastChecked such that next check < 45 minutes away), mock fetch to fail as in
the existing test, call runWithTimers(monitor), and then assert that
dbUpdateSetCalls does NOT include an entry with pendingRetryAt (i.e.,
pendingRetryAt was not written). Keep the same spying/mocking pattern
(vi.spyOn(globalThis, "fetch").mockRejectedValueOnce) and reuse the existing
result.status assertion to confirm error before checking absence of
pendingRetryAt.
- Around line 6392-6448: Add a test case mirroring the existing error-path tests
that simulates an SSRF/blocked target by mocking global fetch to reject with an
Error whose message includes "URL is not allowed" (or the exact SSRF-block
message your classifier emits), create a monitor via makeMonitor with
frequency/lastChecked similar to the other tests, call runWithTimers(monitor),
assert result.status === "error", and then assert dbUpdateSetCalls does NOT
contain any call where pendingRetryAt is a Date (same pattern used in the other
tests); reference the existing test helpers makeMonitor, runWithTimers and the
dbUpdateSetCalls inspection to implement the new "does NOT set pendingRetryAt
when URL is not allowed / SSRF blocked" test.
In `@server/services/scraper.ts`:
- Around line 514-557: maybeScheduleAutoRetry currently inspects errorMessage
with PERMANENT_ERROR_RE and can schedule retries for failures that
classifyHttpStatus already marks permanent (e.g., 404/410); change the call
sites to pass an explicit boolean (e.g., isRetryable or retryable) into
maybeScheduleAutoRetry and update maybeScheduleAutoRetry signature to accept
that flag instead of regexing errorMessage; inside maybeScheduleAutoRetry use
monitor.active, wasPaused, monitor.pendingRetryAt and the new retryable flag to
decide early-return, and remove/ignore PERMANENT_ERROR_RE and the
errorMessage-based check; update any call sites that previously passed
errorMessage (where classifyHttpStatus is used) to compute and pass the
retryable boolean from classifyHttpStatus and continue logging errorMessage as
before.
In `@shared/schema.ts`:
- Line 30: The monitor response currently leaks the server-only pendingRetryAt
column because routes use z.custom<typeof monitors.$inferSelect>() (which does
no runtime filtering) and storage getters getMonitorsWithTags /
getMonitorWithTags select all columns; update the response schemas in
shared/routes.ts for all monitor endpoints (list, get, create, update, setTags)
to explicitly omit pendingRetryAt (e.g. use z.custom<Omit<typeof
monitors.$inferSelect, 'pendingRetryAt'>>()) or alternatively strip
pendingRetryAt in the storage getters (getMonitorsWithTags and
getMonitorWithTags) before returning results so the field is never serialized to
clients.
---
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7bec5d18-d6cb-4682-a2f2-37d337f9efa9
📒 Files selected for processing (6)
- server/routes.ts
- server/services/scheduler.test.ts
- server/services/scheduler.ts
- server/services/scraper.test.ts
- server/services/scraper.ts
- shared/schema.ts
```ts
// Clear any pending auto-retry before the manual check to prevent
// a narrow race where the scheduler cron fires a duplicate check.
await db.update(monitors)
  .set({ pendingRetryAt: null })
  .where(eq(monitors.id, id))
  .catch((err: unknown) => {
    console.error(`[AutoRetry] Failed to clear pendingRetryAt for monitor ${id}:`,
      err instanceof Error ? err.message : err);
  });
```
Don't fail open when clearing pendingRetryAt.
If this write fails, the manual check still runs while the due retry remains armed, so the scheduler can execute the same monitor again and duplicate change rows or notifications. This guard should abort the request on failure, and the write should stay behind IStorage.
💡 Safer change
```diff
- await db.update(monitors)
-   .set({ pendingRetryAt: null })
-   .where(eq(monitors.id, id))
-   .catch((err: unknown) => {
-     console.error(`[AutoRetry] Failed to clear pendingRetryAt for monitor ${id}:`,
-       err instanceof Error ? err.message : err);
-   });
+ try {
+   await storage.updateMonitor(id, { pendingRetryAt: null });
+ } catch (err: unknown) {
+   console.error(`[AutoRetry] Failed to clear pendingRetryAt for monitor ${id}:`,
+     err instanceof Error ? err.message : err);
+   return res.status(503).json({
+     message: "Unable to start the manual check right now.",
+     code: "RETRY_STATE_UPDATE_FAILED",
+   });
+ }
```
```ts
update: vi.fn().mockReturnValue({
  set: (...args: any[]) => {
    mockDbUpdateSet(...args);
    const whereResult = Promise.resolve();
    const whereFn = vi.fn().mockReturnValue(whereResult);
    return { where: whereFn };
  },
}),
```
Capture and assert the where(...) clause for retry-clear updates.
Line 70 returns a where function, but its args are not recorded. With current assertions (Lines 462, 480), a regression that removes where(...) could still pass and hide a full-table update bug.
Suggested hardening for the DB update mock and assertions
```diff
 const {
   mockCheckMonitor,
   mockGetAllActiveMonitors,
   mockCleanupPollutedValues,
   mockDbExecute,
   mockDbUpdateSet,
+  mockDbUpdateWhere,
   cronCallbacks,
   mockMonitorsNeedingRetry,
   mockDeliverWebhook,
 } = vi.hoisted(() => ({
@@
   mockDbExecute: vi.fn().mockResolvedValue({ rowCount: 0 }),
   mockDbUpdateSet: vi.fn(),
+  mockDbUpdateWhere: vi.fn(),
@@
   update: vi.fn().mockReturnValue({
     set: (...args: any[]) => {
       mockDbUpdateSet(...args);
-      const whereResult = Promise.resolve();
-      const whereFn = vi.fn().mockReturnValue(whereResult);
+      const whereFn = vi.fn((...whereArgs: any[]) => {
+        mockDbUpdateWhere(...whereArgs);
+        return Promise.resolve();
+      });
       return { where: whereFn };
     },
   }),
@@
 expect(mockDbUpdateSet).toHaveBeenCalledWith({ pendingRetryAt: null });
+expect(mockDbUpdateWhere).toHaveBeenCalled();
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```ts
update: vi.fn().mockReturnValue({
  set: (...args: any[]) => {
    mockDbUpdateSet(...args);
    const whereFn = vi.fn((...whereArgs: any[]) => {
      mockDbUpdateWhere(...whereArgs);
      return Promise.resolve();
    });
    return { where: whereFn };
  },
}),
```
```ts
it("triggers check for monitor with pendingRetryAt <= now", async () => {
  const monitor = makeMonitor({
    frequency: "hourly",
    lastChecked: new Date(Date.now() - 30 * 60 * 1000), // 30 min ago — not normally due
    pendingRetryAt: new Date(Date.now() - 1000), // 1 second in the past
  });
  mockGetAllActiveMonitors.mockResolvedValueOnce([monitor]);

  await startScheduler();
  await runCron("* * * * *");
  await vi.advanceTimersByTimeAsync(31000);

  expect(mockCheckMonitor).toHaveBeenCalledWith(monitor);
});

it("does NOT trigger check for monitor with pendingRetryAt in the future", async () => {
  const monitor = makeMonitor({
    frequency: "hourly",
    lastChecked: new Date(Date.now() - 30 * 60 * 1000), // 30 min ago — not normally due
    pendingRetryAt: new Date(Date.now() + 30 * 60 * 1000), // 30 min in the future
  });
  mockGetAllActiveMonitors.mockResolvedValueOnce([monitor]);

  await startScheduler();
  await runCron("* * * * *");
  await vi.advanceTimersByTimeAsync(31000);

  expect(mockCheckMonitor).not.toHaveBeenCalled();
});
```
Add explicit boundary test for pendingRetryAt === now.
You cover past and future, but not the exact boundary for the <= now condition. Locking that case prevents subtle regressions.
Boundary test to add
```diff
+it("triggers check for monitor with pendingRetryAt exactly now", async () => {
+  const now = new Date();
+  vi.setSystemTime(now);
+  const monitor = makeMonitor({
+    frequency: "hourly",
+    lastChecked: new Date(now.getTime() - 30 * 60 * 1000),
+    pendingRetryAt: now,
+  });
+  mockGetAllActiveMonitors.mockResolvedValueOnce([monitor]);
+
+  await startScheduler();
+  await runCron("* * * * *");
+  await vi.advanceTimersByTimeAsync(31000);
+
+  expect(mockCheckMonitor).toHaveBeenCalledWith(monitor);
+});
```
As per coding guidelines, "Tests must cover edge cases and error paths - include assertions for edge cases (empty inputs, boundary values, null/undefined)".
```ts
  await startScheduler();
  await runCron("* * * * *");
  await vi.advanceTimersByTimeAsync(31000);

  expect(mockCheckMonitor).toHaveBeenCalledWith(monitor);
  // The finally block should clear pendingRetryAt
  expect(mockDbUpdateSet).toHaveBeenCalledWith({ pendingRetryAt: null });
});
```
🧩 Analysis chain: verification scripts were run against bd73-com/fetchthechange, inspecting server/services/scheduler.ts and server/services/scheduler.test.ts (script output omitted).
Add await flushPromises() before assertions to prevent race conditions with detached async work.
Both tests at lines 456–463 (success path) and 474–480 (failure path) invoke void runCheckWithLimit(), which schedules a fire-and-forget async task that clears pendingRetryAt in its finally block. The vi.advanceTimersByTimeAsync() call advances timers but does not drain the microtask queue, so the clear may not execute before the assertions run, causing intermittent failures.
Add microtask flush before assertions
```diff
 await startScheduler();
 await runCron("* * * * *");
 await vi.advanceTimersByTimeAsync(31000);
+await flushPromises();
 expect(mockCheckMonitor).toHaveBeenCalledWith(monitor);
 // The finally block should clear pendingRetryAt
 expect(mockDbUpdateSet).toHaveBeenCalledWith({ pendingRetryAt: null });
@@
 await startScheduler();
 await runCron("* * * * *");
 await vi.advanceTimersByTimeAsync(31000);
+await flushPromises();
 expect(mockCheckMonitor).toHaveBeenCalledWith(monitor);
 // Even on failure, pendingRetryAt should be cleared
 expect(mockDbUpdateSet).toHaveBeenCalledWith({ pendingRetryAt: null });
```
```ts
it("sets pendingRetryAt for hourly monitors (next check > 45 min after failure)", async () => {
  vi.spyOn(globalThis, "fetch").mockRejectedValueOnce(new Error("Network failure"));

  // hourly monitor — after failure, handleMonitorFailure updates lastChecked to now,
  // so next normal check is 60 min away (> 45 min threshold)
  const monitor = makeMonitor({
    frequency: "hourly",
    lastChecked: new Date(Date.now() - 30 * 60 * 1000),
  });
  const result = await runWithTimers(monitor);

  expect(result.status).toBe("error");
  const retrySetCall = dbUpdateSetCalls.find((c: any) => c.pendingRetryAt instanceof Date);
  expect(retrySetCall).toBeDefined();
});
```
Lock down the cutoff with the inverse boundary case.
This only proves the > 45 min branch. The suite still misses the branch where the next regular run is already inside the cutoff, which is the timing boundary this PR just changed. Add a supported cadence that lands under that threshold and assert that pendingRetryAt is not written.
As per coding guidelines **/*.test.ts: Tests must cover edge cases and error paths - include assertions for edge cases (empty inputs, boundary values, null/undefined).
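The cutoff decision under test reduces to simple arithmetic. The strict `> 45` comparison below mirrors the PR's `minsUntilNormal > 45` check; the helper name and parameterization are illustrative, not the project's real API.

```typescript
// Sketch of the imminence cutoff: arm an auto-retry only when the next
// regular run is further away than the 45-minute threshold.
const RETRY_CUTOFF_MIN = 45;

function shouldArmRetry(minsUntilNextRegularRun: number): boolean {
  // Strict comparison: a run exactly at the cutoff counts as imminent.
  return minsUntilNextRegularRun > RETRY_CUTOFF_MIN;
}
```

Under this PR's cadences, hourly (60 min) always arms a retry, while any cadence landing at or under 45 minutes would not. That skipped branch is the inverse case the review asks to lock down.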
server/services/scraper.ts
Outdated
```ts
/** Permanent error patterns that should never trigger auto-retry. */
const PERMANENT_ERROR_RE = /ENOTFOUND|certificate|ssl|tls|SSRF blocked|Could not resolve|SSL\/TLS error|URL is not allowed/i;

/**
 * Schedule a single auto-retry 35 minutes from now for transient scrape errors.
 * Skips if the monitor was just paused, has a pending retry, the error is permanent,
 * or the next normal check is imminent (within 45 minutes).
 */
async function maybeScheduleAutoRetry(
  monitor: Monitor,
  errorMessage: string,
  wasPaused: boolean,
): Promise<void> {
  if (
    !monitor.active ||
    wasPaused ||
    monitor.pendingRetryAt ||
    PERMANENT_ERROR_RE.test(errorMessage)
  ) {
    return;
  }

  const frequencyMinutes = monitor.frequency === "hourly" ? 60 : 1440;
  // Use Date.now() as the effective lastChecked since handleMonitorFailure
  // already updated it in the DB — avoids using the stale in-memory value.
  const minsUntilNormal = frequencyMinutes - 0; // just checked → full interval ahead
  // More precisely: since we just failed, lastChecked was just set to now,
  // so the next normal check is ~frequencyMinutes from now.

  if (minsUntilNormal > 45) {
    try {
      const retryAt = new Date(Date.now() + 35 * 60 * 1000);
      await db.update(monitors)
        .set({ pendingRetryAt: retryAt })
        .where(eq(monitors.id, monitor.id));
      console.log(`[AutoRetry] Monitor ${monitor.id} — retry scheduled at ${retryAt.toISOString()}`);
    } catch (err) {
      console.error(`[AutoRetry] Failed to set pendingRetryAt for monitor ${monitor.id}:`,
        err instanceof Error ? err.message : err);
    }
  } else {
    console.log(`[AutoRetry] Monitor ${monitor.id} — skipped (next normal check in ${Math.round(minsUntilNormal)} min)`);
  }
}
```
Permanent HTTP failures still schedule auto-retry.
`maybeScheduleAutoRetry()` reclassifies failures from `errorMessage`, but `classifyHttpStatus()` already has the structured transient/permanent decision. Messages like `Page not found (HTTP 404)` and `Page no longer exists (HTTP 410)` do not match `PERMANENT_ERROR_RE`, so those permanent failures still set `pendingRetryAt`. Pass an explicit retryable flag into this helper instead of regexing strings.
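A minimal sketch of the suggested refactor: the call site computes retryability from the structured classification (for example from `classifyHttpStatus`) and the gate stops regexing message strings. The input shape here is an illustrative assumption, not the project's real signature.

```typescript
// Sketch: early-return gate driven by an explicit retryable flag instead of
// PERMANENT_ERROR_RE. Types are illustrative, not the project's actual code.
interface RetryGateInput {
  active: boolean;
  wasPaused: boolean;
  hasPendingRetry: boolean;
  retryable: boolean; // computed at the call site, e.g. from classifyHttpStatus()
}

function shouldScheduleAutoRetry(input: RetryGateInput): boolean {
  return (
    input.active &&
    !input.wasPaused &&
    !input.hasPendingRetry &&
    input.retryable // HTTP 404/410 etc. arrive here as retryable: false
  );
}
```

With this shape, a 404 can never slip through a regex gap: the permanent/transient decision is made once, where the status code is known.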
```ts
pauseReason: text("pause_reason"),
healthAlertSentAt: timestamp("health_alert_sent_at"),
lastHealthyAt: timestamp("last_healthy_at"),
pendingRetryAt: timestamp("pending_retry_at"),
```
🧩 Analysis chain: verification scripts were run against bd73-com/fetchthechange, inspecting shared/routes.ts, server/storage.ts, server/routes.ts, and shared/schema.ts (script output omitted).
Critical: pendingRetryAt leaks in monitor GET/list responses due to missing response schema omit.
The z.custom<typeof monitors.$inferSelect>() response schemas at lines 131, 139, 149, 159, and 222 in shared/routes.ts provide TypeScript type hints only—they perform zero runtime validation or field filtering. Since pendingRetryAt is now in the monitors table schema, monitors.$inferSelect includes it. The storage getters (getMonitorsWithTags, getMonitorWithTags) execute bare db.select().from(monitors) queries that retrieve all columns, and the route handlers serialize these directly via res.json() without filtering. The field will be exposed to API clients despite being designated server-managed and omitted from insertMonitorSchema.
Update response schemas in shared/routes.ts to explicitly exclude `pendingRetryAt`:

```ts
200: z.custom<Omit<typeof monitors.$inferSelect, 'pendingRetryAt'>>()
```

Or filter in storage getters before returning. Verify all monitor response endpoints (list, get, create, update, setTags).
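If the storage-getter route is taken, the strip can be a simple destructure before rows leave the storage layer. This is a hedged sketch with illustrative types; `MonitorRow` and `toPublicMonitor` are not the project's real names.

```typescript
// Strip the server-only field before a monitor row is serialized to clients.
// The row shape is illustrative; the real table has many more columns.
interface MonitorRow {
  id: string;
  url: string;
  pendingRetryAt: Date | null; // server-managed; must not reach API clients
}

type PublicMonitor = Omit<MonitorRow, "pendingRetryAt">;

function toPublicMonitor(row: MonitorRow): PublicMonitor {
  // Destructure the private key away; the rest is safe to serialize.
  const { pendingRetryAt, ...publicFields } = row;
  return publicFields;
}
```

Stripping in the getters fixes every endpoint at once, whereas the `Omit` in `z.custom<...>()` is compile-time only and does not remove the field at runtime.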
- Add HTTP 4xx permanent error patterns (401, 403, 404, 410) to PERMANENT_ERROR_RE so monitors failing with permanent HTTP status codes don't schedule unnecessary auto-retries
- Add test: SSRF-blocked URLs don't trigger auto-retry
- Add test: HTTP 404 permanent errors don't trigger auto-retry

https://claude.ai/code/session_013FQG661RB6D6wMie2REkL6
Summary
When a scheduled monitor check fails due to a transient error (timeout, network blip), the monitor previously entered an Error health state and waited up to 24 hours for the next normally-scheduled check. This PR adds a single automatic re-check 35 minutes after a transient scrape failure so momentary site outages recover silently without user intervention.
Changes
Schema
- `pendingRetryAt` timestamp column added to the `monitors` table (`shared/schema.ts`)
- Server-managed: omitted from `insertMonitorSchema` (`/api/v1/`)

Scraper (`server/services/scraper.ts`)
- New `maybeScheduleAutoRetry()` helper that sets `pendingRetryAt` 35 min in the future after transient failures, called from `handleMonitorFailure`
- Checks `handleMonitorFailure`'s `paused` return value to avoid orphaning retries on paused monitors
- Sets `pendingRetryAt: null` on successful check via the existing `storage.updateMonitor` call
- New `PERMANENT_ERROR_RE` constant to centralize the permanent-error regex pattern

Scheduler (`server/services/scheduler.ts`)
- Treats monitors with `pendingRetryAt <= now` as due for check
- `runCheckWithLimit` captures pending-retry state before the check and clears `pendingRetryAt` in a `finally` block (awaited, not fire-and-forget)
- Reuses the existing `checkMonitor` code path — no separate scraper logic

Routes (`server/routes.ts`)
- Manual check (`POST /api/monitors/:id/check`) clears `pendingRetryAt` before running the check to prevent a race with the scheduler cron

Key design constraints
- `consecutiveFailures` always increments — retry is recovery, not suppression

How to test
- Run `npm run schema:push` to add the `pending_retry_at` column
- After a transient failure, verify `pending_retry_at` is set ~35 minutes in the future in the DB
- Once the retry executes, `pending_retry_at` is cleared and `[AutoRetry]` log messages appear
- For permanent errors, `pending_retry_at` remains NULL (no retry for non-transient errors)
- After a manual check, `pending_retry_at` is cleared
- `npm run check && npm run test` — all 1910 tests pass

https://claude.ai/code/session_013FQG661RB6D6wMie2REkL6