
feat: self-healing monitor auto-retry after transient failures #302

Merged
bd73-com merged 3 commits into main from claude/monitor-auto-retry-mQxkV
Apr 1, 2026

Conversation

bd73-com (Owner) commented Apr 1, 2026

Summary

When a scheduled monitor check fails due to a transient error (timeout, network blip), the monitor previously entered an Error health state and waited up to 24 hours for the next normally-scheduled check. This PR adds a single automatic re-check 35 minutes after a transient scrape failure so momentary site outages recover silently without user intervention.

Changes

Schema

  • Added nullable pendingRetryAt timestamp column to the monitors table (shared/schema.ts)
  • Omitted from insert schema (server-managed only, never exposed via /api/v1/)
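
As a rough sketch only (the real table in shared/schema.ts has many more columns, elided here; the imports assume the project's standard Drizzle pg-core setup), the schema change might look like:

```typescript
import { pgTable, serial, timestamp } from "drizzle-orm/pg-core";

export const monitors = pgTable("monitors", {
  id: serial("id").primaryKey(),
  // ...existing columns elided...
  // Server-managed retry marker; nullable by default in Drizzle, and
  // omitted from the insert schema so clients can never set it directly.
  pendingRetryAt: timestamp("pending_retry_at"),
});
```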

Scraper (server/services/scraper.ts)

  • Added maybeScheduleAutoRetry() helper that sets pendingRetryAt 35 min in the future after transient failures
  • Guards: skips permanent errors (ENOTFOUND, certificate/SSL, SSRF), inactive monitors, already-pending retries, and monitors just auto-paused by handleMonitorFailure
  • Checks handleMonitorFailure's paused return value to avoid orphaning retries on paused monitors
  • Clears pendingRetryAt (sets it to null) on successful check via the existing storage.updateMonitor call
  • Added PERMANENT_ERROR_RE constant to centralize the permanent-error regex pattern
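
The guard sequence above can be sketched as a pure predicate. This is a simplified stand-in, not the actual server/services/scraper.ts code: the real helper also weighs how soon the next regularly scheduled check is (the 45-minute imminence cutoff exercised in the tests), which is omitted here.

```typescript
// Pattern taken verbatim from the PR; matches errors that should never retry.
const PERMANENT_ERROR_RE =
  /ENOTFOUND|certificate|ssl|tls|SSRF blocked|Could not resolve|SSL\/TLS error|URL is not allowed/i;

interface RetryGuardInput {
  active: boolean;
  pendingRetryAt: Date | null;
}

// Returns true only when every guard passes: the monitor is active, was not
// just auto-paused by handleMonitorFailure, has no retry already armed, and
// the error message is not classified as permanent.
function shouldScheduleAutoRetry(
  monitor: RetryGuardInput,
  errorMessage: string,
  wasPaused: boolean,
): boolean {
  if (!monitor.active) return false;
  if (wasPaused) return false;
  if (monitor.pendingRetryAt !== null) return false;
  if (PERMANENT_ERROR_RE.test(errorMessage)) return false;
  return true;
}

// The retry fires 35 minutes after the failed check.
const RETRY_DELAY_MS = 35 * 60 * 1000;

function computeRetryAt(now: Date): Date {
  return new Date(now.getTime() + RETRY_DELAY_MS);
}
```

Classifying by message regex is what this PR ships; one review comment below suggests instead passing an explicit retryable flag computed by classifyHttpStatus into the helper.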

Scheduler (server/services/scheduler.ts)

  • Main cron loop picks up monitors with pendingRetryAt <= now as due for check
  • runCheckWithLimit captures pending-retry state before check and clears pendingRetryAt in finally block (awaited, not fire-and-forget)
  • Uses identical checkMonitor code path — no separate scraper logic
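
The pickup rule can be illustrated with a minimal due-check predicate. This is illustrative only: intervalMs is a hypothetical normalization of the monitor's frequency string ("hourly" etc.), and the real cron loop runs eligible monitors through runCheckWithLimit with up to 30 seconds of jitter.

```typescript
interface DueCheckInput {
  lastChecked: Date | null;
  intervalMs: number; // hypothetical: frequency normalized to milliseconds
  pendingRetryAt: Date | null;
}

function isDueForCheck(m: DueCheckInput, now: Date): boolean {
  // An elapsed pendingRetryAt makes the monitor due immediately,
  // regardless of its regular cadence (the <= comparison matters:
  // pendingRetryAt exactly equal to now also counts as due).
  if (m.pendingRetryAt !== null && m.pendingRetryAt.getTime() <= now.getTime()) {
    return true;
  }
  // Otherwise fall back to the normal schedule.
  if (m.lastChecked === null) return true;
  return now.getTime() - m.lastChecked.getTime() >= m.intervalMs;
}
```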

Routes (server/routes.ts)

  • Manual "Check Now" (POST /api/monitors/:id/check) clears pendingRetryAt before running the check to prevent race with scheduler cron

Key design constraints

  • Single retry only — if retry also fails, no further auto-retries until next scheduled check
  • consecutiveFailures always increments — retry is recovery, not suppression
  • No UI changes — invisible to users except errors self-resolve faster

How to test

  1. Run npm run schema:push to add the pending_retry_at column
  2. Create a monitor pointing to a URL that will time out (e.g., a URL with a very slow response)
  3. Wait for the scheduled check to fail — verify pending_retry_at is set ~35 minutes in the future in the DB
  4. Wait for the retry to fire — verify pending_retry_at is cleared and [AutoRetry] log messages appear
  5. Create a monitor with a broken selector — verify pending_retry_at remains NULL (no retry for non-transient errors)
  6. Click "Check Now" on a monitor with a pending retry — verify pending_retry_at is cleared
  7. Run npm run check && npm run test — all 1910 tests pass

https://claude.ai/code/session_013FQG661RB6D6wMie2REkL6

Summary by CodeRabbit

  • New Features
    • Transient monitor failures now schedule a one-time auto-retry ~35 minutes later; the system records a pending retry timestamp.
  • Behavior Changes / Bug Fixes
    • Manual checks and scheduled retry attempts clear any pending retry marker after the attempt (cleared on both success and failure).
    • Auto-retries are skipped for inactive monitors and permanent/non-retryable failures.
  • Tests
    • Added extensive tests covering auto-retry scheduling, clearing, and edge cases.

claude added 2 commits April 1, 2026 07:38
Schedule a single automatic re-check 35 minutes after a transient
scrape failure so momentary site outages recover silently. Adds
pendingRetryAt column to monitors table, scheduler pickup logic,
and clears the retry on success, manual check, or after execution.

https://claude.ai/code/session_013FQG661RB6D6wMie2REkL6
- Extract maybeScheduleAutoRetry helper to eliminate duplication
- Check handleMonitorFailure paused return to skip retry for paused monitors
- Fix stale lastChecked in imminence calculation (use current time)
- Move manual Check Now pendingRetryAt clear before checkMonitor call
- Await DB update in scheduler finally block instead of fire-and-forget
- Add test for auto-paused monitor skipping retry

https://claude.ai/code/session_013FQG661RB6D6wMie2REkL6
@github-actions github-actions bot added the feature New feature label Apr 1, 2026

coderabbitai bot commented Apr 1, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Adds a nullable pendingRetryAt column to monitors and implements an auto-retry flow: transient failures schedule a one-time retry (~35 minutes), the scheduler treats elapsed pendingRetryAt as eligible for an immediate check, and both scheduler and manual checks best-effort clear pendingRetryAt after attempts.

Changes

  • Schema (shared/schema.ts): Added nullable pendingRetryAt timestamp (db column pending_retry_at) and omitted it from the insert schema.
  • Scraper, auto-retry scheduling & clearing (server/services/scraper.ts): Introduced maybeScheduleAutoRetry(...) to write pendingRetryAt = now + ~35m for transient errors; clears pendingRetryAt on successful check; writes use db.update(monitors).set(...).where(eq(monitors.id, ...)) with logged errors.
  • Scheduler, detection & clearing (server/services/scheduler.ts): Per-minute loop now treats monitors with pendingRetryAt <= now as eligible for check; runCheckWithLimit attempts to clear pendingRetryAt in a finally block (errors logged, non-blocking).
  • Manual check endpoint (server/routes.ts): Authenticated api.monitors.check route attempts to clear pendingRetryAt (best-effort) before performing a manual check; DB errors are caught and logged.
  • Tests, scraper & scheduler (server/services/scraper.test.ts, server/services/scheduler.test.ts): Extended test helpers to include pendingRetryAt; added suites verifying scheduling on transient errors, non-scheduling for permanent/non-retryable cases, scheduler triggering on elapsed pendingRetryAt, and clearing of the flag on success/failure.
  • Mocks/support (scheduler.test.ts mocks, scraper.test.ts mockDbUpdateChain): Updated DB mocks to support db.update().set(...).where(...) chains and added helpers to capture pendingRetryAt writes.

Sequence Diagram

sequenceDiagram
    participant Client as Client
    participant Scheduler as Scheduler
    participant Scraper as Scraper
    participant DB as Database
    participant Monitor as MonitorInstance

    rect rgba(255, 200, 100, 0.5)
    Note over Scraper,DB: Transient failure → schedule retry
    Scraper->>Monitor: perform check
    Monitor-->>Scraper: returns transient error
    Scraper->>Scraper: maybeScheduleAutoRetry()
    Scraper->>DB: update monitors set pendingRetryAt = now + 35m where id=...
    DB-->>Scraper: update result
    end

    rect rgba(150, 200, 255, 0.5)
    Note over Scheduler,DB: Scheduler triggers elapsed retry
    Scheduler->>DB: query monitors (includes pendingRetryAt)
    DB-->>Scheduler: returns monitor with pendingRetryAt <= now
    Scheduler->>Scraper: call checkMonitor(monitor)
    Scraper-->>Scheduler: check completes
    end

    rect rgba(200, 255, 150, 0.5)
    Note over Scheduler,DB: Cleanup (best-effort)
    alt any check completion
        Scheduler->>DB: update monitors set pendingRetryAt = null where id=...
        DB-->>Scheduler: update result (errors logged but non-blocking)
    end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Security note: DB update operations that clear or set pendingRetryAt are best-effort and only log failures; consider ensuring monitoring/alerting for repeated update failures to avoid silent retry-state drift.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Coverage is 45.45%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled)
  • Title Check: ✅ Passed. The title clearly and specifically describes the main feature (automatic retry of monitors after transient failures), matching the changeset's primary objective.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/services/scheduler.ts (1)

115-145: ⚠️ Potential issue | 🟠 Major

Re-read retry state before executing the jittered task.

hadPendingRetry is derived from the cron snapshot, not the current row. Because execution is delayed by up to 30 seconds, a user can hit Check Now after the scan, clear pendingRetryAt, and still get the stale auto-retry here. That reintroduces duplicate checks and duplicate notifications. Fetch the fresh monitor state before calling checkMonitor().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/services/scheduler.ts` around lines 115 - 145, The snapshot-derived
hadPendingRetry can be stale; before calling checkMonitor(monitor) re-query the
current monitor row from the database (using the monitors table and the
monitor.id) and recompute hadPendingRetry from that fresh row's pendingRetryAt,
then proceed to call checkMonitor() with the up-to-date monitor object; also
ensure the finally block that clears pendingRetryAt operates against the same
fresh monitor state so you don't clear or miss clearing based on stale data (use
the fresh monitor's id/fields when updating).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/routes.ts`:
- Around line 589-597: The current route handler directly calls
db.update(...).set({ pendingRetryAt: null }) and catches errors, which fails
open; move this update behind the IStorage abstraction by adding a storage
method (e.g., IStorage.clearPendingRetry(monitorId) or
IStorage.setPendingRetryNull(id)) implemented in server/storage.ts that performs
the db.update on the monitors table and returns/throws on error, then call that
method from the route instead of the direct db.update; ensure the route
aborts/returns an error (does not continue with the manual check) if the storage
call fails so the armed retry remains cleared only on success.

In `@server/services/scheduler.test.ts`:
- Around line 418-446: Add a new unit test alongside the existing cases that
creates a monitor via makeMonitor with pendingRetryAt set to new Date() (exactly
"now"), mockGetAllActiveMonitors to return that monitor, then call
startScheduler(), runCron("* * * * *") and advance timers
(vi.advanceTimersByTimeAsync(31000)), and finally assert mockCheckMonitor was
called with that monitor to verify the <= now boundary; place this test next to
the existing "pendingRetryAt <= now" and "in the future" tests so it uses the
same helpers (makeMonitor, mockGetAllActiveMonitors, startScheduler, runCron,
mockCheckMonitor).
- Around line 65-72: The update mock returns a set that creates a where function
but doesn't record its args; change the mock inside the update: replace the
inline creation of whereFn with a vi.fn(...) assigned to a shared test-level
recorder (e.g., mockDbUpdateWhere or reuse mockDbUpdateSet to store whereArgs)
so every call to where(...) is captured (returning the same whereResult
Promise). Then update the tests to assert that mockDbUpdateWhere was called with
the expected predicates (in addition to existing assertions on mockDbUpdateSet)
to prevent a silent removal of the where(...) clause in scheduler.update.
- Around line 456-463: The tests call fire-and-forget work via runCheckWithLimit
(triggered by runCron/startScheduler) which clears pendingRetryAt in a finally
block, but timers advancement doesn't drain microtasks; before the assertions
that check mockCheckMonitor and mockDbUpdateSet({ pendingRetryAt: null }) add an
awaited microtask drain (e.g., await flushPromises()) so the detached async
finally block completes and the assertions observe the cleared pendingRetryAt.

In `@server/services/scraper.test.ts`:
- Around line 6450-6464: Add a complementary test to assert the inverse
boundary: create a monitor (using makeMonitor) with frequency "hourly" whose
lastChecked yields a next regular run inside the 45-minute cutoff (e.g.,
lastChecked such that next check < 45 minutes away), mock fetch to fail as in
the existing test, call runWithTimers(monitor), and then assert that
dbUpdateSetCalls does NOT include an entry with pendingRetryAt (i.e.,
pendingRetryAt was not written). Keep the same spying/mocking pattern
(vi.spyOn(globalThis, "fetch").mockRejectedValueOnce) and reuse the existing
result.status assertion to confirm error before checking absence of
pendingRetryAt.
- Around line 6392-6448: Add a test case mirroring the existing error-path tests
that simulates an SSRF/blocked target by mocking global fetch to reject with an
Error whose message includes "URL is not allowed" (or the exact SSRF-block
message your classifier emits), create a monitor via makeMonitor with
frequency/lastChecked similar to the other tests, call runWithTimers(monitor),
assert result.status === "error", and then assert dbUpdateSetCalls does NOT
contain any call where pendingRetryAt is a Date (same pattern used in the other
tests); reference the existing test helpers makeMonitor, runWithTimers and the
dbUpdateSetCalls inspection to implement the new "does NOT set pendingRetryAt
when URL is not allowed / SSRF blocked" test.

In `@server/services/scraper.ts`:
- Around line 514-557: maybeScheduleAutoRetry currently inspects errorMessage
with PERMANENT_ERROR_RE and can schedule retries for failures that
classifyHttpStatus already marks permanent (e.g., 404/410); change the call
sites to pass an explicit boolean (e.g., isRetryable or retryable) into
maybeScheduleAutoRetry and update maybeScheduleAutoRetry signature to accept
that flag instead of regexing errorMessage; inside maybeScheduleAutoRetry use
monitor.active, wasPaused, monitor.pendingRetryAt and the new retryable flag to
decide early-return, and remove/ignore PERMANENT_ERROR_RE and the
errorMessage-based check; update any call sites that previously passed
errorMessage (where classifyHttpStatus is used) to compute and pass the
retryable boolean from classifyHttpStatus and continue logging errorMessage as
before.

In `@shared/schema.ts`:
- Line 30: The monitor response currently leaks the server-only pendingRetryAt
column because routes use z.custom<typeof monitors.$inferSelect>() (which does
no runtime filtering) and storage getters getMonitorsWithTags /
getMonitorWithTags select all columns; update the response schemas in
shared/routes.ts for all monitor endpoints (list, get, create, update, setTags)
to explicitly omit pendingRetryAt (e.g. use z.custom<Omit<typeof
monitors.$inferSelect, 'pendingRetryAt'>>()) or alternatively strip
pendingRetryAt in the storage getters (getMonitorsWithTags and
getMonitorWithTags) before returning results so the field is never serialized to
clients.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7bec5d18-d6cb-4682-a2f2-37d337f9efa9

📥 Commits

Reviewing files that changed from the base of the PR and between 5f2355f and 51feead.

📒 Files selected for processing (6)
  • server/routes.ts
  • server/services/scheduler.test.ts
  • server/services/scheduler.ts
  • server/services/scraper.test.ts
  • server/services/scraper.ts
  • shared/schema.ts

Comment on lines +589 to +597
// Clear any pending auto-retry before the manual check to prevent
// a narrow race where the scheduler cron fires a duplicate check.
await db.update(monitors)
.set({ pendingRetryAt: null })
.where(eq(monitors.id, id))
.catch((err: unknown) => {
console.error(`[AutoRetry] Failed to clear pendingRetryAt for monitor ${id}:`,
err instanceof Error ? err.message : err);
});

⚠️ Potential issue | 🟠 Major

Don't fail open when clearing pendingRetryAt.

If this write fails, the manual check still runs while the due retry remains armed, so the scheduler can execute the same monitor again and duplicate change rows or notifications. This guard should abort the request on failure, and the write should stay behind IStorage.

💡 Safer change
-      await db.update(monitors)
-        .set({ pendingRetryAt: null })
-        .where(eq(monitors.id, id))
-        .catch((err: unknown) => {
-          console.error(`[AutoRetry] Failed to clear pendingRetryAt for monitor ${id}:`,
-            err instanceof Error ? err.message : err);
-        });
+      try {
+        await storage.updateMonitor(id, { pendingRetryAt: null });
+      } catch (err: unknown) {
+        console.error(`[AutoRetry] Failed to clear pendingRetryAt for monitor ${id}:`,
+          err instanceof Error ? err.message : err);
+        return res.status(503).json({
+          message: "Unable to start the manual check right now.",
+          code: "RETRY_STATE_UPDATE_FAILED",
+        });
+      }
As per coding guidelines, `Never put database queries or Drizzle ORM calls directly in route handlers — all database access must go through methods on the IStorage interface implemented in server/storage.ts`.

Comment on lines +65 to +72
update: vi.fn().mockReturnValue({
set: (...args: any[]) => {
mockDbUpdateSet(...args);
const whereResult = Promise.resolve();
const whereFn = vi.fn().mockReturnValue(whereResult);
return { where: whereFn };
},
}),

⚠️ Potential issue | 🟠 Major

Capture and assert the where(...) clause for retry-clear updates.

Line 70 returns a where function, but its args are not recorded. With current assertions (Lines 462, 480), a regression that removes where(...) could still pass and hide a full-table update bug.

Suggested hardening for the DB update mock and assertions
 const {
   mockCheckMonitor,
   mockGetAllActiveMonitors,
   mockCleanupPollutedValues,
   mockDbExecute,
   mockDbUpdateSet,
+  mockDbUpdateWhere,
   cronCallbacks,
   mockMonitorsNeedingRetry,
   mockDeliverWebhook,
 } = vi.hoisted(() => ({
@@
   mockDbExecute: vi.fn().mockResolvedValue({ rowCount: 0 }),
   mockDbUpdateSet: vi.fn(),
+  mockDbUpdateWhere: vi.fn(),
@@
     update: vi.fn().mockReturnValue({
       set: (...args: any[]) => {
         mockDbUpdateSet(...args);
-        const whereResult = Promise.resolve();
-        const whereFn = vi.fn().mockReturnValue(whereResult);
+        const whereFn = vi.fn((...whereArgs: any[]) => {
+          mockDbUpdateWhere(...whereArgs);
+          return Promise.resolve();
+        });
         return { where: whereFn };
       },
     }),
 expect(mockDbUpdateSet).toHaveBeenCalledWith({ pendingRetryAt: null });
+expect(mockDbUpdateWhere).toHaveBeenCalled();

Comment on lines +418 to +446
it("triggers check for monitor with pendingRetryAt <= now", async () => {
const monitor = makeMonitor({
frequency: "hourly",
lastChecked: new Date(Date.now() - 30 * 60 * 1000), // 30 min ago — not normally due
pendingRetryAt: new Date(Date.now() - 1000), // 1 second in the past
});
mockGetAllActiveMonitors.mockResolvedValueOnce([monitor]);

await startScheduler();
await runCron("* * * * *");
await vi.advanceTimersByTimeAsync(31000);

expect(mockCheckMonitor).toHaveBeenCalledWith(monitor);
});

it("does NOT trigger check for monitor with pendingRetryAt in the future", async () => {
const monitor = makeMonitor({
frequency: "hourly",
lastChecked: new Date(Date.now() - 30 * 60 * 1000), // 30 min ago — not normally due
pendingRetryAt: new Date(Date.now() + 30 * 60 * 1000), // 30 min in the future
});
mockGetAllActiveMonitors.mockResolvedValueOnce([monitor]);

await startScheduler();
await runCron("* * * * *");
await vi.advanceTimersByTimeAsync(31000);

expect(mockCheckMonitor).not.toHaveBeenCalled();
});

⚠️ Potential issue | 🟡 Minor

Add explicit boundary test for pendingRetryAt === now.

You cover past and future, but not the exact boundary for the <= now condition. Locking that case prevents subtle regressions.

Boundary test to add
+  it("triggers check for monitor with pendingRetryAt exactly now", async () => {
+    const now = new Date();
+    vi.setSystemTime(now);
+    const monitor = makeMonitor({
+      frequency: "hourly",
+      lastChecked: new Date(now.getTime() - 30 * 60 * 1000),
+      pendingRetryAt: now,
+    });
+    mockGetAllActiveMonitors.mockResolvedValueOnce([monitor]);
+
+    await startScheduler();
+    await runCron("* * * * *");
+    await vi.advanceTimersByTimeAsync(31000);
+
+    expect(mockCheckMonitor).toHaveBeenCalledWith(monitor);
+  });

As per coding guidelines, "Tests must cover edge cases and error paths - include assertions for edge cases (empty inputs, boundary values, null/undefined)".


Comment on lines +456 to +463
await startScheduler();
await runCron("* * * * *");
await vi.advanceTimersByTimeAsync(31000);

expect(mockCheckMonitor).toHaveBeenCalledWith(monitor);
// The finally block should clear pendingRetryAt
expect(mockDbUpdateSet).toHaveBeenCalledWith({ pendingRetryAt: null });
});

⚠️ Potential issue | 🟠 Major

Add await flushPromises() before assertions to prevent race conditions with detached async work.

Both tests at lines 456–463 (success path) and 474–480 (failure path) invoke void runCheckWithLimit(), which schedules a fire-and-forget async task that clears pendingRetryAt in its finally block. The vi.advanceTimersByTimeAsync() call advances timers but does not drain the microtask queue, so the clear may not execute before the assertions run, causing intermittent failures.

Add microtask flush before assertions
   await startScheduler();
   await runCron("* * * * *");
   await vi.advanceTimersByTimeAsync(31000);
+  await flushPromises();

   expect(mockCheckMonitor).toHaveBeenCalledWith(monitor);
   // The finally block should clear pendingRetryAt
   expect(mockDbUpdateSet).toHaveBeenCalledWith({ pendingRetryAt: null });
@@
   await startScheduler();
   await runCron("* * * * *");
   await vi.advanceTimersByTimeAsync(31000);
+  await flushPromises();

   expect(mockCheckMonitor).toHaveBeenCalledWith(monitor);
   // Even on failure, pendingRetryAt should be cleared
   expect(mockDbUpdateSet).toHaveBeenCalledWith({ pendingRetryAt: null });

Comment on lines +6450 to +6464
it("sets pendingRetryAt for hourly monitors (next check > 45 min after failure)", async () => {
  vi.spyOn(globalThis, "fetch").mockRejectedValueOnce(new Error("Network failure"));

  // hourly monitor — after failure, handleMonitorFailure updates lastChecked to now,
  // so next normal check is 60 min away (> 45 min threshold)
  const monitor = makeMonitor({
    frequency: "hourly",
    lastChecked: new Date(Date.now() - 30 * 60 * 1000),
  });
  const result = await runWithTimers(monitor);

  expect(result.status).toBe("error");
  const retrySetCall = dbUpdateSetCalls.find((c: any) => c.pendingRetryAt instanceof Date);
  expect(retrySetCall).toBeDefined();
});

⚠️ Potential issue | 🟡 Minor

Lock down the cutoff with the inverse boundary case.

This only proves the > 45 min branch. The suite still misses the branch where the next regular run is already inside the cutoff, which is the timing boundary this PR just changed. Add a supported cadence that lands under that threshold and assert that pendingRetryAt is not written.

As per coding guidelines **/*.test.ts: Tests must cover edge cases and error paths - include assertions for edge cases (empty inputs, boundary values, null/undefined).
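The cutoff comparison itself is easy to pin down in isolation. A minimal sketch, where `shouldScheduleRetry` is a hypothetical extraction of the 45-minute guard (not a function in this repo), covering both sides of the boundary:

```typescript
// Hypothetical extraction of the 45-minute guard so both branches and the
// exact boundary can be asserted; 60 and 1440 mirror the hourly/daily mapping.
function shouldScheduleRetry(minsUntilNormalCheck: number): boolean {
  return minsUntilNormalCheck > 45;
}

console.log(shouldScheduleRetry(60));   // true: hourly cadence schedules a retry
console.log(shouldScheduleRetry(1440)); // true: daily cadence schedules a retry
console.log(shouldScheduleRetry(45));   // false: the boundary itself is excluded
```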

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/services/scraper.test.ts` around lines 6450 - 6464, Add a
complementary test to assert the inverse boundary: create a monitor (using
makeMonitor) with frequency "hourly" whose lastChecked yields a next regular run
inside the 45-minute cutoff (e.g., lastChecked such that next check < 45 minutes
away), mock fetch to fail as in the existing test, call runWithTimers(monitor),
and then assert that dbUpdateSetCalls does NOT include an entry with
pendingRetryAt (i.e., pendingRetryAt was not written). Keep the same
spying/mocking pattern (vi.spyOn(globalThis, "fetch").mockRejectedValueOnce) and
reuse the existing result.status assertion to confirm error before checking
absence of pendingRetryAt.

Comment on lines +514 to +557
/** Permanent error patterns that should never trigger auto-retry. */
const PERMANENT_ERROR_RE =
  /ENOTFOUND|certificate|ssl|tls|SSRF blocked|Could not resolve|SSL\/TLS error|URL is not allowed/i;

/**
 * Schedule a single auto-retry 35 minutes from now for transient scrape errors.
 * Skips if the monitor was just paused, has a pending retry, the error is permanent,
 * or the next normal check is imminent (within 45 minutes).
 */
async function maybeScheduleAutoRetry(
  monitor: Monitor,
  errorMessage: string,
  wasPaused: boolean,
): Promise<void> {
  if (
    !monitor.active ||
    wasPaused ||
    monitor.pendingRetryAt ||
    PERMANENT_ERROR_RE.test(errorMessage)
  ) {
    return;
  }

  const frequencyMinutes = monitor.frequency === "hourly" ? 60 : 1440;
  // handleMonitorFailure has already updated lastChecked to now in the DB,
  // so the next normal check is ~frequencyMinutes from now.
  const minsUntilNormal = frequencyMinutes;

  if (minsUntilNormal > 45) {
    try {
      const retryAt = new Date(Date.now() + 35 * 60 * 1000);
      await db.update(monitors)
        .set({ pendingRetryAt: retryAt })
        .where(eq(monitors.id, monitor.id));
      console.log(`[AutoRetry] Monitor ${monitor.id} — retry scheduled at ${retryAt.toISOString()}`);
    } catch (err) {
      console.error(`[AutoRetry] Failed to set pendingRetryAt for monitor ${monitor.id}:`,
        err instanceof Error ? err.message : err);
    }
  } else {
    console.log(`[AutoRetry] Monitor ${monitor.id} — skipped (next normal check in ${Math.round(minsUntilNormal)} min)`);
  }
}

⚠️ Potential issue | 🟠 Major

Permanent HTTP failures still schedule auto-retry.

maybeScheduleAutoRetry() reclassifies failures from errorMessage, but classifyHttpStatus() already has the structured transient/permanent decision. Messages like Page not found (HTTP 404) and Page no longer exists (HTTP 410) do not match PERMANENT_ERROR_RE, so those permanent failures still set pendingRetryAt. Pass an explicit retryable flag into this helper instead of regexing strings.
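The shape of that refactor, as a hedged sketch: `classifyFailure`, `shouldAutoRetry`, and the status set below are illustrative names and assumptions, not this repository's actual `classifyHttpStatus` API.

```typescript
// Hedged sketch: carry a structured retryable decision from classification
// to the retry helper instead of regexing error strings. All names here
// are illustrative.
interface FailureClass {
  retryable: boolean;
  message: string;
}

const PERMANENT_HTTP_STATUSES = new Set([401, 403, 404, 410]);

function classifyFailure(status: number | null, message: string): FailureClass {
  if (status !== null && PERMANENT_HTTP_STATUSES.has(status)) {
    return { retryable: false, message };
  }
  return { retryable: true, message }; // network-level errors: assume transient
}

// The retry helper then takes the flag directly, with no string matching:
function shouldAutoRetry(
  active: boolean,
  wasPaused: boolean,
  hasPendingRetry: boolean,
  retryable: boolean,
): boolean {
  return active && !wasPaused && !hasPendingRetry && retryable;
}

console.log(classifyFailure(404, "Page not found (HTTP 404)").retryable); // false
console.log(classifyFailure(null, "Network timeout").retryable);          // true
```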

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/services/scraper.ts` around lines 514 - 557, maybeScheduleAutoRetry
currently inspects errorMessage with PERMANENT_ERROR_RE and can schedule retries
for failures that classifyHttpStatus already marks permanent (e.g., 404/410);
change the call sites to pass an explicit boolean (e.g., isRetryable or
retryable) into maybeScheduleAutoRetry and update maybeScheduleAutoRetry
signature to accept that flag instead of regexing errorMessage; inside
maybeScheduleAutoRetry use monitor.active, wasPaused, monitor.pendingRetryAt and
the new retryable flag to decide early-return, and remove/ignore
PERMANENT_ERROR_RE and the errorMessage-based check; update any call sites that
previously passed errorMessage (where classifyHttpStatus is used) to compute and
pass the retryable boolean from classifyHttpStatus and continue logging
errorMessage as before.

  pauseReason: text("pause_reason"),
  healthAlertSentAt: timestamp("health_alert_sent_at"),
  lastHealthyAt: timestamp("last_healthy_at"),
  pendingRetryAt: timestamp("pending_retry_at"),

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== shared/routes.ts monitor schemas =="
rg -n -C4 'pendingRetryAt|insertMonitorSchema|api\.monitors\.(list|get)|monitor.*(schema|response)' shared/routes.ts || true

echo
echo "== storage monitor getters =="
rg -n -C12 'getMonitorsWithTags|getMonitorWithTags|getAllActiveMonitors' server/storage.ts || true

echo
echo "== routes returning monitor payloads =="
rg -n -C2 'res\.json\(monitors\)|res\.json\(monitor\)' server/routes.ts || true

Repository: bd73-com/fetchthechange

Length of output: 8031


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Verify z.custom behavior and type enforcement =="
rg -n 'z\.custom' shared/routes.ts -A2 -B2

echo
echo "== Check for response middleware/interceptors =="
rg -n 'middleware|interceptor|filter.*response|pick.*monitor' server/routes.ts | head -20 || true

echo
echo "== Confirm pendingRetryAt is in monitors.$inferSelect by checking schema =="
rg -n 'pendingRetryAt' shared/schema.ts

Repository: bd73-com/fetchthechange

Length of output: 3944


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Confirm pendingRetryAt column in monitors table =="
rg -n 'pendingRetryAt' shared/schema.ts -B2 -A2

echo
echo "== Check monitors table definition start =="
rg -n 'export const monitors' shared/schema.ts -A30

Repository: bd73-com/fetchthechange

Length of output: 3171


Critical: pendingRetryAt leaks in monitor GET/list responses due to missing response schema omit.

The z.custom<typeof monitors.$inferSelect>() response schemas at lines 131, 139, 149, 159, and 222 in shared/routes.ts provide TypeScript type hints only—they perform zero runtime validation or field filtering. Since pendingRetryAt is now in the monitors table schema, monitors.$inferSelect includes it. The storage getters (getMonitorsWithTags, getMonitorWithTags) execute bare db.select().from(monitors) queries that retrieve all columns, and the route handlers serialize these directly via res.json() without filtering. The field will be exposed to API clients despite being designated server-managed and omitted from insertMonitorSchema.

Update response schemas in shared/routes.ts to explicitly exclude pendingRetryAt:

200: z.custom<Omit<typeof monitors.$inferSelect, 'pendingRetryAt'>>()

Or filter in storage getters before returning. Verify all monitor response endpoints (list, get, create, update, setTags).
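Either remedy needs a runtime strip, since `z.custom` only annotates types. A minimal sketch of the storage-side option, where `toPublicMonitor` and the row shape are hypothetical:

```typescript
// Hypothetical helper: drop the server-only column before rows reach
// res.json(). The generic signature keeps the rest of the row typed.
function toPublicMonitor<T extends { pendingRetryAt?: unknown }>(
  row: T,
): Omit<T, "pendingRetryAt"> {
  const { pendingRetryAt, ...publicRow } = row;
  return publicRow;
}

const row = { id: 1, url: "https://example.com", pendingRetryAt: new Date() };
console.log("pendingRetryAt" in toPublicMonitor(row)); // false
```

The schema-side `Omit<...>` alternative in the comment above fixes the TypeScript type but still requires a strip like this (or column selection in the query) to change the JSON actually sent.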

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@shared/schema.ts` at line 30, The monitor response currently leaks the
server-only pendingRetryAt column because routes use z.custom<typeof
monitors.$inferSelect>() (which does no runtime filtering) and storage getters
getMonitorsWithTags / getMonitorWithTags select all columns; update the response
schemas in shared/routes.ts for all monitor endpoints (list, get, create,
update, setTags) to explicitly omit pendingRetryAt (e.g. use
z.custom<Omit<typeof monitors.$inferSelect, 'pendingRetryAt'>>()) or
alternatively strip pendingRetryAt in the storage getters (getMonitorsWithTags
and getMonitorWithTags) before returning results so the field is never
serialized to clients.

- Add HTTP 4xx permanent error patterns (401, 403, 404, 410) to
  PERMANENT_ERROR_RE so monitors failing with permanent HTTP status
  codes don't schedule unnecessary auto-retries
- Add test: SSRF-blocked URLs don't trigger auto-retry
- Add test: HTTP 404 permanent errors don't trigger auto-retry

https://claude.ai/code/session_013FQG661RB6D6wMie2REkL6
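A hedged reconstruction of the pattern the follow-up commit describes; the exact regex in the merged code is not shown in this thread and may differ:

```typescript
// Assumed extension of PERMANENT_ERROR_RE per the commit message above,
// adding the permanent HTTP status codes 401, 403, 404, and 410.
const PERMANENT_ERROR_RE =
  /ENOTFOUND|certificate|ssl|tls|SSRF blocked|Could not resolve|SSL\/TLS error|URL is not allowed|HTTP 40[134]\b|HTTP 410\b/i;

console.log(PERMANENT_ERROR_RE.test("Page not found (HTTP 404)")); // true
console.log(PERMANENT_ERROR_RE.test("Request timed out"));         // false
```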
@bd73-com bd73-com merged commit dc258cc into main Apr 1, 2026
2 checks passed
@bd73-com bd73-com deleted the claude/monitor-auto-retry-mQxkV branch April 1, 2026 11:11

Labels

feature New feature
