
feat: self-healing monitor auto-retry after transient failures #302

Merged
bd73-com merged 3 commits into main from claude/monitor-auto-retry-mQxkV
Apr 1, 2026

Conversation

bd73-com (Owner) commented Apr 1, 2026

Summary

When a scheduled monitor check fails due to a transient error (timeout, network blip), the monitor previously entered an Error health state and waited up to 24 hours for the next normally-scheduled check. This PR adds a single automatic re-check 35 minutes after a transient scrape failure so momentary site outages recover silently without user intervention.

Changes

Schema

  • Added nullable pendingRetryAt timestamp column to the monitors table (shared/schema.ts)
  • Omitted from insert schema (server-managed only, never exposed via /api/v1/)
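
As a rough sketch only (the real table in shared/schema.ts has many more columns, elided here; the imports assume the project's standard Drizzle pg-core setup), the schema change might look like:

```typescript
import { pgTable, serial, timestamp } from "drizzle-orm/pg-core";

export const monitors = pgTable("monitors", {
  id: serial("id").primaryKey(),
  // ...existing columns elided...
  // Server-managed retry marker; nullable by default in Drizzle, and
  // omitted from the insert schema so clients can never set it directly.
  pendingRetryAt: timestamp("pending_retry_at"),
});
```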

Scraper (server/services/scraper.ts)

  • Added maybeScheduleAutoRetry() helper that sets pendingRetryAt 35 min in the future after transient failures
  • Guards: skips permanent errors (ENOTFOUND, certificate/SSL, SSRF), inactive monitors, already-pending retries, and monitors just auto-paused by handleMonitorFailure
  • Checks handleMonitorFailure's paused return value to avoid orphaning retries on paused monitors
  • Clears pendingRetryAt (sets it to null) on successful check via the existing storage.updateMonitor call
  • Added PERMANENT_ERROR_RE constant to centralize the permanent-error regex pattern
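
The guard sequence above can be sketched as a pure predicate. This is a simplified stand-in, not the actual server/services/scraper.ts code: the real helper also weighs how soon the next regularly scheduled check is (the 45-minute imminence cutoff exercised in the tests), which is omitted here.

```typescript
// Pattern taken verbatim from the PR; matches errors that should never retry.
const PERMANENT_ERROR_RE =
  /ENOTFOUND|certificate|ssl|tls|SSRF blocked|Could not resolve|SSL\/TLS error|URL is not allowed/i;

interface RetryGuardInput {
  active: boolean;
  pendingRetryAt: Date | null;
}

// Returns true only when every guard passes: the monitor is active, was not
// just auto-paused by handleMonitorFailure, has no retry already armed, and
// the error message is not classified as permanent.
function shouldScheduleAutoRetry(
  monitor: RetryGuardInput,
  errorMessage: string,
  wasPaused: boolean,
): boolean {
  if (!monitor.active) return false;
  if (wasPaused) return false;
  if (monitor.pendingRetryAt !== null) return false;
  if (PERMANENT_ERROR_RE.test(errorMessage)) return false;
  return true;
}

// The retry fires 35 minutes after the failed check.
const RETRY_DELAY_MS = 35 * 60 * 1000;

function computeRetryAt(now: Date): Date {
  return new Date(now.getTime() + RETRY_DELAY_MS);
}
```

Classifying by message regex is what this PR ships; one review comment below suggests instead passing an explicit retryable flag computed by classifyHttpStatus into the helper.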

Scheduler (server/services/scheduler.ts)

  • Main cron loop picks up monitors with pendingRetryAt <= now as due for check
  • runCheckWithLimit captures pending-retry state before check and clears pendingRetryAt in finally block (awaited, not fire-and-forget)
  • Uses identical checkMonitor code path — no separate scraper logic
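
The pickup rule can be illustrated with a minimal due-check predicate. This is illustrative only: intervalMs is a hypothetical normalization of the monitor's frequency string ("hourly" etc.), and the real cron loop runs eligible monitors through runCheckWithLimit with up to 30 seconds of jitter.

```typescript
interface DueCheckInput {
  lastChecked: Date | null;
  intervalMs: number; // hypothetical: frequency normalized to milliseconds
  pendingRetryAt: Date | null;
}

function isDueForCheck(m: DueCheckInput, now: Date): boolean {
  // An elapsed pendingRetryAt makes the monitor due immediately,
  // regardless of its regular cadence (the <= comparison matters:
  // pendingRetryAt exactly equal to now also counts as due).
  if (m.pendingRetryAt !== null && m.pendingRetryAt.getTime() <= now.getTime()) {
    return true;
  }
  // Otherwise fall back to the normal schedule.
  if (m.lastChecked === null) return true;
  return now.getTime() - m.lastChecked.getTime() >= m.intervalMs;
}
```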

Routes (server/routes.ts)

  • Manual "Check Now" (POST /api/monitors/:id/check) clears pendingRetryAt before running the check to prevent race with scheduler cron

Key design constraints

  • Single retry only — if retry also fails, no further auto-retries until next scheduled check
  • consecutiveFailures always increments — retry is recovery, not suppression
  • No UI changes — invisible to users except errors self-resolve faster

How to test

  1. Run npm run schema:push to add the pending_retry_at column
  2. Create a monitor pointing to a URL that will time out (e.g., a URL with a very slow response)
  3. Wait for the scheduled check to fail — verify pending_retry_at is set ~35 minutes in the future in the DB
  4. Wait for the retry to fire — verify pending_retry_at is cleared and [AutoRetry] log messages appear
  5. Create a monitor with a broken selector — verify pending_retry_at remains NULL (no retry for non-transient errors)
  6. Click "Check Now" on a monitor with a pending retry — verify pending_retry_at is cleared
  7. Run npm run check && npm run test — all 1910 tests pass

https://claude.ai/code/session_013FQG661RB6D6wMie2REkL6

Summary by CodeRabbit

  • New Features
    • Transient monitor failures now schedule a one-time auto-retry ~35 minutes later; the system records a pending retry timestamp.
  • Behavior Changes / Bug Fixes
    • Manual checks and scheduled retry attempts clear any pending retry marker after the attempt (cleared on both success and failure).
    • Auto-retries are skipped for inactive monitors and permanent/non-retryable failures.
  • Tests
    • Added extensive tests covering auto-retry scheduling, clearing, and edge cases.

claude added 2 commits April 1, 2026 07:38
Schedule a single automatic re-check 35 minutes after a transient
scrape failure so momentary site outages recover silently. Adds
pendingRetryAt column to monitors table, scheduler pickup logic,
and clears the retry on success, manual check, or after execution.

https://claude.ai/code/session_013FQG661RB6D6wMie2REkL6
- Extract maybeScheduleAutoRetry helper to eliminate duplication
- Check handleMonitorFailure paused return to skip retry for paused monitors
- Fix stale lastChecked in imminence calculation (use current time)
- Move manual Check Now pendingRetryAt clear before checkMonitor call
- Await DB update in scheduler finally block instead of fire-and-forget
- Add test for auto-paused monitor skipping retry

https://claude.ai/code/session_013FQG661RB6D6wMie2REkL6
@github-actions github-actions bot added the feature New feature label Apr 1, 2026

coderabbitai bot commented Apr 1, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Adds a nullable pendingRetryAt column to monitors and implements an auto-retry flow: transient failures schedule a one-time retry (~35 minutes), the scheduler treats elapsed pendingRetryAt as eligible for an immediate check, and both scheduler and manual checks best-effort clear pendingRetryAt after attempts.

Changes

  • Schema (shared/schema.ts): Added nullable pendingRetryAt timestamp (db column pending_retry_at) and omitted it from the insert schema.
  • Scraper, auto-retry scheduling & clearing (server/services/scraper.ts): Introduced maybeScheduleAutoRetry(...) to write pendingRetryAt = now + ~35m for transient errors; clears pendingRetryAt on successful check; writes use db.update(monitors).set(...).where(eq(monitors.id, ...)) with logged errors.
  • Scheduler, detection & clearing (server/services/scheduler.ts): Per-minute loop now treats monitors with pendingRetryAt <= now as eligible for check; runCheckWithLimit attempts to clear pendingRetryAt in a finally block (errors logged, non-blocking).
  • Manual check endpoint (server/routes.ts): Authenticated api.monitors.check route attempts to clear pendingRetryAt (best-effort) before performing a manual check; DB errors are caught and logged.
  • Tests, scraper & scheduler (server/services/scraper.test.ts, server/services/scheduler.test.ts): Extended test helpers to include pendingRetryAt; added suites verifying scheduling on transient errors, non-scheduling for permanent/non-retryable cases, scheduler triggering on elapsed pendingRetryAt, and clearing of the flag on success/failure.
  • Mocks/support (scheduler.test.ts mocks, scraper.test.ts mockDbUpdateChain): Updated DB mocks to support db.update().set(...).where(...) chains and added helpers to capture pendingRetryAt writes.

Sequence Diagram

sequenceDiagram
    participant Client as Client
    participant Scheduler as Scheduler
    participant Scraper as Scraper
    participant DB as Database
    participant Monitor as MonitorInstance

    rect rgba(255, 200, 100, 0.5)
    Note over Scraper,DB: Transient failure → schedule retry
    Scraper->>Monitor: perform check
    Monitor-->>Scraper: returns transient error
    Scraper->>Scraper: maybeScheduleAutoRetry()
    Scraper->>DB: update monitors set pendingRetryAt = now + 35m where id=...
    DB-->>Scraper: update result
    end

    rect rgba(150, 200, 255, 0.5)
    Note over Scheduler,DB: Scheduler triggers elapsed retry
    Scheduler->>DB: query monitors (includes pendingRetryAt)
    DB-->>Scheduler: returns monitor with pendingRetryAt <= now
    Scheduler->>Scraper: call checkMonitor(monitor)
    Scraper-->>Scheduler: check completes
    end

    rect rgba(200, 255, 150, 0.5)
    Note over Scheduler,DB: Cleanup (best-effort)
    alt any check completion
        Scheduler->>DB: update monitors set pendingRetryAt = null where id=...
        DB-->>Scheduler: update result (errors logged but non-blocking)
    end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Security note: DB update operations that clear or set pendingRetryAt are best-effort and only log failures; consider ensuring monitoring/alerting for repeated update failures to avoid silent retry-state drift.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Coverage is 45.45%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled)
  • Title Check: ✅ Passed. The title clearly and specifically describes the main feature (automatic retry of monitors after transient failures), matching the changeset's primary objective.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/services/scheduler.ts (1)

115-145: ⚠️ Potential issue | 🟠 Major

Re-read retry state before executing the jittered task.

hadPendingRetry is derived from the cron snapshot, not the current row. Because execution is delayed by up to 30 seconds, a user can hit Check Now after the scan, clear pendingRetryAt, and still get the stale auto-retry here. That reintroduces duplicate checks and duplicate notifications. Fetch the fresh monitor state before calling checkMonitor().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/services/scheduler.ts` around lines 115 - 145, The snapshot-derived
hadPendingRetry can be stale; before calling checkMonitor(monitor) re-query the
current monitor row from the database (using the monitors table and the
monitor.id) and recompute hadPendingRetry from that fresh row's pendingRetryAt,
then proceed to call checkMonitor() with the up-to-date monitor object; also
ensure the finally block that clears pendingRetryAt operates against the same
fresh monitor state so you don't clear or miss clearing based on stale data (use
the fresh monitor's id/fields when updating).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/routes.ts`:
- Around line 589-597: The current route handler directly calls
db.update(...).set({ pendingRetryAt: null }) and catches errors, which fails
open; move this update behind the IStorage abstraction by adding a storage
method (e.g., IStorage.clearPendingRetry(monitorId) or
IStorage.setPendingRetryNull(id)) implemented in server/storage.ts that performs
the db.update on the monitors table and returns/throws on error, then call that
method from the route instead of the direct db.update; ensure the route
aborts/returns an error (does not continue with the manual check) if the storage
call fails so the armed retry remains cleared only on success.

In `@server/services/scheduler.test.ts`:
- Around line 418-446: Add a new unit test alongside the existing cases that
creates a monitor via makeMonitor with pendingRetryAt set to new Date() (exactly
"now"), mockGetAllActiveMonitors to return that monitor, then call
startScheduler(), runCron("* * * * *") and advance timers
(vi.advanceTimersByTimeAsync(31000)), and finally assert mockCheckMonitor was
called with that monitor to verify the <= now boundary; place this test next to
the existing "pendingRetryAt <= now" and "in the future" tests so it uses the
same helpers (makeMonitor, mockGetAllActiveMonitors, startScheduler, runCron,
mockCheckMonitor).
- Around line 65-72: The update mock returns a set that creates a where function
but doesn't record its args; change the mock inside the update: replace the
inline creation of whereFn with a vi.fn(...) assigned to a shared test-level
recorder (e.g., mockDbUpdateWhere or reuse mockDbUpdateSet to store whereArgs)
so every call to where(...) is captured (returning the same whereResult
Promise). Then update the tests to assert that mockDbUpdateWhere was called with
the expected predicates (in addition to existing assertions on mockDbUpdateSet)
to prevent a silent removal of the where(...) clause in scheduler.update.
- Around line 456-463: The tests call fire-and-forget work via runCheckWithLimit
(triggered by runCron/startScheduler) which clears pendingRetryAt in a finally
block, but timers advancement doesn't drain microtasks; before the assertions
that check mockCheckMonitor and mockDbUpdateSet({ pendingRetryAt: null }) add an
awaited microtask drain (e.g., await flushPromises()) so the detached async
finally block completes and the assertions observe the cleared pendingRetryAt.

In `@server/services/scraper.test.ts`:
- Around line 6450-6464: Add a complementary test to assert the inverse
boundary: create a monitor (using makeMonitor) with frequency "hourly" whose
lastChecked yields a next regular run inside the 45-minute cutoff (e.g.,
lastChecked such that next check < 45 minutes away), mock fetch to fail as in
the existing test, call runWithTimers(monitor), and then assert that
dbUpdateSetCalls does NOT include an entry with pendingRetryAt (i.e.,
pendingRetryAt was not written). Keep the same spying/mocking pattern
(vi.spyOn(globalThis, "fetch").mockRejectedValueOnce) and reuse the existing
result.status assertion to confirm error before checking absence of
pendingRetryAt.
- Around line 6392-6448: Add a test case mirroring the existing error-path tests
that simulates an SSRF/blocked target by mocking global fetch to reject with an
Error whose message includes "URL is not allowed" (or the exact SSRF-block
message your classifier emits), create a monitor via makeMonitor with
frequency/lastChecked similar to the other tests, call runWithTimers(monitor),
assert result.status === "error", and then assert dbUpdateSetCalls does NOT
contain any call where pendingRetryAt is a Date (same pattern used in the other
tests); reference the existing test helpers makeMonitor, runWithTimers and the
dbUpdateSetCalls inspection to implement the new "does NOT set pendingRetryAt
when URL is not allowed / SSRF blocked" test.

In `@server/services/scraper.ts`:
- Around line 514-557: maybeScheduleAutoRetry currently inspects errorMessage
with PERMANENT_ERROR_RE and can schedule retries for failures that
classifyHttpStatus already marks permanent (e.g., 404/410); change the call
sites to pass an explicit boolean (e.g., isRetryable or retryable) into
maybeScheduleAutoRetry and update maybeScheduleAutoRetry signature to accept
that flag instead of regexing errorMessage; inside maybeScheduleAutoRetry use
monitor.active, wasPaused, monitor.pendingRetryAt and the new retryable flag to
decide early-return, and remove/ignore PERMANENT_ERROR_RE and the
errorMessage-based check; update any call sites that previously passed
errorMessage (where classifyHttpStatus is used) to compute and pass the
retryable boolean from classifyHttpStatus and continue logging errorMessage as
before.

In `@shared/schema.ts`:
- Line 30: The monitor response currently leaks the server-only pendingRetryAt
column because routes use z.custom<typeof monitors.$inferSelect>() (which does
no runtime filtering) and storage getters getMonitorsWithTags /
getMonitorWithTags select all columns; update the response schemas in
shared/routes.ts for all monitor endpoints (list, get, create, update, setTags)
to explicitly omit pendingRetryAt (e.g. use z.custom<Omit<typeof
monitors.$inferSelect, 'pendingRetryAt'>>()) or alternatively strip
pendingRetryAt in the storage getters (getMonitorsWithTags and
getMonitorWithTags) before returning results so the field is never serialized to
clients.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7bec5d18-d6cb-4682-a2f2-37d337f9efa9

📥 Commits

Reviewing files that changed from the base of the PR and between 5f2355f and 51feead.

📒 Files selected for processing (6)
  • server/routes.ts
  • server/services/scheduler.test.ts
  • server/services/scheduler.ts
  • server/services/scraper.test.ts
  • server/services/scraper.ts
  • shared/schema.ts

Comment on lines +589 to +597
// Clear any pending auto-retry before the manual check to prevent
// a narrow race where the scheduler cron fires a duplicate check.
await db.update(monitors)
.set({ pendingRetryAt: null })
.where(eq(monitors.id, id))
.catch((err: unknown) => {
console.error(`[AutoRetry] Failed to clear pendingRetryAt for monitor ${id}:`,
err instanceof Error ? err.message : err);
});

⚠️ Potential issue | 🟠 Major

Don't fail open when clearing pendingRetryAt.

If this write fails, the manual check still runs while the due retry remains armed, so the scheduler can execute the same monitor again and duplicate change rows or notifications. This guard should abort the request on failure, and the write should stay behind IStorage.

💡 Safer change
-      await db.update(monitors)
-        .set({ pendingRetryAt: null })
-        .where(eq(monitors.id, id))
-        .catch((err: unknown) => {
-          console.error(`[AutoRetry] Failed to clear pendingRetryAt for monitor ${id}:`,
-            err instanceof Error ? err.message : err);
-        });
+      try {
+        await storage.updateMonitor(id, { pendingRetryAt: null });
+      } catch (err: unknown) {
+        console.error(`[AutoRetry] Failed to clear pendingRetryAt for monitor ${id}:`,
+          err instanceof Error ? err.message : err);
+        return res.status(503).json({
+          message: "Unable to start the manual check right now.",
+          code: "RETRY_STATE_UPDATE_FAILED",
+        });
+      }
As per coding guidelines, `Never put database queries or Drizzle ORM calls directly in route handlers — all database access must go through methods on the IStorage interface implemented in server/storage.ts`.

Comment on lines +65 to +72
update: vi.fn().mockReturnValue({
set: (...args: any[]) => {
mockDbUpdateSet(...args);
const whereResult = Promise.resolve();
const whereFn = vi.fn().mockReturnValue(whereResult);
return { where: whereFn };
},
}),

⚠️ Potential issue | 🟠 Major

Capture and assert the where(...) clause for retry-clear updates.

Line 70 returns a where function, but its args are not recorded. With current assertions (Lines 462, 480), a regression that removes where(...) could still pass and hide a full-table update bug.

Suggested hardening for the DB update mock and assertions
 const {
   mockCheckMonitor,
   mockGetAllActiveMonitors,
   mockCleanupPollutedValues,
   mockDbExecute,
   mockDbUpdateSet,
+  mockDbUpdateWhere,
   cronCallbacks,
   mockMonitorsNeedingRetry,
   mockDeliverWebhook,
 } = vi.hoisted(() => ({
@@
   mockDbExecute: vi.fn().mockResolvedValue({ rowCount: 0 }),
   mockDbUpdateSet: vi.fn(),
+  mockDbUpdateWhere: vi.fn(),
@@
     update: vi.fn().mockReturnValue({
       set: (...args: any[]) => {
         mockDbUpdateSet(...args);
-        const whereResult = Promise.resolve();
-        const whereFn = vi.fn().mockReturnValue(whereResult);
+        const whereFn = vi.fn((...whereArgs: any[]) => {
+          mockDbUpdateWhere(...whereArgs);
+          return Promise.resolve();
+        });
         return { where: whereFn };
       },
     }),
 expect(mockDbUpdateSet).toHaveBeenCalledWith({ pendingRetryAt: null });
+expect(mockDbUpdateWhere).toHaveBeenCalled();

Comment on lines +418 to +446
it("triggers check for monitor with pendingRetryAt <= now", async () => {
const monitor = makeMonitor({
frequency: "hourly",
lastChecked: new Date(Date.now() - 30 * 60 * 1000), // 30 min ago — not normally due
pendingRetryAt: new Date(Date.now() - 1000), // 1 second in the past
});
mockGetAllActiveMonitors.mockResolvedValueOnce([monitor]);

await startScheduler();
await runCron("* * * * *");
await vi.advanceTimersByTimeAsync(31000);

expect(mockCheckMonitor).toHaveBeenCalledWith(monitor);
});

it("does NOT trigger check for monitor with pendingRetryAt in the future", async () => {
const monitor = makeMonitor({
frequency: "hourly",
lastChecked: new Date(Date.now() - 30 * 60 * 1000), // 30 min ago — not normally due
pendingRetryAt: new Date(Date.now() + 30 * 60 * 1000), // 30 min in the future
});
mockGetAllActiveMonitors.mockResolvedValueOnce([monitor]);

await startScheduler();
await runCron("* * * * *");
await vi.advanceTimersByTimeAsync(31000);

expect(mockCheckMonitor).not.toHaveBeenCalled();
});

⚠️ Potential issue | 🟡 Minor

Add explicit boundary test for pendingRetryAt === now.

You cover past and future, but not the exact boundary for the <= now condition. Locking that case prevents subtle regressions.

Boundary test to add
+  it("triggers check for monitor with pendingRetryAt exactly now", async () => {
+    const now = new Date();
+    vi.setSystemTime(now);
+    const monitor = makeMonitor({
+      frequency: "hourly",
+      lastChecked: new Date(now.getTime() - 30 * 60 * 1000),
+      pendingRetryAt: now,
+    });
+    mockGetAllActiveMonitors.mockResolvedValueOnce([monitor]);
+
+    await startScheduler();
+    await runCron("* * * * *");
+    await vi.advanceTimersByTimeAsync(31000);
+
+    expect(mockCheckMonitor).toHaveBeenCalledWith(monitor);
+  });

As per coding guidelines, "Tests must cover edge cases and error paths - include assertions for edge cases (empty inputs, boundary values, null/undefined)".


Comment on lines +456 to +463
await startScheduler();
await runCron("* * * * *");
await vi.advanceTimersByTimeAsync(31000);

expect(mockCheckMonitor).toHaveBeenCalledWith(monitor);
// The finally block should clear pendingRetryAt
expect(mockDbUpdateSet).toHaveBeenCalledWith({ pendingRetryAt: null });
});

⚠️ Potential issue | 🟠 Major

Add await flushPromises() before assertions to prevent race conditions with detached async work.

Both tests at lines 456–463 (success path) and 474–480 (failure path) invoke void runCheckWithLimit(), which schedules a fire-and-forget async task that clears pendingRetryAt in its finally block. The vi.advanceTimersByTimeAsync() call advances timers but does not drain the microtask queue, so the clear may not execute before the assertions run, causing intermittent failures.

Add microtask flush before assertions
   await startScheduler();
   await runCron("* * * * *");
   await vi.advanceTimersByTimeAsync(31000);
+  await flushPromises();

   expect(mockCheckMonitor).toHaveBeenCalledWith(monitor);
   // The finally block should clear pendingRetryAt
   expect(mockDbUpdateSet).toHaveBeenCalledWith({ pendingRetryAt: null });
@@
   await startScheduler();
   await runCron("* * * * *");
   await vi.advanceTimersByTimeAsync(31000);
+  await flushPromises();

   expect(mockCheckMonitor).toHaveBeenCalledWith(monitor);
   // Even on failure, pendingRetryAt should be cleared
   expect(mockDbUpdateSet).toHaveBeenCalledWith({ pendingRetryAt: null });

Comment on lines +6450 to +6464
it("sets pendingRetryAt for hourly monitors (next check > 45 min after failure)", async () => {
  vi.spyOn(globalThis, "fetch").mockRejectedValueOnce(new Error("Network failure"));

  // hourly monitor — after failure, handleMonitorFailure updates lastChecked to now,
  // so next normal check is 60 min away (> 45 min threshold)
  const monitor = makeMonitor({
    frequency: "hourly",
    lastChecked: new Date(Date.now() - 30 * 60 * 1000),
  });
  const result = await runWithTimers(monitor);

  expect(result.status).toBe("error");
  const retrySetCall = dbUpdateSetCalls.find((c: any) => c.pendingRetryAt instanceof Date);
  expect(retrySetCall).toBeDefined();
});

⚠️ Potential issue | 🟡 Minor

Lock down the cutoff with the inverse boundary case.

This only proves the > 45 min branch. The suite still misses the branch where the next regular run is already inside the cutoff, which is the timing boundary this PR just changed. Add a supported cadence that lands under that threshold and assert that pendingRetryAt is not written.

As per coding guidelines **/*.test.ts: Tests must cover edge cases and error paths - include assertions for edge cases (empty inputs, boundary values, null/undefined).
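The cutoff comparison itself is easy to pin down in isolation. A minimal sketch, where `shouldScheduleRetry` is a hypothetical extraction of the 45-minute guard (not a function in this repo), covering both sides of the boundary:

```typescript
// Hypothetical extraction of the 45-minute guard so both branches and the
// exact boundary can be asserted; 60 and 1440 mirror the hourly/daily mapping.
function shouldScheduleRetry(minsUntilNormalCheck: number): boolean {
  return minsUntilNormalCheck > 45;
}

console.log(shouldScheduleRetry(60));   // true: hourly cadence schedules a retry
console.log(shouldScheduleRetry(1440)); // true: daily cadence schedules a retry
console.log(shouldScheduleRetry(45));   // false: the boundary itself is excluded
```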

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/services/scraper.test.ts` around lines 6450 - 6464, Add a
complementary test to assert the inverse boundary: create a monitor (using
makeMonitor) with frequency "hourly" whose lastChecked yields a next regular run
inside the 45-minute cutoff (e.g., lastChecked such that next check < 45 minutes
away), mock fetch to fail as in the existing test, call runWithTimers(monitor),
and then assert that dbUpdateSetCalls does NOT include an entry with
pendingRetryAt (i.e., pendingRetryAt was not written). Keep the same
spying/mocking pattern (vi.spyOn(globalThis, "fetch").mockRejectedValueOnce) and
reuse the existing result.status assertion to confirm error before checking
absence of pendingRetryAt.

Comment on lines +514 to +557
/** Permanent error patterns that should never trigger auto-retry. */
const PERMANENT_ERROR_RE =
  /ENOTFOUND|certificate|ssl|tls|SSRF blocked|Could not resolve|SSL\/TLS error|URL is not allowed/i;

/**
 * Schedule a single auto-retry 35 minutes from now for transient scrape errors.
 * Skips if the monitor was just paused, has a pending retry, the error is permanent,
 * or the next normal check is imminent (within 45 minutes).
 */
async function maybeScheduleAutoRetry(
  monitor: Monitor,
  errorMessage: string,
  wasPaused: boolean,
): Promise<void> {
  if (
    !monitor.active ||
    wasPaused ||
    monitor.pendingRetryAt ||
    PERMANENT_ERROR_RE.test(errorMessage)
  ) {
    return;
  }

  const frequencyMinutes = monitor.frequency === "hourly" ? 60 : 1440;
  // handleMonitorFailure has already updated lastChecked to now in the DB,
  // so the next normal check is ~frequencyMinutes from now.
  const minsUntilNormal = frequencyMinutes;

  if (minsUntilNormal > 45) {
    try {
      const retryAt = new Date(Date.now() + 35 * 60 * 1000);
      await db.update(monitors)
        .set({ pendingRetryAt: retryAt })
        .where(eq(monitors.id, monitor.id));
      console.log(`[AutoRetry] Monitor ${monitor.id} — retry scheduled at ${retryAt.toISOString()}`);
    } catch (err) {
      console.error(`[AutoRetry] Failed to set pendingRetryAt for monitor ${monitor.id}:`,
        err instanceof Error ? err.message : err);
    }
  } else {
    console.log(`[AutoRetry] Monitor ${monitor.id} — skipped (next normal check in ${Math.round(minsUntilNormal)} min)`);
  }
}

⚠️ Potential issue | 🟠 Major

Permanent HTTP failures still schedule auto-retry.

maybeScheduleAutoRetry() reclassifies failures from errorMessage, but classifyHttpStatus() already has the structured transient/permanent decision. Messages like Page not found (HTTP 404) and Page no longer exists (HTTP 410) do not match PERMANENT_ERROR_RE, so those permanent failures still set pendingRetryAt. Pass an explicit retryable flag into this helper instead of regexing strings.
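The shape of that refactor, as a hedged sketch: `classifyFailure`, `shouldAutoRetry`, and the status set below are illustrative names and assumptions, not this repository's actual `classifyHttpStatus` API.

```typescript
// Hedged sketch: carry a structured retryable decision from classification
// to the retry helper instead of regexing error strings. All names here
// are illustrative.
interface FailureClass {
  retryable: boolean;
  message: string;
}

const PERMANENT_HTTP_STATUSES = new Set([401, 403, 404, 410]);

function classifyFailure(status: number | null, message: string): FailureClass {
  if (status !== null && PERMANENT_HTTP_STATUSES.has(status)) {
    return { retryable: false, message };
  }
  return { retryable: true, message }; // network-level errors: assume transient
}

// The retry helper then takes the flag directly, with no string matching:
function shouldAutoRetry(
  active: boolean,
  wasPaused: boolean,
  hasPendingRetry: boolean,
  retryable: boolean,
): boolean {
  return active && !wasPaused && !hasPendingRetry && retryable;
}

console.log(classifyFailure(404, "Page not found (HTTP 404)").retryable); // false
console.log(classifyFailure(null, "Network timeout").retryable);          // true
```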

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/services/scraper.ts` around lines 514 - 557, maybeScheduleAutoRetry
currently inspects errorMessage with PERMANENT_ERROR_RE and can schedule retries
for failures that classifyHttpStatus already marks permanent (e.g., 404/410);
change the call sites to pass an explicit boolean (e.g., isRetryable or
retryable) into maybeScheduleAutoRetry and update maybeScheduleAutoRetry
signature to accept that flag instead of regexing errorMessage; inside
maybeScheduleAutoRetry use monitor.active, wasPaused, monitor.pendingRetryAt and
the new retryable flag to decide early-return, and remove/ignore
PERMANENT_ERROR_RE and the errorMessage-based check; update any call sites that
previously passed errorMessage (where classifyHttpStatus is used) to compute and
pass the retryable boolean from classifyHttpStatus and continue logging
errorMessage as before.

  pauseReason: text("pause_reason"),
  healthAlertSentAt: timestamp("health_alert_sent_at"),
  lastHealthyAt: timestamp("last_healthy_at"),
  pendingRetryAt: timestamp("pending_retry_at"),

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== shared/routes.ts monitor schemas =="
rg -n -C4 'pendingRetryAt|insertMonitorSchema|api\.monitors\.(list|get)|monitor.*(schema|response)' shared/routes.ts || true

echo
echo "== storage monitor getters =="
rg -n -C12 'getMonitorsWithTags|getMonitorWithTags|getAllActiveMonitors' server/storage.ts || true

echo
echo "== routes returning monitor payloads =="
rg -n -C2 'res\.json\(monitors\)|res\.json\(monitor\)' server/routes.ts || true

Repository: bd73-com/fetchthechange

Length of output: 8031


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Verify z.custom behavior and type enforcement =="
rg -n 'z\.custom' shared/routes.ts -A2 -B2

echo
echo "== Check for response middleware/interceptors =="
rg -n 'middleware|interceptor|filter.*response|pick.*monitor' server/routes.ts | head -20 || true

echo
echo "== Confirm pendingRetryAt is in monitors.$inferSelect by checking schema =="
rg -n 'pendingRetryAt' shared/schema.ts

Repository: bd73-com/fetchthechange

Length of output: 3944


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Confirm pendingRetryAt column in monitors table =="
rg -n 'pendingRetryAt' shared/schema.ts -B2 -A2

echo
echo "== Check monitors table definition start =="
rg -n 'export const monitors' shared/schema.ts -A30

Repository: bd73-com/fetchthechange

Length of output: 3171


Critical: pendingRetryAt leaks in monitor GET/list responses due to missing response schema omit.

The z.custom<typeof monitors.$inferSelect>() response schemas at lines 131, 139, 149, 159, and 222 in shared/routes.ts provide TypeScript type hints only—they perform zero runtime validation or field filtering. Since pendingRetryAt is now in the monitors table schema, monitors.$inferSelect includes it. The storage getters (getMonitorsWithTags, getMonitorWithTags) execute bare db.select().from(monitors) queries that retrieve all columns, and the route handlers serialize these directly via res.json() without filtering. The field will be exposed to API clients despite being designated server-managed and omitted from insertMonitorSchema.

Update response schemas in shared/routes.ts to explicitly exclude pendingRetryAt:

200: z.custom<Omit<typeof monitors.$inferSelect, 'pendingRetryAt'>>()

Or filter in storage getters before returning. Verify all monitor response endpoints (list, get, create, update, setTags).
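Either remedy needs a runtime strip, since `z.custom` only annotates types. A minimal sketch of the storage-side option, where `toPublicMonitor` and the row shape are hypothetical:

```typescript
// Hypothetical helper: drop the server-only column before rows reach
// res.json(). The generic signature keeps the rest of the row typed.
function toPublicMonitor<T extends { pendingRetryAt?: unknown }>(
  row: T,
): Omit<T, "pendingRetryAt"> {
  const { pendingRetryAt, ...publicRow } = row;
  return publicRow;
}

const row = { id: 1, url: "https://example.com", pendingRetryAt: new Date() };
console.log("pendingRetryAt" in toPublicMonitor(row)); // false
```

The schema-side `Omit<...>` alternative in the comment above fixes the TypeScript type but still requires a strip like this (or column selection in the query) to change the JSON actually sent.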

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@shared/schema.ts` at line 30, The monitor response currently leaks the
server-only pendingRetryAt column because routes use z.custom<typeof
monitors.$inferSelect>() (which does no runtime filtering) and storage getters
getMonitorsWithTags / getMonitorWithTags select all columns; update the response
schemas in shared/routes.ts for all monitor endpoints (list, get, create,
update, setTags) to explicitly omit pendingRetryAt (e.g. use
z.custom<Omit<typeof monitors.$inferSelect, 'pendingRetryAt'>>()) or
alternatively strip pendingRetryAt in the storage getters (getMonitorsWithTags
and getMonitorWithTags) before returning results so the field is never
serialized to clients.

- Add HTTP 4xx permanent error patterns (401, 403, 404, 410) to
  PERMANENT_ERROR_RE so monitors failing with permanent HTTP status
  codes don't schedule unnecessary auto-retries
- Add test: SSRF-blocked URLs don't trigger auto-retry
- Add test: HTTP 404 permanent errors don't trigger auto-retry

https://claude.ai/code/session_013FQG661RB6D6wMie2REkL6
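A hedged reconstruction of the pattern the follow-up commit describes; the exact regex in the merged code is not shown in this thread and may differ:

```typescript
// Assumed extension of PERMANENT_ERROR_RE per the commit message above,
// adding the permanent HTTP status codes 401, 403, 404, and 410.
const PERMANENT_ERROR_RE =
  /ENOTFOUND|certificate|ssl|tls|SSRF blocked|Could not resolve|SSL\/TLS error|URL is not allowed|HTTP 40[134]\b|HTTP 410\b/i;

console.log(PERMANENT_ERROR_RE.test("Page not found (HTTP 404)")); // true
console.log(PERMANENT_ERROR_RE.test("Request timed out"));         // false
```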
@bd73-com bd73-com merged commit dc258cc into main Apr 1, 2026
2 checks passed
@bd73-com bd73-com deleted the claude/monitor-auto-retry-mQxkV branch April 1, 2026 11:11

Labels

feature New feature
