fix: stop infinite self-update retry loop on failure#15

Merged
TerrifiedBug merged 3 commits into main from fix/update-retry-loop
Mar 5, 2026
Conversation

@TerrifiedBug
Owner

Summary

  • Agent got stuck in infinite update retry loop when self-update failed (e.g., checksum mismatch from stale server cache)
  • The pendingAction on the VectorNode was never cleared on failure, so the server kept sending it every 15s

Fix

  • Agent: tracks failedUpdateVersion in memory — skips retries for same version
  • Agent: reports updateError in heartbeat payload (sent once, then cleared)
  • Server: clears pendingAction when agent reports update failure via heartbeat

Test plan

  • Trigger an update with a bad checksum — verify agent logs one failure then stops retrying
  • Verify pendingAction is cleared in DB after failure report
  • Verify successful updates still work (syscall.Exec replaces process)
  • Verify a new update for a different version is still attempted after a previous failure

When a self-update failed (e.g., checksum mismatch from stale server
cache), the pendingAction was never cleared. The server kept sending it
every 15s config poll, causing an infinite retry loop.

Fix:
- Agent tracks failed update version and skips retries for same version
- Agent reports updateError in heartbeat payload
- Server clears pendingAction when agent reports update failure
@greptile-apps

greptile-apps bot commented Mar 5, 2026

Greptile Summary

This PR fixes the infinite self-update retry loop by tracking the failed version in agent memory (failedUpdateVersion) and reporting the error to the server via the heartbeat payload so the server can clear pendingAction, breaking the 15-second retry cycle. The overall approach is sound, but two correctness gaps were found.

  • Agent (agent.go): Docker mode sets updateError correctly but never sets failedUpdateVersion, so the retry-skip guard is bypassed for Docker agents — each poll cycle re-enters the Docker branch for as long as the pendingAction remains on the server.
  • Server (route.ts): clearPendingAction is set unconditionally when updateError arrives, without verifying the current DB pendingAction is still a self_update. A stale error delivered after an admin has manually rescheduled a new update will silently clear the new action.
  • Client (client.go): The UpdateError field addition is clean and correct.

Confidence Score: 3/5

  • Safe to merge with caveats — the fix works for the common case but has two logic gaps that should be addressed before this can be considered fully correct.
  • The core mechanism (skip-by-version on agent, clear-on-error on server) is correct for the primary scenario. However, the Docker path never sets failedUpdateVersion, leaving Docker agents in a continuous re-entry loop if heartbeats are slow, and the server-side clearing logic doesn't guard against stale errors arriving after a new pending action has been scheduled, which can silently swallow legitimate update commands.
  • agent/internal/agent/agent.go (Docker branch missing failedUpdateVersion = action.TargetVersion) and src/app/api/agent/heartbeat/route.ts (unconditional clearPendingAction without type-checking the current DB action).

Important Files Changed

  • agent/internal/agent/agent.go — Adds failedUpdateVersion and updateError fields to track and report self-update failures; the Docker branch sets updateError but never sets failedUpdateVersion, so the retry-skip guard never fires for Docker agents.
  • agent/internal/client/client.go — Minimal, clean change: adds the UpdateError string field with omitempty to HeartbeatRequest; no issues.
  • src/app/api/agent/heartbeat/route.ts — Adds updateError to the Zod schema (correctly capped at 500 chars) and clears pendingAction on failure; clearPendingAction is set unconditionally without checking the current DB pendingAction type, risking clearing an unrelated newly scheduled action.

Sequence Diagram

sequenceDiagram
    participant A as Agent (Go)
    participant S as Server (/api/agent/heartbeat)
    participant DB as PostgreSQL

    Note over A,DB: Normal failure path (non-Docker)
    S->>A: pendingAction { type: self_update, targetVersion: v1.2.0 }
    A->>A: handleSelfUpdate() → error (checksum mismatch)
    A->>A: failedUpdateVersion = "v1.2.0"
    A->>A: updateError = "checksum mismatch"
    A->>S: POST /heartbeat { updateError: "checksum mismatch" }
    S->>DB: UPDATE vectorNode SET pendingAction = NULL
    S->>A: 200 OK
    A->>A: updateError = "" (cleared)
    Note over A: Next poll: TargetVersion == failedUpdateVersion → skip

    Note over A,DB: Docker path (bug: failedUpdateVersion never set)
    S->>A: pendingAction { type: self_update, targetVersion: v1.2.0 }
    A->>A: deploymentMode == DOCKER → updateError = "running in Docker"
    Note over A: failedUpdateVersion stays "" ← bug
    A->>S: POST /heartbeat { updateError: "running in Docker" }
    S->>DB: UPDATE vectorNode SET pendingAction = NULL
    S->>A: 200 OK
    A->>A: updateError = "" (cleared)
    Note over A: If heartbeat had failed, next poll re-enters Docker branch again

    Note over A,DB: Stale error race (server-side bug)
    A->>A: fails v1.2.0 → updateError = "X", heartbeat fails
    Note over DB: Admin schedules new pendingAction v1.2.1
    A->>S: POST /heartbeat { updateError: "X" } (stale)
    S->>DB: clearPendingAction = true (unconditional) ← clears v1.2.1 action
    Note over S: Should verify pendingAction.type == self_update first

Comments Outside Diff (2)

  1. agent/internal/agent/agent.go, line 148-153 (link)

    Docker path never sets failedUpdateVersion

    The Docker branch sets updateError but never sets a.failedUpdateVersion. This means the early-exit guard:

    if action.TargetVersion == a.failedUpdateVersion {
        return // already failed for this version, don't retry
    }

    will never fire for Docker agents. If heartbeats are failing (e.g. transient network issue) while pendingAction is still present on the server, handlePendingAction will re-enter the Docker branch on every single poll cycle — continuously re-setting updateError = "running in Docker". This is idempotent in practice, but it means Docker agents don't get the same retry-protection that non-Docker agents do.
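
    A minimal sketch of the one-line fix this comment implies — setting failedUpdateVersion alongside updateError in the Docker branch. Names and the branch shape are assumed from the review, not copied from the real agent.go:

    ```go
    package main

    import "fmt"

    type agent struct {
    	deploymentMode      string
    	failedUpdateVersion string
    	updateError         string
    }

    // dockerBranch mirrors the suggested fix: record the target version
    // alongside updateError so the retry-skip guard fires on the next poll
    // even if the heartbeat carrying the error is delayed or fails.
    func (a *agent) dockerBranch(targetVersion string) {
    	a.updateError = "running in Docker, cannot self-update"
    	a.failedUpdateVersion = targetVersion // the line missing in the PR
    }

    func main() {
    	a := &agent{deploymentMode: "docker"}
    	a.dockerBranch("v1.2.0")
    	fmt.Println(a.failedUpdateVersion == "v1.2.0")
    }
    ```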

  2. src/app/api/agent/heartbeat/route.ts, line 132-136 (link)

    Stale updateError can clear an unrelated new pendingAction

    clearPendingAction is set unconditionally whenever updateError is present — without checking what pendingAction is currently in the DB. This creates a narrow but real race:

    1. Agent fails self-update for v1.2.0, sets updateError = "checksum mismatch"
    2. Heartbeat fails transiently
    3. An operator clears the failed action and schedules a NEW pendingAction for v1.2.1
    4. The agent's next heartbeat delivers the stale updateError → server silently clears the v1.2.1 action

    The agent would still attempt v1.2.1 on its next poll (since failedUpdateVersion == "v1.2.0", not "v1.2.1"), so this is a temporary DB inconsistency rather than a hard failure. However, the server UI would briefly show no pending action while the update is actually still in flight, with no record that v1.2.1 was the target.

    The fix is to fetch and inspect the current pendingAction before clearing — similar to the agentVersion path below — and only clear if the action type is still self_update:

    if (updateError) {
      const currentNode = await prisma.vectorNode.findUnique({
        where: { id: agent.nodeId },
        select: { pendingAction: true },
      });
      if (currentNode?.pendingAction) {
        const action = currentNode.pendingAction as { type: string };
        if (action.type === "self_update") {
          clearPendingAction = true;
        }
      }
      console.warn("Agent update failed, clearing pending action:", agent.nodeId, updateError);
    }

Last reviewed commit: cbbc7dd

schema: z.array(z.object({ path: z.string(), type: z.string(), sample: z.string() })).optional(),
error: z.string().optional(),
})).optional(),
updateError: z.string().optional(),

All other string fields in this schema — agentVersion and vectorVersion — are capped with .max(100). updateError has no size limit, so a misbehaving or compromised agent could send an arbitrarily large string that gets passed to console.warn().

Suggested change
updateError: z.string().optional(),
updateError: z.string().max(500).optional(),

Context Used: Rule from dashboard - ## Security & Cryptography Review Rules

When reviewing changes to authentication, authorization, en... (source)

Move updateError clearing to the success path so it retries on
the next heartbeat if delivery fails, matching the existing
sampleResults retry pattern.
@TerrifiedBug TerrifiedBug merged commit 115f96f into main Mar 5, 2026
10 checks passed
@TerrifiedBug TerrifiedBug deleted the fix/update-retry-loop branch March 5, 2026 18:38