fix: stop infinite self-update retry loop on failure#15

Merged
TerrifiedBug merged 3 commits into main from fix/update-retry-loop
Mar 5, 2026
Conversation

@TerrifiedBug
Owner

Summary

  • Agent got stuck in infinite update retry loop when self-update failed (e.g., checksum mismatch from stale server cache)
  • The pendingAction on the VectorNode was never cleared on failure, so the server kept sending it every 15s

Fix

  • Agent: tracks failedUpdateVersion in memory — skips retries for same version
  • Agent: reports updateError in heartbeat payload (sent once, then cleared)
  • Server: clears pendingAction when agent reports update failure via heartbeat

Test plan

  • Trigger an update with a bad checksum — verify agent logs one failure then stops retrying
  • Verify pendingAction is cleared in DB after failure report
  • Verify successful updates still work (syscall.Exec replaces process)
  • Verify a new update for a different version is still attempted after a previous failure

When a self-update failed (e.g., checksum mismatch from stale server
cache), the pendingAction was never cleared. The server kept sending it
every 15s config poll, causing an infinite retry loop.

Fix:
- Agent tracks failed update version and skips retries for same version
- Agent reports updateError in heartbeat payload
- Server clears pendingAction when agent reports update failure
@greptile-apps

greptile-apps bot commented Mar 5, 2026

Greptile Summary

This PR fixes the infinite self-update retry loop by tracking the failed version in agent memory (failedUpdateVersion) and reporting the error to the server via the heartbeat payload so the server can clear pendingAction, breaking the 15-second retry cycle. The overall approach is sound, but two correctness gaps were found.

  • Agent (agent.go): Docker mode sets updateError correctly but never sets failedUpdateVersion, so the retry-skip guard is bypassed for Docker agents — each poll cycle re-enters the Docker branch for as long as the pendingAction remains on the server.
  • Server (route.ts): clearPendingAction is set unconditionally when updateError arrives, without verifying the current DB pendingAction is still a self_update. A stale error delivered after an admin has manually rescheduled a new update will silently clear the new action.
  • Client (client.go): The UpdateError field addition is clean and correct.

Confidence Score: 3/5

  • Safe to merge with caveats — the fix works for the common case but has two logic gaps that should be addressed before this can be considered fully correct.
  • The core mechanism (skip-by-version on agent, clear-on-error on server) is correct for the primary scenario. However, the Docker path never sets failedUpdateVersion, leaving Docker agents in a continuous re-entry loop if heartbeats are slow, and the server-side clearing logic doesn't guard against stale errors arriving after a new pending action has been scheduled, which can silently swallow legitimate update commands.
  • agent/internal/agent/agent.go (Docker branch missing failedUpdateVersion = action.TargetVersion) and src/app/api/agent/heartbeat/route.ts (unconditional clearPendingAction without type-checking the current DB action).

Important Files Changed

  • agent/internal/agent/agent.go — Adds failedUpdateVersion and updateError fields to track and report self-update failures; the Docker branch sets updateError but never sets failedUpdateVersion, so the retry-skip guard never fires for Docker agents.
  • agent/internal/client/client.go — Minimal, clean change: adds the UpdateError string field with omitempty to HeartbeatRequest; no issues.
  • src/app/api/agent/heartbeat/route.ts — Adds updateError to the Zod schema (correctly capped at 500 chars) and clears pendingAction on failure; clearPendingAction is set unconditionally without checking the current DB pendingAction type, risking clearing an unrelated newly scheduled action.

Sequence Diagram

sequenceDiagram
    participant A as Agent (Go)
    participant S as Server (/api/agent/heartbeat)
    participant DB as PostgreSQL

    Note over A,DB: Normal failure path (non-Docker)
    S->>A: pendingAction { type: self_update, targetVersion: v1.2.0 }
    A->>A: handleSelfUpdate() → error (checksum mismatch)
    A->>A: failedUpdateVersion = "v1.2.0"
    A->>A: updateError = "checksum mismatch"
    A->>S: POST /heartbeat { updateError: "checksum mismatch" }
    S->>DB: UPDATE vectorNode SET pendingAction = NULL
    S->>A: 200 OK
    A->>A: updateError = "" (cleared)
    Note over A: Next poll: TargetVersion == failedUpdateVersion → skip

    Note over A,DB: Docker path (bug: failedUpdateVersion never set)
    S->>A: pendingAction { type: self_update, targetVersion: v1.2.0 }
    A->>A: deploymentMode == DOCKER → updateError = "running in Docker"
    Note over A: failedUpdateVersion stays "" ← bug
    A->>S: POST /heartbeat { updateError: "running in Docker" }
    S->>DB: UPDATE vectorNode SET pendingAction = NULL
    S->>A: 200 OK
    A->>A: updateError = "" (cleared)
    Note over A: If heartbeat had failed, next poll re-enters Docker branch again

    Note over A,DB: Stale error race (server-side bug)
    A->>A: fails v1.2.0 → updateError = "X", heartbeat fails
    Note over DB: Admin schedules new pendingAction v1.2.1
    A->>S: POST /heartbeat { updateError: "X" } (stale)
    S->>DB: clearPendingAction = true (unconditional) ← clears v1.2.1 action
    Note over S: Should verify pendingAction.type == self_update first

Comments Outside Diff (2)

  1. agent/internal/agent/agent.go, line 148-153 (link)

    Docker path never sets failedUpdateVersion

    The Docker branch sets updateError but never sets a.failedUpdateVersion. This means the early-exit guard:

    if action.TargetVersion == a.failedUpdateVersion {
        return // already failed for this version, don't retry
    }

    will never fire for Docker agents. If heartbeats are failing (e.g. transient network issue) while pendingAction is still present on the server, handlePendingAction will re-enter the Docker branch on every single poll cycle — continuously re-setting updateError = "running in Docker". This is idempotent in practice, but it means Docker agents don't get the same retry-protection that non-Docker agents do.
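
    A minimal sketch of the one-line fix this comment implies — setting failedUpdateVersion alongside updateError in the Docker branch. Names and the branch shape are assumed from the review, not copied from the real agent.go:

    ```go
    package main

    import "fmt"

    type agent struct {
    	deploymentMode      string
    	failedUpdateVersion string
    	updateError         string
    }

    // dockerBranch mirrors the suggested fix: record the target version
    // alongside updateError so the retry-skip guard fires on the next poll
    // even if the heartbeat carrying the error is delayed or fails.
    func (a *agent) dockerBranch(targetVersion string) {
    	a.updateError = "running in Docker, cannot self-update"
    	a.failedUpdateVersion = targetVersion // the line missing in the PR
    }

    func main() {
    	a := &agent{deploymentMode: "docker"}
    	a.dockerBranch("v1.2.0")
    	fmt.Println(a.failedUpdateVersion == "v1.2.0")
    }
    ```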

  2. src/app/api/agent/heartbeat/route.ts, line 132-136 (link)

    Stale updateError can clear an unrelated new pendingAction

    clearPendingAction is set unconditionally whenever updateError is present — without checking what pendingAction is currently in the DB. This creates a narrow but real race:

    1. Agent fails self-update for v1.2.0, sets updateError = "checksum mismatch"
    2. Heartbeat fails transiently
    3. An operator clears the failed action and schedules a NEW pendingAction for v1.2.1
    4. The agent's next heartbeat delivers the stale updateError → server silently clears the v1.2.1 action

    The agent would still attempt v1.2.1 on its next poll (since failedUpdateVersion == "v1.2.0", not "v1.2.1"), so this is a temporary DB inconsistency rather than a hard failure. However, the server UI would briefly show no pending action while the update is actually still in flight, with no record that v1.2.1 was the target.

    The fix is to fetch and inspect the current pendingAction before clearing — similar to the agentVersion path below — and only clear if the action type is still self_update:

    if (updateError) {
      const currentNode = await prisma.vectorNode.findUnique({
        where: { id: agent.nodeId },
        select: { pendingAction: true },
      });
      if (currentNode?.pendingAction) {
        const action = currentNode.pendingAction as { type: string };
        if (action.type === "self_update") {
          clearPendingAction = true;
        }
      }
      console.warn("Agent update failed, clearing pending action:", agent.nodeId, updateError);
    }

Last reviewed commit: cbbc7dd

schema: z.array(z.object({ path: z.string(), type: z.string(), sample: z.string() })).optional(),
error: z.string().optional(),
})).optional(),
updateError: z.string().optional(),

All other string fields in this schema — agentVersion and vectorVersion — are capped with .max(100). updateError has no size limit, so a misbehaving or compromised agent could send an arbitrarily large string that gets passed to console.warn().

Suggested change
updateError: z.string().optional(),
updateError: z.string().max(500).optional(),

Context Used: Rule from dashboard - ## Security & Cryptography Review Rules

When reviewing changes to authentication, authorization, en... (source)

Move updateError clearing to the success path so it retries on
the next heartbeat if delivery fails, matching the existing
sampleResults retry pattern.
@TerrifiedBug TerrifiedBug merged commit 115f96f into main Mar 5, 2026
10 checks passed
@TerrifiedBug TerrifiedBug deleted the fix/update-retry-loop branch March 5, 2026 18:38