fix: stop infinite self-update retry loop on failure #15
When a self-update failed (e.g., checksum mismatch from a stale server cache), the pendingAction was never cleared. The server kept sending it on every 15s config poll, causing an infinite retry loop.
Fix:
- Agent tracks the failed update version and skips retries for the same version
- Agent reports updateError in the heartbeat payload
- Server clears pendingAction when the agent reports an update failure
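The agent-side half of the fix can be sketched roughly as follows. This is only an illustration: the real agent is written in Go, and the `UpdateTracker` class and its method names are hypothetical, chosen to mirror the `failedUpdateVersion` and `updateError` fields mentioned in this PR.

```typescript
// Hypothetical sketch; the actual agent is Go and its structure may differ.
interface PendingAction {
  type: string;
  targetVersion: string;
}

class UpdateTracker {
  // Version whose update already failed in this process lifetime.
  private failedUpdateVersion = "";
  // Error message waiting to be reported in the next heartbeat.
  private updateError = "";

  // True when the action is a self-update for a version that has not failed yet.
  shouldAttempt(action: PendingAction): boolean {
    if (action.type !== "self_update") return false;
    return action.targetVersion !== this.failedUpdateVersion;
  }

  // Record both the version and the message in one place, so no failure
  // branch can set the error without also marking the version as failed.
  recordFailure(version: string, message: string): void {
    this.failedUpdateVersion = version;
    this.updateError = message;
  }

  pendingError(): string {
    return this.updateError;
  }
}
```

Because the state lives only in memory, a restarted agent would retry the same version once more; that matches the "in agent memory" wording of the PR but is otherwise an assumption of this sketch.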
Greptile Summary
This PR fixes the infinite self-update retry loop by tracking the failed version in agent memory (
Confidence Score: 3/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant A as Agent (Go)
participant S as Server (/api/agent/heartbeat)
participant DB as PostgreSQL
Note over A,DB: Normal failure path (non-Docker)
S->>A: pendingAction { type: self_update, targetVersion: v1.2.0 }
A->>A: handleSelfUpdate() → error (checksum mismatch)
A->>A: failedUpdateVersion = "v1.2.0"
A->>A: updateError = "checksum mismatch"
A->>S: POST /heartbeat { updateError: "checksum mismatch" }
S->>DB: UPDATE vectorNode SET pendingAction = NULL
S->>A: 200 OK
A->>A: updateError = "" (cleared)
Note over A: Next poll: TargetVersion == failedUpdateVersion → skip
Note over A,DB: Docker path (bug: failedUpdateVersion never set)
S->>A: pendingAction { type: self_update, targetVersion: v1.2.0 }
A->>A: deploymentMode == DOCKER → updateError = "running in Docker"
Note over A: failedUpdateVersion stays "" ← bug
A->>S: POST /heartbeat { updateError: "running in Docker" }
S->>DB: UPDATE vectorNode SET pendingAction = NULL
S->>A: 200 OK
A->>A: updateError = "" (cleared)
Note over A: If heartbeat had failed, next poll re-enters Docker branch again
Note over A,DB: Stale error race (server-side bug)
A->>A: fails v1.2.0 → updateError = "X", heartbeat fails
Note over DB: Admin schedules new pendingAction v1.2.1
A->>S: POST /heartbeat { updateError: "X" } (stale)
S->>DB: clearPendingAction = true (unconditional) ← clears v1.2.1 action
Note over S: Should verify pendingAction.type == self_update first
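The guard suggested in the last note of the diagram could look something like the sketch below. The function name `shouldClearPendingAction` and the `PendingAction` shape are hypothetical; only the `type`, `targetVersion`, and `updateError` names come from the diagram.

```typescript
// Hypothetical sketch of the server-side check; not the route's actual code.
interface PendingAction {
  type: string;
  targetVersion?: string;
}

// Decide whether a reported updateError should clear the stored action.
function shouldClearPendingAction(
  pendingAction: PendingAction | null,
  updateError: string | undefined,
): boolean {
  // No error reported (missing or empty string): nothing to clear.
  if (!updateError) return false;
  // No action stored: nothing to clear.
  if (pendingAction === null) return false;
  // Only a self-update action can be invalidated by an update error.
  return pendingAction.type === "self_update";
}
```

Note that a type check alone would not separate the v1.2.0 failure from a newly scheduled v1.2.1 self-update, since both have type `self_update`. Distinguishing them would presumably require the agent to report the failed version as well, which is not part of the payload shown here.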
src/app/api/agent/heartbeat/route.ts
schema: z.array(z.object({ path: z.string(), type: z.string(), sample: z.string() })).optional(),
error: z.string().optional(),
})).optional(),
updateError: z.string().optional(),
All other string fields in this schema — agentVersion and vectorVersion — are capped with .max(100). updateError has no size limit, so a misbehaving or compromised agent could send an arbitrarily large string that gets passed to console.warn().
- updateError: z.string().optional(),
+ updateError: z.string().max(500).optional(),
Context Used: Rule from dashboard - ## Security & Cryptography Review Rules
When reviewing changes to authentication, authorization, en... (source)
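For readers unfamiliar with the schema library, the following dependency-free sketch shows roughly what the suggested `.max(500).optional()` constraint enforces. The helper name `parseUpdateError` and the constant are hypothetical, and the real route would rely on the zod schema rather than hand-written checks.

```typescript
// Hypothetical stand-in for the schema constraint; 500 mirrors the suggestion above.
const MAX_UPDATE_ERROR_LENGTH = 500;

// Accepts undefined (field omitted) or a string of at most 500 characters.
function parseUpdateError(value: unknown): string | undefined {
  if (value === undefined) return undefined;
  if (typeof value !== "string") {
    throw new TypeError("updateError must be a string");
  }
  if (value.length > MAX_UPDATE_ERROR_LENGTH) {
    throw new RangeError("updateError exceeds the maximum length");
  }
  return value;
}
```

Rejecting oversized values before they reach `console.warn()` keeps a single heartbeat from writing an unbounded amount of data to the server log.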
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/app/api/agent/heartbeat/route.ts
Line: 69
Comment:
All other string fields in this schema — `agentVersion` and `vectorVersion` — are capped with `.max(100)`. `updateError` has no size limit, so a misbehaving or compromised agent could send an arbitrarily large string that gets passed to `console.warn()`.
```suggestion
updateError: z.string().max(500).optional(),
```
**Context Used:** Rule from `dashboard` - ## Security & Cryptography Review Rules
When reviewing changes to authentication, authorization, en... ([source](https://app.greptile.com/review/custom-context?memory=7cb20c56-ca6a-40aa-8660-7fa75e6e3db2))
How can I resolve this? If you propose a fix, please make it concise.
Move updateError clearing to the success path so it retries on the next heartbeat if delivery fails, matching the existing sampleResults retry pattern.
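A minimal sketch of clearing on the success path, assuming a hypothetical `HeartbeatState` class and a synchronous `send` callback standing in for the real HTTP POST (the actual agent is Go and sends asynchronously):

```typescript
// Hypothetical sketch; names and the synchronous callback are assumptions.
type HeartbeatPayload = { updateError?: string };
type SendHeartbeat = (payload: HeartbeatPayload) => boolean;

class HeartbeatState {
  // Error waiting to be delivered to the server; empty means nothing pending.
  updateError = "";

  // Send one heartbeat; keep updateError unless delivery succeeded.
  flush(send: SendHeartbeat): boolean {
    const payload: HeartbeatPayload =
      this.updateError !== "" ? { updateError: this.updateError } : {};
    let delivered = false;
    try {
      delivered = send(payload);
    } catch {
      // Treat a thrown transport error as a failed delivery.
      delivered = false;
    }
    if (delivered) {
      // Clear only after the server has received the error.
      this.updateError = "";
    }
    return delivered;
  }
}
```

With this ordering a failed heartbeat leaves `updateError` in place, so the next heartbeat carries it again instead of silently dropping the report.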
Summary
- `pendingAction` on the VectorNode was never cleared on failure, so the server kept sending it every 15s

Fix
- Agent tracks `failedUpdateVersion` in memory and skips retries for the same version
- Agent reports `updateError` in the heartbeat payload (sent once, then cleared)
- Server clears `pendingAction` when the agent reports an update failure via heartbeat

Test plan