Conversation
- Skip update when already running target version (breaks re-exec loop) - Shut down all pipelines before syscall.Exec to prevent port conflicts
Greptile SummaryThis PR fixes two related issues in the agent's self-update flow: an infinite Changes:
Correctness: The version guard is well-placed — it runs before the Confidence Score: 4/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant A1 as Agent v1
participant S as Server
participant FS as Filesystem
participant A2 as Agent v2
A1->>S: pollAndApply()
S-->>A1: pendingAction {self_update, targetVersion: v2}
Note over A1: Version check: "v1" ≠ "v2" → proceed
A1->>FS: Download + verify binary
A1->>FS: os.Rename(tmp → execPath)
A1->>A1: supervisor.ShutdownAll() (≤30s)
A1->>A2: syscall.Exec (process replaced)
A2->>S: pollAndApply()
S-->>A2: pendingAction {self_update, targetVersion: v2}
Note over A2: Version check: "v2" == "v2" → SKIP (loop prevented)
A2->>S: sendHeartbeat()
S-->>A2: pendingAction cleared
Last reviewed commit: b4b3fae |
| // Shut down all pipelines before re-exec to avoid orphaned Vector processes | ||
| // and port conflicts when the new agent starts them again. | ||
| slog.Info("stopping all pipelines before re-exec") | ||
| a.supervisor.ShutdownAll() |
There was a problem hiding this comment.
ShutdownAll can block up to 30 seconds post-rename
ShutdownAll() sends SIGTERM to every Vector process and then waits up to 30 seconds for each one (they run in parallel, so worst-case is 30 s total). This happens after the new binary has already been atomically renamed into place at execPath. If the agent process is killed during this window (OOM, stray signal, hardware reset), the new binary sits in place but the agent isn't running — there's nothing to auto-restart it.
The window is bounded and the parallel shutdown keeps it at one timeout cycle regardless of pipeline count, so this is a reasonable trade-off. It's worth a comment calling out the intentional 30 s window, and potentially wrapping ShutdownAll in a context with a hard deadline to make the bound explicit:
// ShutdownAll waits for all Vector processes to exit (up to 30 s each,
// running in parallel) before we re-exec. The new binary is already in
// place at execPath; if this process is killed here the agent will need
// to be restarted manually.
slog.Info("stopping all pipelines before re-exec")
a.supervisor.ShutdownAll()Prompt To Fix With AI
This is a comment left during a code review.
Path: agent/internal/agent/updater.go
Line: 90
Comment:
**`ShutdownAll` can block up to 30 seconds post-rename**
`ShutdownAll()` sends `SIGTERM` to every Vector process and then waits up to **30 seconds** for each one (they run in parallel, so worst-case is 30 s total). This happens *after* the new binary has already been atomically renamed into place at `execPath`. If the agent process is killed during this window (OOM, stray signal, hardware reset), the new binary sits in place but the agent isn't running — there's nothing to auto-restart it.
The window is bounded and the parallel shutdown keeps it at one timeout cycle regardless of pipeline count, so this is a reasonable trade-off. It's worth a comment calling out the intentional 30 s window, and potentially wrapping `ShutdownAll` in a context with a hard deadline to make the bound explicit:
```go
// ShutdownAll waits for all Vector processes to exit (up to 30 s each,
// running in parallel) before we re-exec. The new binary is already in
// place at execPath; if this process is killed here the agent will need
// to be restarted manually.
slog.Info("stopping all pipelines before re-exec")
a.supervisor.ShutdownAll()
```
How can I resolve this? If you propose a fix, please make it concise.
Summary
syscall.Execloop where the new process polls, gets the samependingAction, and re-execs again before a heartbeat can clear itsyscall.Execto prevent orphaned processes and port binding conflictsContext
After a successful update, the agent re-execs via
syscall.Exec. The new process starts, polls config immediately, and gets the samependingActionbecause no heartbeat was sent to clear it. This creates a tight loop of download → verify → replace → re-exec → repeat every ~1 second.Test plan