4 changes: 4 additions & 0 deletions agent/internal/agent/agent.go

```diff
@@ -151,6 +151,10 @@ func (a *Agent) handlePendingAction(action *client.PendingAction) {
 		a.updateError = "running in Docker"
 		return
 	}
+	if action.TargetVersion == Version {
+		slog.Debug("already running target version, skipping update", "version", Version)
+		return
+	}
 	if action.TargetVersion == a.failedUpdateVersion {
 		return // already failed for this version, don't retry
 	}
```
5 changes: 5 additions & 0 deletions agent/internal/agent/updater.go

```diff
@@ -84,6 +84,11 @@ func (a *Agent) handleSelfUpdate(action *client.PendingAction) error {
 		return fmt.Errorf("replace executable: %w", err)
 	}
 
+	// Shut down all pipelines before re-exec to avoid orphaned Vector processes
+	// and port conflicts when the new agent starts them again.
+	slog.Info("stopping all pipelines before re-exec")
+	a.supervisor.ShutdownAll()
+
```
Review comment on `agent/internal/agent/updater.go`, line 90:

**`ShutdownAll` can block up to 30 seconds post-rename**

`ShutdownAll()` sends `SIGTERM` to every Vector process and then waits up to **30 seconds** for each one (they run in parallel, so worst-case is 30 s total). This happens *after* the new binary has already been atomically renamed into place at `execPath`. If the agent process is killed during this window (OOM, stray signal, hardware reset), the new binary sits in place but the agent isn't running — there's nothing to auto-restart it.

The window is bounded and the parallel shutdown keeps it at one timeout cycle regardless of pipeline count, so this is a reasonable trade-off. It's worth a comment calling out the intentional 30 s window, and potentially wrapping `ShutdownAll` in a context with a hard deadline to make the bound explicit:

```go
// ShutdownAll waits for all Vector processes to exit (up to 30 s each,
// running in parallel) before we re-exec. The new binary is already in
// place at execPath; if this process is killed here the agent will need
// to be restarted manually.
slog.Info("stopping all pipelines before re-exec")
a.supervisor.ShutdownAll()
```
```diff
 	slog.Info("binary replaced, re-executing", "path", execPath, "version", action.TargetVersion)
 
 	// Re-exec the process — this replaces the current process entirely
```