Skip to content

fix: agent reconciles pipeline directory on every poll#10

Merged
TerrifiedBug merged 2 commits intomainfrom
fix/agent-pipeline-reconciliation
Mar 5, 2026
Merged

fix: agent reconciles pipeline directory on every poll#10
TerrifiedBug merged 2 commits intomainfrom
fix/agent-pipeline-reconciliation

Conversation

@TerrifiedBug
Copy link
Owner

Summary

  • After each successful poll, the agent scans the pipelines/ directory and removes any YAML files not present in the server response
  • Also removes companion .vf-metrics.yaml sidecar files for orphaned pipelines
  • Extracts a configFetcher interface to enable unit testing the poller without HTTP

Problem

When a node was deleted and re-enrolled into a different environment, pipeline YAML files from the previous environment remained on disk. The poller's in-memory known map started empty on re-enrollment, so it had no awareness of pre-existing files. This caused the agent to run pipelines from both the old and new environments.

Changes

  • agent/internal/agent/poller.go — added directory reconciliation block + configFetcher interface
  • agent/internal/agent/poller_test.go — two tests covering orphan removal and empty-response cleanup

Test plan

  • TestPoll_RemovesOrphanedPipelineFiles — orphans deleted, valid pipelines and non-YAML files preserved
  • TestPoll_EmptyResponseCleansAllFiles — empty server response removes all YAML files
  • Manual: enroll agent, deploy pipeline, delete node, re-enroll into different env, verify old pipeline YAML is removed on first poll

After each successful poll, scan the pipelines directory and remove any
YAML files (and their .vf-metrics.yaml sidecars) not present in the
server response. This prevents stale configs from accumulating after
node re-enrollment into a different environment or missed undeploys.

Also extracts a configFetcher interface from the poller to enable
unit testing without an HTTP server.
@greptile-apps
Copy link

greptile-apps bot commented Mar 5, 2026

Greptile Summary

This PR fixes orphaned pipeline YAML files persisting on disk after a node is re-enrolled into a different environment by adding a disk-reconciliation pass at the end of Poll(), and extracts a configFetcher interface to enable unit testing without HTTP. The reconciliation correctly scans pipelines/ after each poll and removes any .yaml files (plus their .vf-metrics.yaml sidecars) that are absent from the server response.

  • New configFetcher interface decouples the poller from *client.Client, enabling the two new unit tests to run without a live server.
  • Reconciliation block (lines 154–179) correctly handles the re-enrollment orphan scenario: files on disk that are neither in the current server response nor in the in-memory known map.
  • Gap in the graceful-stop path: the existing stop loop (lines 142–152) deletes the main <id>.yaml file before the reconciliation pass runs. Because the main file is already gone when ReadDir is called, the reconciliation loop never encounters the entry and does not clean up the companion <id>.yaml.vf-metrics.yaml sidecar. Metrics sidecar files will accumulate for every pipeline that is gracefully undeployed through the normal known-map path.
  • Both new tests are well-structured and cover the two intended scenarios.

Confidence Score: 3/5

  • Safe to merge with the sidecar-cleanup gap in the graceful-stop path addressed first.
  • The core reconciliation logic is correct and the new tests validate it well. However, there is a real logic gap: the existing graceful-stop loop removes the main YAML before the reconciliation pass runs, meaning .vf-metrics.yaml sidecars are never cleaned up for normally-undeployed pipelines. This causes silent disk accumulation and directly contradicts a stated goal of the PR (removing sidecar files for orphaned pipelines).
  • agent/internal/agent/poller.go — specifically the graceful-stop loop at lines 142–152.

Comments Outside Diff (1)

  1. agent/internal/agent/poller.go, line 142-152 (link)

    Metrics sidecar not cleaned up in the graceful-stop path

    The existing stop block removes the main .yaml file on line 150 but never removes the companion .vf-metrics.yaml sidecar. Because the main file is deleted here (before the reconciliation pass runs), os.ReadDir will no longer list it, so the reconciliation loop at lines 154–179 will never encounter the entry and will not trigger the sidecar cleanup either. The result is that gracefully-stopped pipelines (i.e. those that go through p.known) leave their sidecar files on disk indefinitely.

    The fix is to mirror the reconciliation block's two-remove pattern here:

Last reviewed commit: 851efcb

Comment on lines +156 to +157
entries, readErr := os.ReadDir(pipelinesDir)
if readErr == nil {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silent reconciliation failure

If os.ReadDir fails (e.g., a transient permission or I/O error), the entire reconciliation is silently skipped with no log message. This makes the failure mode invisible — orphaned files would accumulate with no diagnostic, and the only clue would be the missing "removing orphaned pipeline config" log lines.

Consider logging the error before bailing:

Suggested change
entries, readErr := os.ReadDir(pipelinesDir)
if readErr == nil {
entries, readErr := os.ReadDir(pipelinesDir)
if readErr != nil {
slog.Warn("failed to read pipelines dir for reconciliation", "error", readErr)
} else {

Comment on lines +168 to +171
slog.Warn("removing orphaned pipeline config", "pipelineId", id)
os.Remove(filepath.Join(pipelinesDir, name))
os.Remove(filepath.Join(pipelinesDir, name+".vf-metrics.yaml"))
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlogged os.Remove errors in reconciliation block

Both os.Remove calls discard their return values silently, even though a slog.Warn is already emitted a line above. Compare this to the supervisor's equivalent cleanup at supervisor.go:205–206, which logs a warning when the remove fails. If an orphaned file can't be deleted (e.g., locked by another process or a permission issue), there's no indication of the failure.

Suggested change
slog.Warn("removing orphaned pipeline config", "pipelineId", id)
os.Remove(filepath.Join(pipelinesDir, name))
os.Remove(filepath.Join(pipelinesDir, name+".vf-metrics.yaml"))
}
slog.Warn("removing orphaned pipeline config", "pipelineId", id)
if err := os.Remove(filepath.Join(pipelinesDir, name)); err != nil && !os.IsNotExist(err) {
slog.Warn("failed to remove orphaned pipeline config", "path", name, "error", err)
}
if err := os.Remove(filepath.Join(pipelinesDir, name+".vf-metrics.yaml")); err != nil && !os.IsNotExist(err) {
slog.Warn("failed to remove orphaned metrics sidecar", "path", name+".vf-metrics.yaml", "error", err)
}

Log ReadDir errors during reconciliation instead of silently skipping.
Log os.Remove failures for orphaned files, consistent with supervisor
cleanup pattern. Guard with !os.IsNotExist to avoid noisy logs.
@TerrifiedBug TerrifiedBug merged commit 46ec714 into main Mar 5, 2026
9 checks passed
@TerrifiedBug TerrifiedBug deleted the fix/agent-pipeline-reconciliation branch March 5, 2026 13:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant