[INSTA-73939] Add INSTANA_PERSIST_HOST_UNIQUE_ID env var to fresh agent installations #459

Open
cvrkota wants to merge 13 commits into main from add-fqdn

@cvrkota cvrkota commented Feb 11, 2026

Summary

This PR enables agent ID persistence across pod restarts by automatically setting the INSTANA_PERSIST_HOST_UNIQUE_ID environment variable on new deployments and mounting /var/lib/instana for storage. The changes are upgrade-safe and preserve existing configurations.

Changes

Environment Variable Management:

  • New deployments automatically receive INSTANA_PERSIST_HOST_UNIQUE_ID=true
  • Upgrades preserve existing behavior (no env var added if not present)
  • User-defined values via agent.pod.env take precedence
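The three rules above can be sketched as a small decision function. This is a minimal illustration, not the operator's actual code: the `EnvVar` type, `find`, and `desiredEnv` are hypothetical names, and the sketch assumes the controller already knows whether a DaemonSet exists and what environment it currently carries.

```go
package main

import "fmt"

// EnvVar is a simplified, hypothetical stand-in for a container env var.
type EnvVar struct {
	Name  string
	Value string
}

const persistHostUniqueID = "INSTANA_PERSIST_HOST_UNIQUE_ID"

// find returns the value stored under name and whether it was present.
func find(env []EnvVar, name string) (string, bool) {
	for _, e := range env {
		if e.Name == name {
			return e.Value, true
		}
	}
	return "", false
}

// desiredEnv computes the env list for the agent container.
//   - userEnv:         values the user supplied (agent.pod.env in the PR)
//   - existingEnv:     env of the DaemonSet already in the cluster, if any
//   - daemonSetExists: false for a fresh installation, true for an upgrade
func desiredEnv(userEnv, existingEnv []EnvVar, daemonSetExists bool) []EnvVar {
	out := append([]EnvVar(nil), userEnv...)

	// Rule 3: a user-defined value always takes precedence.
	if _, ok := find(userEnv, persistHostUniqueID); ok {
		return out
	}
	// Rule 1: new deployments receive the variable set to "true".
	if !daemonSetExists {
		return append(out, EnvVar{Name: persistHostUniqueID, Value: "true"})
	}
	// Rule 2: upgrades keep the variable only if it was already there.
	if v, ok := find(existingEnv, persistHostUniqueID); ok {
		out = append(out, EnvVar{Name: persistHostUniqueID, Value: v})
	}
	return out
}

func main() {
	fmt.Println("fresh install:", desiredEnv(nil, nil, false))
	fmt.Println("upgrade, var absent:", desiredEnv(nil, nil, true))
	fmt.Println("user override:", desiredEnv(
		[]EnvVar{{Name: persistHostUniqueID, Value: "false"}}, nil, false))
}
```

Checking the user-supplied value first keeps the precedence rule independent of whether the deployment is new or an upgrade.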

Volume Configuration:

  • Added /var/lib/instana volume mount with DirectoryOrCreate type
  • Scoped to /var/lib/instana (not entire /var/lib) to minimize CVE exposure
  • Agent ID persists to /var/lib/instana/instana-agent-id

Implementation:

  • Controller detects new vs upgrade scenarios by checking existing DaemonSet
  • Multi-zone deployments handle each zone independently
  • Comprehensive test coverage including E2E validation

Testing

  • Unit tests verify conditional env var setting and volume configuration
  • E2E test confirms agent ID persistence across pod restarts
  • All existing tests updated and passing

References

Checklist

  • Backwards compatible?
  • Release notes in public docs updated?
  • unit/e2e test coverage added or updated?

Note: Remember to run a helm chart release after the operator release to make the changes available through helm.

@cvrkota cvrkota requested a review from a team as a code owner February 11, 2026 16:08
Comment on lines +82 to +96
	if err != nil {
		if apierrors.IsNotFound(err) {
			log.V(1).
				Info("DaemonSet not found, will set INSTANA_PERSIST_HOST_UNIQUE_ID", "daemonset", dsName)
			return true, reconcileContinue()
		}
		// On other errors, log and default to not setting it to be safe
		log.Error(
			err,
			"failed to check existing DaemonSet, will not set INSTANA_PERSIST_HOST_UNIQUE_ID",
			"daemonset",
			dsName,
		)
		return false, reconcileContinue()
	}

When r.client.Get fails for reasons other than NotFound (for example transient API server/network errors), this code logs and returns false with reconcileContinue(), so reconciliation proceeds with a desired DaemonSet that omits INSTANA_PERSIST_HOST_UNIQUE_ID. In that failure mode, an existing DaemonSet that already had the env var can be patched to remove it, which violates the intended "keep if already present" behavior and can change host identity persistence unexpectedly; this path should return a reconcile failure/requeue instead of continuing.


I pushed a potential patch

konrad-ohms previously approved these changes Feb 19, 2026
@cvrkota cvrkota changed the title from "[INSTA-73939] Add FQDN env var to fresh agent installations" to "[INSTA-73939] Add INSTANA_PERSIST_HOST_UNIQUE_ID env var to fresh agent installations" Mar 9, 2026
@konrad-ohms

Can you please check the test case results? Looks like the e2e test is not working yet:

=== RUN   TestAgentIDPersistence/agent_ID_persistence_across_pod_restarts/verify_INSTANA_PERSIST_HOST_UNIQUE_ID_env_var_is_set
    agent_id_persistence_test.go:54: Verifying INSTANA_PERSIST_HOST_UNIQUE_ID environment variable is set
    agent_id_persistence_test.go:95: ✓ INSTANA_PERSIST_HOST_UNIQUE_ID is set to 'true'
=== RUN   TestAgentIDPersistence/agent_ID_persistence_across_pod_restarts/get_initial_agent_ID_from_pod
    agent_id_persistence_test.go:111: Getting initial agent ID from pod
    agent_id_persistence_test.go:127: Reading agent ID from pod: instana-agent-rwzdg
    agent_id_persistence_test.go:140: cat: /var/lib/instana/instana-agent-id: No such file or directory
        
    agent_id_persistence_test.go:141: Failed to read agent ID from pod: command terminated with exit code 1
--- FAIL: TestAgentIDPersistence (59.30s)
    --- FAIL: TestAgentIDPersistence/agent_ID_persistence_across_pod_restarts (57.14s)
        --- PASS: TestAgentIDPersistence/agent_ID_persistence_across_pod_restarts/wait_for_instana-agent-controller-manager_deployment_to_become_ready (20.16s)
        --- PASS: TestAgentIDPersistence/agent_ID_persistence_across_pod_restarts/wait_for_k8sensor_deployment_to_become_ready (5.17s)
        --- PASS: TestAgentIDPersistence/agent_ID_persistence_across_pod_restarts/wait_for_agent_daemonset_to_become_ready (5.13s)
        --- PASS: TestAgentIDPersistence/agent_ID_persistence_across_pod_restarts/verify_INSTANA_PERSIST_HOST_UNIQUE_ID_env_var_is_set (0.17s)
        --- FAIL: TestAgentIDPersistence/agent_ID_persistence_across_pod_restarts/get_initial_agent_ID_from_pod (0.42s)
FAIL

cvrkota commented Mar 12, 2026

We were waiting on the agent release, which ran yesterday, before we could run this release. I tested the changes locally in the OCP cluster:

2026-03-12T13:35:07.087GMT | INFO  | features-3-thread-1              | rsistenceService | com.instana.agent - 1.1.768 | persistToDisk | Successfully persisted agent ID to /proc/1/root/var/lib/instana/instana-agent-id

With the new volume mount, the file is present on the pod:

[root@worker1 instana]# cat /var/lib/instana-agent-id
2e1f31134211993c

After restarting the pod, the ID is persisted on the node.

Retriggering the pipeline

@cvrkota cvrkota force-pushed the add-fqdn branch 2 times, most recently from 06d517d to f3ac56c Compare March 12, 2026 13:58
Milica-Cvrkota-IBM and others added 13 commits March 12, 2026 14:58
Signed-off-by: Milica Cvrkota <milica.cvrkota@ibm.com>
Signed-off-by: Milica Cvrkota <milica.cvrkota@ibm.com>
Return reconcile failure when reading an existing daemonset fails with non-NotFound errors, and propagate that failure in zoned daemonset builder selection to prevent unintended env var removal.
Add tests for helper failure behavior, zoned builder failure propagation, and apply short-circuit when daemonset reads fail.

Signed-off-by: Konrad Ohms <konrad.ohms@de.ibm.com>
Signed-off-by: Milica Cvrkota <milica.cvrkota@ibm.com>
Signed-off-by: Milica Cvrkota <milica.cvrkota@ibm.com>
Signed-off-by: Milica Cvrkota <milica.cvrkota@ibm.com>
Signed-off-by: Milica Cvrkota <milica.cvrkota@ibm.com>
Signed-off-by: Milica Cvrkota <milica.cvrkota@ibm.com>
Signed-off-by: Milica Cvrkota <milica.cvrkota@ibm.com>
Signed-off-by: Milica Cvrkota <milica.cvrkota@ibm.com>
Signed-off-by: Milica Cvrkota <milica.cvrkota@ibm.com>
Signed-off-by: Milica Cvrkota <milica.cvrkota@ibm.com>
Signed-off-by: Milica Cvrkota <milica.cvrkota@ibm.com>
Signed-off-by: Milica Cvrkota <milica.cvrkota@ibm.com>
@cvrkota cvrkota requested a review from konrad-ohms March 12, 2026 15:50