Conversation
Pull request overview
Adds a periodic node-condition monitoring/remediation loop to the agent daemon, intended to detect the KernelDeadlock node condition and reboot the node when it occurs.
Changes:
- Start a new daemon loop that queries the Kubernetes Node object on a 1-minute cadence and triggers a reboot when `KernelDeadlock=True` after host boot time.
- Add helper functions to determine boot time, node name, and execute a reboot.
- Update Go module dependencies (go.mod/go.sum) to include additional indirect dependencies pulled in by the new Kubernetes client usage.
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| commands.go | Adds node condition monitoring/remediation loop and reboot helpers. |
| go.mod | Adds new indirect dependencies required by the new code paths. |
| go.sum | Records checksums for newly introduced/updated indirect dependencies. |
```go
wg.Add(1)
```
startDaemonLoops pre-increments the WaitGroup counter for the loops it starts, but startNodeConditionLoop also calls wg.Add(1) internally. This hidden increment is inconsistent with the other loops and makes it easier to accidentally introduce a WaitGroup misuse/panic later; consider moving the Add(1) into startDaemonLoops and removing it from here.
Suggested change:
```diff
-	wg.Add(1)
```
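A minimal sketch of the centralized bookkeeping. The loop names come from the review; the channel wiring and signatures here are illustrative, not the repo's actual API:

```go
package main

import (
	"fmt"
	"sync"
)

// startNodeConditionLoop no longer calls wg.Add itself; the caller owns the
// counter, matching the other daemon loops.
func startNodeConditionLoop(wg *sync.WaitGroup, stop <-chan struct{}) {
	defer wg.Done()
	<-stop // stands in for the real ticker loop body
}

// startDaemonLoops pairs every Add(1) with the go statement it protects,
// so the counter bookkeeping lives in exactly one place.
func startDaemonLoops(stop chan struct{}) *sync.WaitGroup {
	var wg sync.WaitGroup
	wg.Add(1)
	go startNodeConditionLoop(&wg, stop)
	return &wg
}

func main() {
	stop := make(chan struct{})
	wg := startDaemonLoops(stop)
	close(stop)
	wg.Wait()
	fmt.Println("all loops stopped")
}
```

Keeping Add(1) adjacent to the `go` statement makes it obvious the counter is balanced, so a future loop can't double-count or panic the WaitGroup.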
```go
// Load kubeconfig
config, err := clientcmd.BuildConfigFromFlags("", "/var/lib/kubelet/kubelet/kubeconfig")
if err != nil {
	logger.Errorf("failed to load kubeconfig: %s", err.Error())
	return
}

// Create Kubernetes clientset
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
	logger.Errorf("failed to create clientset: %s", err.Error())
	return
}
```
clientcmd.BuildConfigFromFlags and kubernetes.NewForConfig are executed on every tick. Since these typically only depend on local files and can be reused, consider creating the REST config/clientset once outside the ticker loop and reusing them (recreating only on failure) to reduce per-minute I/O and allocations.
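A sketch of the create-once, rebuild-on-failure pattern. The clientset here is a stub standing in for `clientcmd.BuildConfigFromFlags` plus `kubernetes.NewForConfig`; only the caching shape is the point:

```go
package main

import "fmt"

// builds counts constructions of the stubbed clientset, standing in for the
// kubeconfig load + NewForConfig pair the review suggests running once.
var builds int

type clientset struct{}

func newClientset() (*clientset, error) {
	builds++
	return &clientset{}, nil
}

// tick reuses the cached clientset and only drops it (forcing a rebuild on
// the next tick) when a call through it fails.
func tick(cached **clientset, callFails bool) error {
	if *cached == nil {
		cs, err := newClientset()
		if err != nil {
			return err // log and retry on the next tick
		}
		*cached = cs
	}
	if callFails {
		*cached = nil // recreate only on failure
		return fmt.Errorf("node get failed")
	}
	return nil
}

func main() {
	var cs *clientset
	for i := 0; i < 5; i++ {
		_ = tick(&cs, false)
	}
	fmt.Println("builds:", builds) // one build serves all five ticks
}
```

With this shape the per-minute tick does no file I/O or TLS setup in the steady state, and a transient failure still heals itself one tick later.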
```go
node, err := clientset.CoreV1().Nodes().Get(context.Background(), nodeName, metav1.GetOptions{})
if err != nil {
	logger.Errorf("failed to get node %s: %s", nodeName, err.Error())
}

hostBootTime, err := getBootTime()
if err != nil {
	logger.Errorf("failed to get host boot time: %s", err.Error())
	return
}

for _, condition := range node.Status.Conditions {
```
If the Nodes().Get call returns an error, node can be nil; the code then ranges over node.Status.Conditions, which can panic. Handle the error by returning/continuing before dereferencing node (or guard against nil).
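The guard can be sketched with stub types (these minimal structs stand in for the client-go Node types; the real loop would log and skip the tick rather than return a count):

```go
package main

import (
	"errors"
	"fmt"
)

// Minimal stand-ins for the Node types returned by the Kubernetes client.
type nodeStatus struct{ Conditions []string }
type node struct{ Status nodeStatus }

// getNode mimics clientset.CoreV1().Nodes().Get: on error the returned
// pointer is nil, so callers must not range over node.Status.Conditions.
func getNode(fail bool) (*node, error) {
	if fail {
		return nil, errors.New("apiserver unreachable")
	}
	return &node{Status: nodeStatus{Conditions: []string{"KernelDeadlock"}}}, nil
}

// countConditions returns early on error instead of dereferencing nil.
func countConditions(fail bool) int {
	n, err := getNode(fail)
	if err != nil {
		// In the daemon loop this would be logged, then continue to the next tick.
		return 0
	}
	return len(n.Status.Conditions)
}

func main() {
	fmt.Println(countConditions(true), countConditions(false)) // prints: 0 1
}
```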
commands.go (outdated)
```go
config, err := clientcmd.BuildConfigFromFlags("", "/var/lib/kubelet/kubelet/kubeconfig")
if err != nil {
	logger.Errorf("failed to load kubeconfig: %s", err.Error())
	return
}

// Create Kubernetes clientset
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
	logger.Errorf("failed to create clientset: %s", err.Error())
	return
}
```
These error paths return from the goroutine, which permanently stops node-condition monitoring after a transient failure (e.g., kubeconfig temporarily unavailable). Prefer logging and continuing to the next tick so the daemon can self-recover like the other loops in this file.
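The self-recovering loop shape can be sketched without the Kubernetes client at all; `fail` here stands in for a kubeconfig-load error on a given tick, and everything else is illustrative:

```go
package main

import "fmt"

// runTicks models the daemon loop: fail(i) reports whether loading the
// kubeconfig failed on tick i. A failure skips the tick instead of
// returning, so the loop self-recovers once the transient error clears.
func runTicks(n int, fail func(i int) bool) int {
	checked := 0
	for i := 0; i < n; i++ {
		if fail(i) {
			continue // log the error and wait for the next tick
		}
		checked++
	}
	return checked
}

func main() {
	// First two ticks fail (kubeconfig temporarily unavailable), rest succeed.
	got := runTicks(5, func(i int) bool { return i < 2 })
	fmt.Println("successful checks:", got) // prints: successful checks: 3
}
```

Had the failing tick used `return` instead of `continue`, the count would stay at zero forever even after the file reappeared.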
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 2 out of 3 changed files in this pull request and generated 4 comments.
```go
// Load kubeconfig
kubeConfig, err := clientcmd.BuildConfigFromFlags("", config.KubeletKubeconfigPath)
if err != nil {
```
config, err := clientcmd.BuildConfigFromFlags(..., "/var/lib/kubelet/kubelet/kubeconfig") hardcodes a path that already exists as config.KubeletKubeconfigPath and also introduces a local variable named config that shadows the imported pkg/config identifier. Use the shared constant and rename the local to something like kubeConfig to avoid confusion.
```go
func rebootNode() error {
	rebootCmd := exec.Command("/usr/bin/nsenter", "-m/proc/1/ns/mnt",
		"/bin/bash", "-c", "echo b > /proc/sysrq-trigger")

	return rebootCmd.Run()
```
Direct exec.Command usage here will be flagged by the repo’s enabled gosec linter (this codebase typically suppresses with #nosec or routes through pkg/utils helpers). Consider using utils.RunSystemCommand / RunCommandWithOutput (to preserve stderr for debugging) or add an explicit suppression comment with justification.
Pull request overview
Copilot reviewed 2 out of 3 changed files in this pull request and generated 5 comments.
```go
	return
}
```
In this loop, errors when loading kubeconfig cause a return, which stops the goroutine permanently and prevents any future node-condition checks. Consider logging the error and continuing to the next tick (or implementing a backoff) so transient failures don’t disable remediation for the lifetime of the agent process.
Suggested change:
```diff
-	return
-}
```
```go
// Get the node
node, err := clientset.CoreV1().Nodes().Get(context.Background(), nodeName, metav1.GetOptions{})
```
defer cancel() is inside a long-running for loop, so the cancels will be deferred until the goroutine exits (potentially never), leaking per-iteration resources. Scope the timeout context to a small inner function/block and call cancel() at the end of each iteration instead of deferring in the outer loop.
Suggested change:
```diff
-// Get the node
-node, err := clientset.CoreV1().Nodes().Get(context.Background(), nodeName, metav1.GetOptions{})
+// Get the node with a per-call context to avoid leaking resources across iterations
+ctx, cancel := context.WithCancel(context.Background())
+node, err := clientset.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
+cancel()
```
```go
// Get the node
node, err := clientset.CoreV1().Nodes().Get(context.Background(), nodeName, metav1.GetOptions{})
if err != nil {
	logger.Errorf("failed to get node %s: %s", nodeName, err.Error())
```
If the Node GET fails, the code logs the error but then continues and dereferences node.Status.Conditions, which will panic when node is nil. This should return/continue on error (and avoid attempting remediation) so the agent doesn’t crash the loop on API failures.
```go
func rebootNode() error {
	rebootCmd := exec.Command("/usr/bin/nsenter", "-m/proc/1/ns/mnt",
		"/bin/bash", "-c", "echo b > /proc/sysrq-trigger")

	return rebootCmd.Run()
}
```
This introduces an unconditional host reboot action (via sysrq-trigger) in the main daemon loop. Consider gating this behavior behind an explicit config/feature flag (similar to EnableDriftDetectionAndRemediation) and adding rate limiting/guardrails to reduce the risk of reboot loops or unexpected reboots in environments that don’t want automated power actions.
```go
rebootCmd := exec.Command("/usr/bin/nsenter", "-m/proc/1/ns/mnt",
	"/bin/bash", "-c", "echo b > /proc/sysrq-trigger")

return rebootCmd.Run()
```
Direct use of exec.Command here may trigger gosec (G204) and also provides no timeout or captured stderr/stdout for troubleshooting. Prefer using the repo’s command execution helper (e.g., pkg/utils.RunSystemCommand / RunCommandWithOutput) or add a scoped #nosec with justification and use exec.CommandContext with a timeout so failures are observable and the call can’t hang indefinitely.
Suggested change:
```diff
-rebootCmd := exec.Command("/usr/bin/nsenter", "-m/proc/1/ns/mnt",
-	"/bin/bash", "-c", "echo b > /proc/sysrq-trigger")
-return rebootCmd.Run()
+// Use a bounded context so the reboot command can't hang indefinitely.
+ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+defer cancel()
+// #nosec G204 -- command and arguments are constant literals; no user input is interpolated.
+rebootCmd := exec.CommandContext(ctx, "/usr/bin/nsenter", "-m/proc/1/ns/mnt",
+	"/bin/bash", "-c", "echo b > /proc/sysrq-trigger")
+output, err := rebootCmd.CombinedOutput()
+if ctx.Err() == context.DeadlineExceeded {
+	return fmt.Errorf("reboot command timed out: %w; output: %s", err, strings.TrimSpace(string(output)))
+}
+if err != nil {
+	return fmt.Errorf("reboot command failed: %w; output: %s", err, strings.TrimSpace(string(output)))
+}
+return nil
```