Conversation
Pull request overview
Adds a periodic node-condition monitoring/remediation loop to the agent daemon, intended to detect the KernelDeadlock node condition and reboot the node when it occurs.
Changes:
- Start a new daemon loop that queries the Kubernetes Node object on a 1-minute cadence and triggers a reboot when `KernelDeadlock=True` after host boot time.
- Add helper functions to determine boot time, node name, and execute a reboot.
- Update Go module dependencies (go.mod/go.sum) to include additional indirect dependencies pulled in by the new Kubernetes client usage.
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| commands.go | Adds node condition monitoring/remediation loop and reboot helpers. |
| go.mod | Adds new indirect dependencies required by the new code paths. |
| go.sum | Records checksums for newly introduced/updated indirect dependencies. |
```go
wg.Add(1)
```
startDaemonLoops pre-increments the WaitGroup counter for the loops it starts, but startNodeConditionLoop also calls wg.Add(1) internally. This hidden increment is inconsistent with the other loops and makes it easier to accidentally introduce a WaitGroup misuse/panic later; consider moving the Add(1) into startDaemonLoops and removing it from here.
Suggested change:
```diff
-	wg.Add(1)
```
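A minimal sketch of the centralized bookkeeping. The loop names come from the review; the channel wiring and signatures here are illustrative, not the repo's actual API:

```go
package main

import (
	"fmt"
	"sync"
)

// startNodeConditionLoop no longer calls wg.Add itself; the caller owns the
// counter, matching the other daemon loops.
func startNodeConditionLoop(wg *sync.WaitGroup, stop <-chan struct{}) {
	defer wg.Done()
	<-stop // stands in for the real ticker loop body
}

// startDaemonLoops pairs every Add(1) with the go statement it protects,
// so the counter bookkeeping lives in exactly one place.
func startDaemonLoops(stop chan struct{}) *sync.WaitGroup {
	var wg sync.WaitGroup
	wg.Add(1)
	go startNodeConditionLoop(&wg, stop)
	return &wg
}

func main() {
	stop := make(chan struct{})
	wg := startDaemonLoops(stop)
	close(stop)
	wg.Wait()
	fmt.Println("all loops stopped")
}
```

Keeping Add(1) adjacent to the `go` statement makes it obvious the counter is balanced, so a future loop can't double-count or panic the WaitGroup.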
```go
// Load kubeconfig
config, err := clientcmd.BuildConfigFromFlags("", "/var/lib/kubelet/kubelet/kubeconfig")
if err != nil {
	logger.Errorf("failed to load kubeconfig: %s", err.Error())
	return
}

// Create Kubernetes clientset
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
	logger.Errorf("failed to create clientset: %s", err.Error())
	return
}
```
clientcmd.BuildConfigFromFlags and kubernetes.NewForConfig are executed on every tick. Since these typically only depend on local files and can be reused, consider creating the REST config/clientset once outside the ticker loop and reusing them (recreating only on failure) to reduce per-minute I/O and allocations.
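A sketch of the create-once, rebuild-on-failure pattern. The clientset here is a stub standing in for `clientcmd.BuildConfigFromFlags` plus `kubernetes.NewForConfig`; only the caching shape is the point:

```go
package main

import "fmt"

// builds counts constructions of the stubbed clientset, standing in for the
// kubeconfig load + NewForConfig pair the review suggests running once.
var builds int

type clientset struct{}

func newClientset() (*clientset, error) {
	builds++
	return &clientset{}, nil
}

// tick reuses the cached clientset and only drops it (forcing a rebuild on
// the next tick) when a call through it fails.
func tick(cached **clientset, callFails bool) error {
	if *cached == nil {
		cs, err := newClientset()
		if err != nil {
			return err // log and retry on the next tick
		}
		*cached = cs
	}
	if callFails {
		*cached = nil // recreate only on failure
		return fmt.Errorf("node get failed")
	}
	return nil
}

func main() {
	var cs *clientset
	for i := 0; i < 5; i++ {
		_ = tick(&cs, false)
	}
	fmt.Println("builds:", builds) // one build serves all five ticks
}
```

With this shape the per-minute tick does no file I/O or TLS setup in the steady state, and a transient failure still heals itself one tick later.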
```go
node, err := clientset.CoreV1().Nodes().Get(context.Background(), nodeName, metav1.GetOptions{})
if err != nil {
	logger.Errorf("failed to get node %s: %s", nodeName, err.Error())
}

hostBootTime, err := getBootTime()
if err != nil {
	logger.Errorf("failed to get host boot time: %s", err.Error())
	return
}

for _, condition := range node.Status.Conditions {
```
If the Nodes().Get call returns an error, node can be nil; the code then ranges over node.Status.Conditions, which can panic. Handle the error by returning/continuing before dereferencing node (or guard against nil).
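The guard can be sketched with stub types (these minimal structs stand in for the client-go Node types; the real loop would log and skip the tick rather than return a count):

```go
package main

import (
	"errors"
	"fmt"
)

// Minimal stand-ins for the Node types returned by the Kubernetes client.
type nodeStatus struct{ Conditions []string }
type node struct{ Status nodeStatus }

// getNode mimics clientset.CoreV1().Nodes().Get: on error the returned
// pointer is nil, so callers must not range over node.Status.Conditions.
func getNode(fail bool) (*node, error) {
	if fail {
		return nil, errors.New("apiserver unreachable")
	}
	return &node{Status: nodeStatus{Conditions: []string{"KernelDeadlock"}}}, nil
}

// countConditions returns early on error instead of dereferencing nil.
func countConditions(fail bool) int {
	n, err := getNode(fail)
	if err != nil {
		// In the daemon loop this would be logged, then continue to the next tick.
		return 0
	}
	return len(n.Status.Conditions)
}

func main() {
	fmt.Println(countConditions(true), countConditions(false)) // prints: 0 1
}
```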
commands.go (outdated)
```go
config, err := clientcmd.BuildConfigFromFlags("", "/var/lib/kubelet/kubelet/kubeconfig")
if err != nil {
	logger.Errorf("failed to load kubeconfig: %s", err.Error())
	return
}

// Create Kubernetes clientset
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
	logger.Errorf("failed to create clientset: %s", err.Error())
	return
}
```
These error paths return from the goroutine, which permanently stops node-condition monitoring after a transient failure (e.g., kubeconfig temporarily unavailable). Prefer logging and continuing to the next tick so the daemon can self-recover like the other loops in this file.
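The self-recovering loop shape can be sketched without the Kubernetes client at all; `fail` here stands in for a kubeconfig-load error on a given tick, and everything else is illustrative:

```go
package main

import "fmt"

// runTicks models the daemon loop: fail(i) reports whether loading the
// kubeconfig failed on tick i. A failure skips the tick instead of
// returning, so the loop self-recovers once the transient error clears.
func runTicks(n int, fail func(i int) bool) int {
	checked := 0
	for i := 0; i < n; i++ {
		if fail(i) {
			continue // log the error and wait for the next tick
		}
		checked++
	}
	return checked
}

func main() {
	// First two ticks fail (kubeconfig temporarily unavailable), rest succeed.
	got := runTicks(5, func(i int) bool { return i < 2 })
	fmt.Println("successful checks:", got) // prints: successful checks: 3
}
```

Had the failing tick used `return` instead of `continue`, the count would stay at zero forever even after the file reappeared.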
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 2 out of 3 changed files in this pull request and generated 4 comments.
```go
// Load kubeconfig
kubeConfig, err := clientcmd.BuildConfigFromFlags("", config.KubeletKubeconfigPath)
if err != nil {
```
config, err := clientcmd.BuildConfigFromFlags(..., "/var/lib/kubelet/kubelet/kubeconfig") hardcodes a path that already exists as config.KubeletKubeconfigPath and also introduces a local variable named config that shadows the imported pkg/config identifier. Use the shared constant and rename the local to something like kubeConfig to avoid confusion.
```go
func rebootNode() error {
	rebootCmd := exec.Command("/usr/bin/nsenter", "-m/proc/1/ns/mnt",
		"/bin/bash", "-c", "echo b > /proc/sysrq-trigger")

	return rebootCmd.Run()
```
Direct exec.Command usage here will be flagged by the repo’s enabled gosec linter (this codebase typically suppresses with #nosec or routes through pkg/utils helpers). Consider using utils.RunSystemCommand / RunCommandWithOutput (to preserve stderr for debugging) or add an explicit suppression comment with justification.
Pull request overview
Copilot reviewed 2 out of 3 changed files in this pull request and generated 5 comments.
```go
	return
}
```
In this loop, errors when loading kubeconfig cause a return, which stops the goroutine permanently and prevents any future node-condition checks. Consider logging the error and continuing to the next tick (or implementing a backoff) so transient failures don’t disable remediation for the lifetime of the agent process.
Suggested change:
```diff
-	return
-}
```
```go
// Get the node
node, err := clientset.CoreV1().Nodes().Get(context.Background(), nodeName, metav1.GetOptions{})
```
defer cancel() is inside a long-running for loop, so the cancels will be deferred until the goroutine exits (potentially never), leaking per-iteration resources. Scope the timeout context to a small inner function/block and call cancel() at the end of each iteration instead of deferring in the outer loop.
Suggested change:
```diff
-// Get the node
-node, err := clientset.CoreV1().Nodes().Get(context.Background(), nodeName, metav1.GetOptions{})
+// Get the node with a per-call context to avoid leaking resources across iterations
+ctx, cancel := context.WithCancel(context.Background())
+node, err := clientset.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
+cancel()
```
```go
// Get the node
node, err := clientset.CoreV1().Nodes().Get(context.Background(), nodeName, metav1.GetOptions{})
if err != nil {
	logger.Errorf("failed to get node %s: %s", nodeName, err.Error())
```
If the Node GET fails, the code logs the error but then continues and dereferences node.Status.Conditions, which will panic when node is nil. This should return/continue on error (and avoid attempting remediation) so the agent doesn’t crash the loop on API failures.
```go
func rebootNode() error {
	rebootCmd := exec.Command("/usr/bin/nsenter", "-m/proc/1/ns/mnt",
		"/bin/bash", "-c", "echo b > /proc/sysrq-trigger")

	return rebootCmd.Run()
}
```
This introduces an unconditional host reboot action (via sysrq-trigger) in the main daemon loop. Consider gating this behavior behind an explicit config/feature flag (similar to EnableDriftDetectionAndRemediation) and adding rate limiting/guardrails to reduce the risk of reboot loops or unexpected reboots in environments that don’t want automated power actions.
```go
rebootCmd := exec.Command("/usr/bin/nsenter", "-m/proc/1/ns/mnt",
	"/bin/bash", "-c", "echo b > /proc/sysrq-trigger")

return rebootCmd.Run()
```
Direct use of exec.Command here may trigger gosec (G204) and also provides no timeout or captured stderr/stdout for troubleshooting. Prefer using the repo’s command execution helper (e.g., pkg/utils.RunSystemCommand / RunCommandWithOutput) or add a scoped #nosec with justification and use exec.CommandContext with a timeout so failures are observable and the call can’t hang indefinitely.
Suggested change:
```diff
-rebootCmd := exec.Command("/usr/bin/nsenter", "-m/proc/1/ns/mnt",
-	"/bin/bash", "-c", "echo b > /proc/sysrq-trigger")
-return rebootCmd.Run()
+// Use a bounded context so the reboot command can't hang indefinitely.
+ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+defer cancel()
+// #nosec G204 -- command and arguments are constant literals; no user input is interpolated.
+rebootCmd := exec.CommandContext(ctx, "/usr/bin/nsenter", "-m/proc/1/ns/mnt",
+	"/bin/bash", "-c", "echo b > /proc/sysrq-trigger")
+output, err := rebootCmd.CombinedOutput()
+if ctx.Err() == context.DeadlineExceeded {
+	return fmt.Errorf("reboot command timed out: %w; output: %s", err, strings.TrimSpace(string(output)))
+}
+if err != nil {
+	return fmt.Errorf("reboot command failed: %w; output: %s", err, strings.TrimSpace(string(output)))
+}
+return nil
```