Possible inspiration for a randomx fix -- I tried it on 6.8 and 6.11 kernels and it didnt trigger on any of them, LLM states the fpu patterns is optimized away#749
Conversation
📝 WalkthroughWalkthroughThe changes implement a sliding-window based two-signal detector for crypto-miner activity in the eBPF program, replacing simple event counting with threshold-based alerting. Tests are reorganized with new subtests and architecture guards, and a Kubernetes deployment manifest for crypto-miner testing is added. Changes
Sequence DiagramsequenceDiagram
participant Kernel as System Kernel
participant eBPF as eBPF Program
participant Cache as Per-mntns Cache
participant Gadget as Gadget Framework
Kernel->>eBPF: Trigger event (sched_switch or x86_fpu)
eBPF->>Cache: Lookup cache for mntns_id
Cache-->>eBPF: Return cache entry
eBPF->>eBPF: maybe_reset_window()
alt Window expired
eBPF->>Cache: Reset counters & window_start
end
eBPF->>Cache: Increment signal counter
eBPF->>eBPF: Check threshold crossed?
alt Threshold exceeded
eBPF->>eBPF: maybe_alert()
eBPF->>Gadget: Emit alert event (process info, path, timestamp)
Gadget-->>Gadget: Buffer and submit event
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Signed-off-by: tanzee <einentlein@gmail.com>
… this is likely a longer story, but the tutorial is a 6.8 kernel and I need to make it work again Signed-off-by: tanzee <einentlein@gmail.com>
Signed-off-by: tanzee <einentlein@gmail.com>
c7b3f9c to
9f609fc
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (5)
tests/resources/crypto-miner-deployment.yaml (1)
17-28: Consider adding securityContext for test hygiene (optional).Static analysis flags that the container doesn't explicitly set
allowPrivilegeEscalation: falseorrunAsNonRoot: true. While this may be intentional for crypto-miner detection testing (the miner might require certain capabilities), explicitly documenting the security posture is good practice for test manifests.If the crypto-miner detection doesn't require elevated privileges:
♻️ Optional: Add explicit securityContext
spec: containers: - name: k8s-miner image: docker.io/amitschendel/crypto-miner-1 imagePullPolicy: Always + securityContext: + allowPrivilegeEscalation: false + # runAsNonRoot: true # Uncomment if the image supports non-root resources: limits: cpu: "2" memory: "2Gi"Otherwise, add a comment explaining why elevated privileges are needed for the test scenario.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/resources/crypto-miner-deployment.yaml` around lines 17 - 28, Add an explicit securityContext for the k8s-miner container: under the container spec for name "k8s-miner" add securityContext with allowPrivilegeEscalation: false and runAsNonRoot: true (and optionally set readOnlyRootFilesystem: true) to document a non‑privileged test posture; if the test genuinely requires elevated privileges, instead add an inline comment adjacent to the "k8s-miner" container explaining why elevated privileges or capabilities are needed for this scenario.tests/component_test.go (2)
574-680: Resource cleanup missing inempty_profile_rulessubtest.The subtest creates a namespace via
testutils.NewRandomNamespace()and a workload viatestutils.NewTestWorkload(), but neither is cleaned up after the test completes. While random namespaces may be orphaned (depending on test framework teardown), explicitly deleting resources ensures consistent test isolation and avoids resource leaks in CI environments.Consider adding cleanup via
defer:ns := testutils.NewRandomNamespace() defer func() { // Cleanup namespace if needed }() // ... wl, err := testutils.NewTestWorkload(ns.Name, ...) require.NoError(t, err) defer wl.Delete()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/component_test.go` around lines 574 - 680, The subtest leaks resources—after calling testutils.NewRandomNamespace() and creating the workload with testutils.NewTestWorkload(...) you must add deferred cleanup: immediately after ns := NewRandomNamespace() defer removal of the namespace (e.g., call the testutils namespace-delete/cleanup helper or ns.Delete if provided) and immediately after wl is created defer wl.Delete() (or the workload teardown method). Place the defers right after ns and wl are created so namespace and workload are always removed even on test failure.
689-734: Architecture guard and resource cleanup forrandomx_benchsubtest.
LGTM on architecture guard (lines 690-692): Correctly skips the test on non-amd64 architectures since the randomx tracer is x86_64 only.
Resource cleanup missing: Similar to
empty_profile_rules, this subtest creates a namespace and workload but doesn't clean them up.Potential test flakiness: The test waits up to 120 seconds for the R1007 alert. Given the PR description mentions "the randomx tracer really doesn't trigger on certain kernels" and "FPU patterns may be optimized away," consider:
- Adding more diagnostic logging if the test fails
- Documenting kernel version requirements in the test comments
♻️ Suggested cleanup pattern
t.Run("randomx_bench", func(t *testing.T) { if runtime.GOARCH != "amd64" { t.Skip("randomx tracer is x86_64 only") } ns := testutils.NewRandomNamespace() + defer func() { + // Namespace cleanup handled by test framework or explicit delete + }() wl, err := testutils.NewTestWorkload(ns.Name, path.Join(utils.CurrentDir(), "resources/crypto-miner-deployment.yaml")) require.NoError(t, err) + defer wl.Delete() require.NoError(t, wl.WaitForReady(80))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/component_test.go` around lines 689 - 734, The subtest creates resources with testutils.NewRandomNamespace() and testutils.NewTestWorkload(...) but doesn't clean them up and could use extra diagnostics; add resource teardown (e.g. defer testutils.DeleteNamespace(ns.Name) and defer wl.Delete()/wl.Cleanup() or the equivalent cleanup functions provided by the workload/namespace types immediately after ns and wl are created), add additional diagnostic logs before the final assertion (log kernel version and the raw alerts from testutils.GetAlerts(ns.Name) when the Eventually block fails), and add a short comment near the top of the "randomx_bench" test documenting kernel/version expectations for the RandomX tracer so future failures are clearer.pkg/ebpf/gadgets/randomx/program.bpf.c (2)
152-158: Lost event on window reset: current preemption is not counted.When
maybe_reset_windowreturnstrue(line 152-153), the function returns immediately without incrementingpreempt_countfor the current event. This means one preemption event per window reset is silently discarded.While this has negligible impact given the 10,000 event threshold, it's a subtle inconsistency. Consider incrementing the counter before checking for window reset, or incrementing after reset:
♻️ Optional fix to count the triggering event
if (maybe_reset_window(cache, now, mntns_id)) - return 0; + { + // Window was reset; count this event in the new window. + cache->preempt_count = 1; + bpf_map_update_elem(&mntns_event_count, &mntns_id, cache, BPF_ANY); + return 0; + } cache->preempt_count++;Same applies to
fpu_counton line 191-195.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/ebpf/gadgets/randomx/program.bpf.c` around lines 152 - 158, The current code returns immediately when maybe_reset_window(cache, now, mntns_id) is true, which drops the current event because cache->preempt_count is only incremented after that check; update the logic so the triggering event is counted: increment cache->preempt_count before calling maybe_reset_window (or, if you prefer minimal change, call maybe_reset_window and then increment cache->preempt_count even when a reset occurred), then persist with bpf_map_update_elem(&mntns_event_count, &mntns_id, cache, BPF_ANY) and call maybe_alert; apply the same fix for fpu_count where the reset path currently skips counting the triggering FPU event (referencing fpu_count, maybe_reset_window, and the mntns_event_count update).
72-83: Window reset doesn't clear thealertedflag — verify this is intentional.When
maybe_reset_windowresets the sliding window, it clearsfpu_countandpreempt_countbut does NOT clear thealertedflag. This means once a container triggers an alert, it will never alert again (even across multiple 30-second windows) until the LRU entry is evicted.Given the comment on line 98-99 ("no further events are emitted for this container"), this appears intentional, but the behavior should be documented in the function comment:
📝 Suggested documentation clarification
// --------------------------------------------------------------------------- // Helper: reset the sliding window if it expired. -// Returns true if the window was reset (caller should return 0 early). +// Returns true if the window was reset (caller should return 0 early). +// NOTE: Does NOT reset the alerted flag — once a container triggers an alert, +// it is silenced until the LRU entry is evicted. This is intentional to avoid +// flooding the event buffer with repeated alerts from the same miner. // ---------------------------------------------------------------------------🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/ebpf/gadgets/randomx/program.bpf.c` around lines 72 - 83, The window reset currently clears cache->window_start, fpu_count and preempt_count in maybe_reset_window but leaves cache->alerted set, which causes a container to never re-alert until eviction; either explicitly clear cache->alerted in maybe_reset_window (resetting alert state each WINDOW_NS) or add a clear, explicit comment above maybe_reset_window explaining that alerted is intentionally persistent across windows to suppress further events until LRU eviction; refer to maybe_reset_window, struct mntns_cache, the alerted field, mntns_event_count and WINDOW_NS when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/component_test.go`:
- Around line 544-546: The test block ending with the unmatched `})` needs to be
wrapped in a subtest to match the other tests; wrap the initial block that
contains the malware assertions with t.Run("malware_signature", func(t
*testing.T) { ... }) so the opening `t.Run` encloses the code that currently
ends with the stray `})` (and remove any extra/mismatched brace), ensuring the
assertion line assert.Equal(...) and its surrounding setup live inside the new
t.Run closure and match the subsequent subtest structure like
`t.Run("empty_profile_rules", func(t *testing.T) {...})`.
---
Nitpick comments:
In `@pkg/ebpf/gadgets/randomx/program.bpf.c`:
- Around line 152-158: The current code returns immediately when
maybe_reset_window(cache, now, mntns_id) is true, which drops the current event
because cache->preempt_count is only incremented after that check; update the
logic so the triggering event is counted: increment cache->preempt_count before
calling maybe_reset_window (or, if you prefer minimal change, call
maybe_reset_window and then increment cache->preempt_count even when a reset
occurred), then persist with bpf_map_update_elem(&mntns_event_count, &mntns_id,
cache, BPF_ANY) and call maybe_alert; apply the same fix for fpu_count where the
reset path currently skips counting the triggering FPU event (referencing
fpu_count, maybe_reset_window, and the mntns_event_count update).
- Around line 72-83: The window reset currently clears cache->window_start,
fpu_count and preempt_count in maybe_reset_window but leaves cache->alerted set,
which causes a container to never re-alert until eviction; either explicitly
clear cache->alerted in maybe_reset_window (resetting alert state each
WINDOW_NS) or add a clear, explicit comment above maybe_reset_window explaining
that alerted is intentionally persistent across windows to suppress further
events until LRU eviction; refer to maybe_reset_window, struct mntns_cache, the
alerted field, mntns_event_count and WINDOW_NS when making the change.
In `@tests/component_test.go`:
- Around line 574-680: The subtest leaks resources—after calling
testutils.NewRandomNamespace() and creating the workload with
testutils.NewTestWorkload(...) you must add deferred cleanup: immediately after
ns := NewRandomNamespace() defer removal of the namespace (e.g., call the
testutils namespace-delete/cleanup helper or ns.Delete if provided) and
immediately after wl is created defer wl.Delete() (or the workload teardown
method). Place the defers right after ns and wl are created so namespace and
workload are always removed even on test failure.
- Around line 689-734: The subtest creates resources with
testutils.NewRandomNamespace() and testutils.NewTestWorkload(...) but doesn't
clean them up and could use extra diagnostics; add resource teardown (e.g. defer
testutils.DeleteNamespace(ns.Name) and defer wl.Delete()/wl.Cleanup() or the
equivalent cleanup functions provided by the workload/namespace types
immediately after ns and wl are created), add additional diagnostic logs before
the final assertion (log kernel version and the raw alerts from
testutils.GetAlerts(ns.Name) when the Eventually block fails), and add a short
comment near the top of the "randomx_bench" test documenting kernel/version
expectations for the RandomX tracer so future failures are clearer.
In `@tests/resources/crypto-miner-deployment.yaml`:
- Around line 17-28: Add an explicit securityContext for the k8s-miner
container: under the container spec for name "k8s-miner" add securityContext
with allowPrivilegeEscalation: false and runAsNonRoot: true (and optionally set
readOnlyRootFilesystem: true) to document a non‑privileged test posture; if the
test genuinely requires elevated privileges, instead add an inline comment
adjacent to the "k8s-miner" container explaining why elevated privileges or
capabilities are needed for this scenario.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0fabd740-fb69-4dc7-a3b8-0a9eb4281037
📒 Files selected for processing (3)
pkg/ebpf/gadgets/randomx/program.bpf.ctests/component_test.gotests/resources/crypto-miner-deployment.yaml
| assert.Equal(t, len(expectedMalwares), len(malwaresDetected), | ||
| "Expected %d malwares to be detected, but got %d", len(expectedMalwares), len(malwaresDetected)) | ||
| }) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check the structure of Test_10_MalwareDetectionTest to see if the first subtest is properly defined
rg -n -B5 -A2 'func Test_10_MalwareDetectionTest' tests/component_test.go
# Also check for t.Run calls within this test
rg -n 't\.Run\(' tests/component_test.go | grep -A1 -B1 '10_Malware'Repository: kubescape/node-agent
Length of output: 352
🏁 Script executed:
sed -n '498,550p' tests/component_test.goRepository: kubescape/node-agent
Length of output: 2181
🏁 Script executed:
sed -n '546,570p' tests/component_test.goRepository: kubescape/node-agent
Length of output: 1369
🏁 Script executed:
sed -n '546,620p' tests/component_test.go | head -80Repository: kubescape/node-agent
Length of output: 3440
Syntax error: Unmatched closing brace }) on line 546.
The inline test code at lines 499-545 attempts to close with }) on line 546, but there is no matching opening t.Run( call. The first test block should be wrapped with t.Run("malware_signature", func(t *testing.T) { ... }) to maintain consistent test organization and fix the compilation error. The subsequent subtests (empty_profile_rules on line 577 and randomx_bench) are properly structured with t.Run, but the initial test block is missing this wrapper.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/component_test.go` around lines 544 - 546, The test block ending with
the unmatched `})` needs to be wrapped in a subtest to match the other tests;
wrap the initial block that contains the malware assertions with
t.Run("malware_signature", func(t *testing.T) { ... }) so the opening `t.Run`
encloses the code that currently ends with the stray `})` (and remove any
extra/mismatched brace), ensuring the assertion line assert.Equal(...) and its
surrounding setup live inside the new t.Run closure and match the subsequent
subtest structure like `t.Run("empty_profile_rules", func(t *testing.T) {...})`.
Overview
Summary by CodeRabbit
New Features
Tests