Skip to content

Possible inspiration for a randomx fix -- I tried it on 6.8 and 6.11 kernels and it didnt trigger on any of them, LLM states the fpu patterns is optimized away#749

Open
entlein wants to merge 3 commits intokubescape:mainfrom
k8sstormcenter:upstream/randomx-fix
Open

Possible inspiration for a randomx fix -- I tried it on 6.8 and 6.11 kernels and it didnt trigger on any of them, LLM states the fpu patterns is optimized away#749
entlein wants to merge 3 commits intokubescape:mainfrom
k8sstormcenter:upstream/randomx-fix

Conversation

@entlein
Copy link

@entlein entlein commented Mar 14, 2026

Overview

Summary by CodeRabbit

  • New Features

    • Enhanced crypto-miner detection with improved accuracy and sensitivity using advanced monitoring techniques.
  • Tests

    • Added comprehensive test cases for crypto-miner detection benchmarking and profile validation.
    • Added test deployment resources for crypto-miner scenario testing and validation.

@coderabbitai
Copy link

coderabbitai bot commented Mar 14, 2026

📝 Walkthrough

Walkthrough

The changes implement a sliding-window based two-signal detector for crypto-miner activity in the eBPF program, replacing simple event counting with threshold-based alerting. Tests are reorganized with new subtests and architecture guards, and a Kubernetes deployment manifest for crypto-miner testing is added.

Changes

Cohort / File(s) Summary
eBPF Detection Logic
pkg/ebpf/gadgets/randomx/program.bpf.c
Replaces per-mntns event counting with sliding-window (30s) two-signal detector. Introduces threshold-based alerting: sched_switch preemptions (10,000) and x86_fpu_regs_deactivated events (500,000). Adds per-mntns cache structure, helper functions maybe_reset_window() and maybe_alert(), and unified event emission via gadget framework.
Test Infrastructure Updates
tests/component_test.go
Refactors Test_10_MalwareDetectionTest into subtests: "empty_profile_rules" and "randomx_bench". Adds runtime import and architecture guard to skip randomx_bench on non-amd64 systems.
Test Deployment Configuration
tests/resources/crypto-miner-deployment.yaml
New Kubernetes Deployment manifest (k8s-miner-deployment) with crypto-miner container image, resource limits (2Gi memory, 2 CPU), and test profile label.

Sequence Diagram

sequenceDiagram
    participant Kernel as System Kernel
    participant eBPF as eBPF Program
    participant Cache as Per-mntns Cache
    participant Gadget as Gadget Framework
    
    Kernel->>eBPF: Trigger event (sched_switch or x86_fpu)
    eBPF->>Cache: Lookup cache for mntns_id
    Cache-->>eBPF: Return cache entry
    eBPF->>eBPF: maybe_reset_window()
    alt Window expired
        eBPF->>Cache: Reset counters & window_start
    end
    eBPF->>Cache: Increment signal counter
    eBPF->>eBPF: Check threshold crossed?
    alt Threshold exceeded
        eBPF->>eBPF: maybe_alert()
        eBPF->>Gadget: Emit alert event (process info, path, timestamp)
        Gadget-->>Gadget: Buffer and submit event
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Whiskers twitching with delight:
Two signals dance through sliding night,
Preemptions and registers aligned,
A crypto-miner cannot hide,
The window slides, the thresholds bind! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title is vague and conversational, describing the author's experimentation process rather than the actual change being made. It uses phrases like 'Possible inspiration' and references LLM commentary, which are not descriptive of the code modifications. Clarify the title to describe the actual changes: consider something like 'Refactor RandomX detection with sliding-window dual-signal thresholds' to reflect the core modification (sliding window, dual signals, threshold-based alerting).
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

entlein added 3 commits March 14, 2026 11:04
Signed-off-by: tanzee <einentlein@gmail.com>
… this is likely a longer story, but the tutorial is a 6.8 kernel and I need to make it work again

Signed-off-by: tanzee <einentlein@gmail.com>
Signed-off-by: tanzee <einentlein@gmail.com>
@entlein entlein force-pushed the upstream/randomx-fix branch from c7b3f9c to 9f609fc Compare March 14, 2026 10:04
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (5)
tests/resources/crypto-miner-deployment.yaml (1)

17-28: Consider adding securityContext for test hygiene (optional).

Static analysis flags that the container doesn't explicitly set allowPrivilegeEscalation: false or runAsNonRoot: true. While this may be intentional for crypto-miner detection testing (the miner might require certain capabilities), explicitly documenting the security posture is good practice for test manifests.

If the crypto-miner detection doesn't require elevated privileges:

♻️ Optional: Add explicit securityContext
     spec:
       containers:
       - name: k8s-miner
         image: docker.io/amitschendel/crypto-miner-1
         imagePullPolicy: Always
+        securityContext:
+          allowPrivilegeEscalation: false
+          # runAsNonRoot: true  # Uncomment if the image supports non-root
         resources:
           limits:
             cpu: "2"
             memory: "2Gi"

Otherwise, add a comment explaining why elevated privileges are needed for the test scenario.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/resources/crypto-miner-deployment.yaml` around lines 17 - 28, Add an
explicit securityContext for the k8s-miner container: under the container spec
for name "k8s-miner" add securityContext with allowPrivilegeEscalation: false
and runAsNonRoot: true (and optionally set readOnlyRootFilesystem: true) to
document a non‑privileged test posture; if the test genuinely requires elevated
privileges, instead add an inline comment adjacent to the "k8s-miner" container
explaining why elevated privileges or capabilities are needed for this scenario.
tests/component_test.go (2)

574-680: Resource cleanup missing in empty_profile_rules subtest.

The subtest creates a namespace via testutils.NewRandomNamespace() and a workload via testutils.NewTestWorkload(), but neither is cleaned up after the test completes. While random namespaces may be orphaned (depending on test framework teardown), explicitly deleting resources ensures consistent test isolation and avoids resource leaks in CI environments.

Consider adding cleanup via defer:

ns := testutils.NewRandomNamespace()
defer func() {
    // Cleanup namespace if needed
}()
// ...
wl, err := testutils.NewTestWorkload(ns.Name, ...)
require.NoError(t, err)
defer wl.Delete()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/component_test.go` around lines 574 - 680, The subtest leaks
resources—after calling testutils.NewRandomNamespace() and creating the workload
with testutils.NewTestWorkload(...) you must add deferred cleanup: immediately
after ns := NewRandomNamespace() defer removal of the namespace (e.g., call the
testutils namespace-delete/cleanup helper or ns.Delete if provided) and
immediately after wl is created defer wl.Delete() (or the workload teardown
method). Place the defers right after ns and wl are created so namespace and
workload are always removed even on test failure.

689-734: Architecture guard and resource cleanup for randomx_bench subtest.

  1. LGTM on architecture guard (lines 690-692): Correctly skips the test on non-amd64 architectures since the randomx tracer is x86_64 only.

  2. Resource cleanup missing: Similar to empty_profile_rules, this subtest creates a namespace and workload but doesn't clean them up.

  3. Potential test flakiness: The test waits up to 120 seconds for the R1007 alert. Given the PR description mentions "the randomx tracer really doesn't trigger on certain kernels" and "FPU patterns may be optimized away," consider:

    • Adding more diagnostic logging if the test fails
    • Documenting kernel version requirements in the test comments
♻️ Suggested cleanup pattern
 t.Run("randomx_bench", func(t *testing.T) {
   if runtime.GOARCH != "amd64" {
     t.Skip("randomx tracer is x86_64 only")
   }

   ns := testutils.NewRandomNamespace()
+  defer func() {
+    // Namespace cleanup handled by test framework or explicit delete
+  }()

   wl, err := testutils.NewTestWorkload(ns.Name,
     path.Join(utils.CurrentDir(), "resources/crypto-miner-deployment.yaml"))
   require.NoError(t, err)
+  defer wl.Delete()
   require.NoError(t, wl.WaitForReady(80))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/component_test.go` around lines 689 - 734, The subtest creates
resources with testutils.NewRandomNamespace() and testutils.NewTestWorkload(...)
but doesn't clean them up and could use extra diagnostics; add resource teardown
(e.g. defer testutils.DeleteNamespace(ns.Name) and defer
wl.Delete()/wl.Cleanup() or the equivalent cleanup functions provided by the
workload/namespace types immediately after ns and wl are created), add
additional diagnostic logs before the final assertion (log kernel version and
the raw alerts from testutils.GetAlerts(ns.Name) when the Eventually block
fails), and add a short comment near the top of the "randomx_bench" test
documenting kernel/version expectations for the RandomX tracer so future
failures are clearer.
pkg/ebpf/gadgets/randomx/program.bpf.c (2)

152-158: Lost event on window reset: current preemption is not counted.

When maybe_reset_window returns true (line 152-153), the function returns immediately without incrementing preempt_count for the current event. This means one preemption event per window reset is silently discarded.

While this has negligible impact given the 10,000 event threshold, it's a subtle inconsistency. Consider incrementing the counter before checking for window reset, or incrementing after reset:

♻️ Optional fix to count the triggering event
     if (maybe_reset_window(cache, now, mntns_id))
-        return 0;
+    {
+        // Window was reset; count this event in the new window.
+        cache->preempt_count = 1;
+        bpf_map_update_elem(&mntns_event_count, &mntns_id, cache, BPF_ANY);
+        return 0;
+    }

     cache->preempt_count++;

Same applies to fpu_count on line 191-195.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/ebpf/gadgets/randomx/program.bpf.c` around lines 152 - 158, The current
code returns immediately when maybe_reset_window(cache, now, mntns_id) is true,
which drops the current event because cache->preempt_count is only incremented
after that check; update the logic so the triggering event is counted: increment
cache->preempt_count before calling maybe_reset_window (or, if you prefer
minimal change, call maybe_reset_window and then increment cache->preempt_count
even when a reset occurred), then persist with
bpf_map_update_elem(&mntns_event_count, &mntns_id, cache, BPF_ANY) and call
maybe_alert; apply the same fix for fpu_count where the reset path currently
skips counting the triggering FPU event (referencing fpu_count,
maybe_reset_window, and the mntns_event_count update).

72-83: Window reset doesn't clear the alerted flag — verify this is intentional.

When maybe_reset_window resets the sliding window, it clears fpu_count and preempt_count but does NOT clear the alerted flag. This means once a container triggers an alert, it will never alert again (even across multiple 30-second windows) until the LRU entry is evicted.

Given the comment on line 98-99 ("no further events are emitted for this container"), this appears intentional, but the behavior should be documented in the function comment:

📝 Suggested documentation clarification
 // ---------------------------------------------------------------------------
 // Helper: reset the sliding window if it expired.
-// Returns true if the window was reset (caller should return 0 early).
+// Returns true if the window was reset (caller should return 0 early).
+// NOTE: Does NOT reset the alerted flag — once a container triggers an alert,
+// it is silenced until the LRU entry is evicted. This is intentional to avoid
+// flooding the event buffer with repeated alerts from the same miner.
 // ---------------------------------------------------------------------------
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/ebpf/gadgets/randomx/program.bpf.c` around lines 72 - 83, The window
reset currently clears cache->window_start, fpu_count and preempt_count in
maybe_reset_window but leaves cache->alerted set, which causes a container to
never re-alert until eviction; either explicitly clear cache->alerted in
maybe_reset_window (resetting alert state each WINDOW_NS) or add a clear,
explicit comment above maybe_reset_window explaining that alerted is
intentionally persistent across windows to suppress further events until LRU
eviction; refer to maybe_reset_window, struct mntns_cache, the alerted field,
mntns_event_count and WINDOW_NS when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/component_test.go`:
- Around line 544-546: The test block ending with the unmatched `})` needs to be
wrapped in a subtest to match the other tests; wrap the initial block that
contains the malware assertions with t.Run("malware_signature", func(t
*testing.T) { ... }) so the opening `t.Run` encloses the code that currently
ends with the stray `})` (and remove any extra/mismatched brace), ensuring the
assertion line assert.Equal(...) and its surrounding setup live inside the new
t.Run closure and match the subsequent subtest structure like
`t.Run("empty_profile_rules", func(t *testing.T) {...})`.

---

Nitpick comments:
In `@pkg/ebpf/gadgets/randomx/program.bpf.c`:
- Around line 152-158: The current code returns immediately when
maybe_reset_window(cache, now, mntns_id) is true, which drops the current event
because cache->preempt_count is only incremented after that check; update the
logic so the triggering event is counted: increment cache->preempt_count before
calling maybe_reset_window (or, if you prefer minimal change, call
maybe_reset_window and then increment cache->preempt_count even when a reset
occurred), then persist with bpf_map_update_elem(&mntns_event_count, &mntns_id,
cache, BPF_ANY) and call maybe_alert; apply the same fix for fpu_count where the
reset path currently skips counting the triggering FPU event (referencing
fpu_count, maybe_reset_window, and the mntns_event_count update).
- Around line 72-83: The window reset currently clears cache->window_start,
fpu_count and preempt_count in maybe_reset_window but leaves cache->alerted set,
which causes a container to never re-alert until eviction; either explicitly
clear cache->alerted in maybe_reset_window (resetting alert state each
WINDOW_NS) or add a clear, explicit comment above maybe_reset_window explaining
that alerted is intentionally persistent across windows to suppress further
events until LRU eviction; refer to maybe_reset_window, struct mntns_cache, the
alerted field, mntns_event_count and WINDOW_NS when making the change.

In `@tests/component_test.go`:
- Around line 574-680: The subtest leaks resources—after calling
testutils.NewRandomNamespace() and creating the workload with
testutils.NewTestWorkload(...) you must add deferred cleanup: immediately after
ns := NewRandomNamespace() defer removal of the namespace (e.g., call the
testutils namespace-delete/cleanup helper or ns.Delete if provided) and
immediately after wl is created defer wl.Delete() (or the workload teardown
method). Place the defers right after ns and wl are created so namespace and
workload are always removed even on test failure.
- Around line 689-734: The subtest creates resources with
testutils.NewRandomNamespace() and testutils.NewTestWorkload(...) but doesn't
clean them up and could use extra diagnostics; add resource teardown (e.g. defer
testutils.DeleteNamespace(ns.Name) and defer wl.Delete()/wl.Cleanup() or the
equivalent cleanup functions provided by the workload/namespace types
immediately after ns and wl are created), add additional diagnostic logs before
the final assertion (log kernel version and the raw alerts from
testutils.GetAlerts(ns.Name) when the Eventually block fails), and add a short
comment near the top of the "randomx_bench" test documenting kernel/version
expectations for the RandomX tracer so future failures are clearer.

In `@tests/resources/crypto-miner-deployment.yaml`:
- Around line 17-28: Add an explicit securityContext for the k8s-miner
container: under the container spec for name "k8s-miner" add securityContext
with allowPrivilegeEscalation: false and runAsNonRoot: true (and optionally set
readOnlyRootFilesystem: true) to document a non‑privileged test posture; if the
test genuinely requires elevated privileges, instead add an inline comment
adjacent to the "k8s-miner" container explaining why elevated privileges or
capabilities are needed for this scenario.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0fabd740-fb69-4dc7-a3b8-0a9eb4281037

📥 Commits

Reviewing files that changed from the base of the PR and between f673d45 and 9f609fc.

📒 Files selected for processing (3)
  • pkg/ebpf/gadgets/randomx/program.bpf.c
  • tests/component_test.go
  • tests/resources/crypto-miner-deployment.yaml

Comment on lines +544 to +546
assert.Equal(t, len(expectedMalwares), len(malwaresDetected),
"Expected %d malwares to be detected, but got %d", len(expectedMalwares), len(malwaresDetected))
})
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the structure of Test_10_MalwareDetectionTest to see if the first subtest is properly defined
rg -n -B5 -A2 'func Test_10_MalwareDetectionTest' tests/component_test.go
# Also check for t.Run calls within this test
rg -n 't\.Run\(' tests/component_test.go | grep -A1 -B1 '10_Malware'

Repository: kubescape/node-agent

Length of output: 352


🏁 Script executed:

sed -n '498,550p' tests/component_test.go

Repository: kubescape/node-agent

Length of output: 2181


🏁 Script executed:

sed -n '546,570p' tests/component_test.go

Repository: kubescape/node-agent

Length of output: 1369


🏁 Script executed:

sed -n '546,620p' tests/component_test.go | head -80

Repository: kubescape/node-agent

Length of output: 3440


Syntax error: Unmatched closing brace }) on line 546.

The inline test code at lines 499-545 attempts to close with }) on line 546, but there is no matching opening t.Run( call. The first test block should be wrapped with t.Run("malware_signature", func(t *testing.T) { ... }) to maintain consistent test organization and fix the compilation error. The subsequent subtests (empty_profile_rules on line 577 and randomx_bench) are properly structured with t.Run, but the initial test block is missing this wrapper.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/component_test.go` around lines 544 - 546, The test block ending with
the unmatched `})` needs to be wrapped in a subtest to match the other tests;
wrap the initial block that contains the malware assertions with
t.Run("malware_signature", func(t *testing.T) { ... }) so the opening `t.Run`
encloses the code that currently ends with the stray `})` (and remove any
extra/mismatched brace), ensuring the assertion line assert.Equal(...) and its
surrounding setup live inside the new t.Run closure and match the subsequent
subtest structure like `t.Run("empty_profile_rules", func(t *testing.T) {...})`.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: WIP

Development

Successfully merging this pull request may close these issues.

2 participants