Skip to content

test(security): add Brev E2E tests for command injection and credential sanitization#1092

Merged
cv merged 26 commits intoNVIDIA:mainfrom
jyaunches:feat/security-e2e-tests
Mar 30, 2026
Merged

test(security): add Brev E2E tests for command injection and credential sanitization#1092
cv merged 26 commits intoNVIDIA:mainfrom
jyaunches:feat/security-e2e-tests

Conversation

@jyaunches
Copy link
Copy Markdown
Contributor

@jyaunches jyaunches commented Mar 30, 2026

Summary

Adds an ephemeral Brev-based E2E test infrastructure for running security test suites against a live NemoClaw sandbox. Validates the fixes from PR #584 (command injection) and PR #743 (credential sanitization) end-to-end.

Validated on fork: Run #23668176481 — 24/24 credential sanitization + 18/18 telegram injection ✅

Depends on: #1090 (setup.sh CI improvements — timestamped output, NEMOCLAW_SANDBOX_NAME, --no-tty, base image fallback)


What already existed in the repo

  • nightly-e2e.yaml — runs test/e2e/test-full-e2e.sh nightly on GitHub-hosted runners. Tests the full install → onboard → inference journey. Runs directly on the runner (no remote instance).
  • test/e2e/test-full-e2e.sh — the existing full E2E script testing the complete user journey.
  • vitest.config.ts — had cli and plugin projects.

What this adds

New files

File Purpose
.github/workflows/e2e-brev.yaml Workflow that provisions an ephemeral Brev cloud instance, bootstraps NemoClaw over SSH, runs test suites remotely, reports results to PRs, and optionally keeps the instance alive for debugging.
test/e2e/brev-e2e.test.js Vitest orchestrator: Brev auth → create instance → SSH wait → code sync → bootstrap → nemoclaw CLI install → sandbox registry → run remote tests → cleanup.
test/e2e/test-credential-sanitization.sh 24 tests for PR #743: credential stripping from migration bundles (C1–C5), runtime sandbox credential checks (C6–C7), symlink traversal protection (C8), blueprint digest verification (C9–C11), pattern-based credential field detection (C12–C13), and shipped blueprint state (Phase 6).
test/e2e/test-telegram-injection.sh 18 tests for PR #584: $(cmd) substitution (T1), backtick injection (T2), quote breakout (T3), ${VAR} expansion (T4), process table leak checks (T5), SANDBOX_NAME validation with shell metacharacters (T6–T7), and normal message regression (T8).

Modified files

File Change
vitest.config.ts Added e2e-brev project, gated by BREV_API_TOKEN env var so it never runs in normal npx vitest.

Key design decisions

  • Ephemeral instances — no persistent state, no idle cost. Instance is created fresh for each run and destroyed after (unless keep_alive: true).
  • CPU-only (--cpu 4x16) — OpenShell gateway and sandbox work without GPU. Security tests don't need inference.
  • Secrets via stdinNVIDIA_API_KEY and GITHUB_TOKEN are piped to the remote VM through SSH stdin, never appearing in CLI args or process listings.
  • Sandbox registry workaroundsetup.sh creates sandboxes via openshell directly without writing to ~/.nemoclaw/sandboxes.json. The test orchestrator writes the registry entry after bootstrap so nemoclaw <name> status works.
  • full suite isolationfull runs install.sh --non-interactive which destroys and recreates the sandbox. The security tests need the existing sandbox, so all runs only credential-sanitization + telegram-injection (not full).

What needs to be in the repo to run this

Secrets (configured by repo admin)

Secret Description
BREV_API_TOKEN Brev refresh token for headless NGC auth. Obtain via brev login, then copy the refresh_token from ~/.brev/credentials.json.
NVIDIA_API_KEY NVIDIA API key for inference provider setup during sandbox creation.

Infrastructure

  • Brev organization with CPU instance credits (currently Nemoclaw CI/CD).
  • Brev CLI is pinned to v0.6.310 in the workflow (v0.6.322 removed the --cpu flag).

How a maintainer uses this

Running from the Actions UI

  1. Go to Actions → e2e-brev → Run workflow
  2. Select inputs:
    • branch: which branch to test
    • test_suite: see descriptions in the workflow header
      • full — complete user journey: install → onboard → inference → CLI ops (~10 min, destroys sandbox)
      • credential-sanitization — 24 tests for migration credential stripping, digest verification, symlink safety
      • telegram-injection — 18 tests for shell injection prevention across all attack vectors
      • all — runs both security suites (not full, which would destroy the sandbox they need)
    • keep_alive: true to SSH in after, false to auto-cleanup
  3. Results appear in the workflow run logs. If pr_number is set, a check run and comment are posted to the PR.

Calling from another workflow

jobs:
  security-e2e:
    uses: ./.github/workflows/e2e-brev.yaml
    with:
      branch: ${{ github.head_ref }}
      pr_number: ${{ github.event.pull_request.number }}
      test_suite: all
    secrets:
      BREV_API_TOKEN: ${{ secrets.BREV_API_TOKEN }}
      NVIDIA_API_KEY: ${{ secrets.NVIDIA_API_KEY }}

SSH debugging (keep_alive: true)

When keep_alive is enabled, the instance stays running after tests. The workflow logs and PR comment include connection instructions:

# Refresh local SSH config to pick up the Brev instance
brev refresh

# SSH into the instance
ssh e2e-pr-<number>

# On the instance — re-run a test manually
cd ~/nemoclaw
export NVIDIA_API_KEY=nvapi-...
export NEMOCLAW_SANDBOX_NAME=e2e-test
bash test/e2e/test-credential-sanitization.sh

# When done, clean up
brev delete e2e-pr-<number>

⚠️ Remember to delete the instance when done. Instances left running consume Brev credits. The workflow does NOT auto-delete when keep_alive: true.


Adding new test suites

To add a new E2E test suite:

  1. Create a test script in test/e2e/ following the existing pattern (pass()/fail()/skip() functions, section() for phases, prerequisite checks in Phase 0, summary with pass/fail counts).

  2. Add the suite to brev-e2e.test.js — add a new it.runIf(...) block:

    it.runIf(TEST_SUITE === "my-new-suite" || TEST_SUITE === "all")(
      "my new suite passes on remote VM",
      () => {
        const output = runRemoteTest("test/e2e/test-my-new-suite.sh");
        expect(output).toContain("PASS");
        expect(output).not.toMatch(/FAIL:/);
      },
      600_000, // timeout in ms
    );
  3. Add the choice to the workflow — add your suite name to the options list in e2e-brev.yaml under test_suite, and add a description to the workflow header comment.

  4. Note: If your test needs the sandbox created during beforeAll, add it to the all group. If it creates/destroys its own sandbox (like full), keep it separate.


Files changed

 .github/workflows/e2e-brev.yaml          |  39 +-
 scripts/brev-setup.sh                    |  11 +-
 scripts/setup.sh                         |  52 +-
 test/e2e/brev-e2e.test.js                | 246 ++++++--
 test/e2e/test-credential-sanitization.sh | 789 +++++++++++++++++++++++++++
 test/e2e/test-telegram-injection.sh      | 464 ++++++++++++++++
 6 files changed, 1563 insertions(+), 38 deletions(-)

Summary by CodeRabbit

  • New Features

    • Optional launchable mode for remote test environments
    • New telegram-injection and credential-sanitization end-to-end test suites
  • Improvements

    • Timestamped logging for clearer diagnostics
    • Increased e2e job and bootstrap timeouts
    • Real-time sandbox build progress reporting and streaming test output
    • Prefer configurable sandbox name and improved Docker base-image handling
    • Optional skip of vLLM install during provisioning
  • Tests

    • Updated test orchestration and suite selection behavior

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 30, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds optional NemoClaw "launchable" provisioning (USE_LAUNCHABLE), expands CI workflow inputs (adds use_launchable and telegram-injection), increases e2e job timeout, introduces timestamped logging and build progress reporting, adds SKIP_VLLM control, and adds two end-to-end tests for credential sanitization and Telegram injection.

Changes

Cohort / File(s) Summary
Workflow Configuration
​.github/workflows/e2e-brev.yaml
Added use_launchable boolean input to workflow_dispatch and workflow_call (default true); added telegram-injection to test_suite choices; increased job timeout 45→90 minutes; pass USE_LAUNCHABLE into remote env.
Brev / Sandbox setup scripts
scripts/brev-setup.sh, scripts/setup.sh
Added _ts() timestamp helper and switched log prefixes to timestamped format; added SKIP_VLLM gating to skip vLLM install/start; changed SANDBOX_NAME precedence to prefer NEMOCLAW_SANDBOX_NAME; added BASE_IMAGE pull/build fallback and background build progress reporter; modified openshell sandbox create invocation.
E2E test harness
test/e2e/brev-e2e.test.js
Introduced USE_LAUNCHABLE-driven provisioning path (launchable boot vs legacy brev create), rsync/install/nemoclaw onboard flow for launchable, SSH stream mode with remote tee to /tmp/test-output.log, extended bootstrap timeouts, and adjusted test-suite selection (new telegram-injection suite; full runs only when TEST_SUITE === "full").
New E2E tests
test/e2e/test-credential-sanitization.sh, test/e2e/test-telegram-injection.sh
Added two Bash e2e tests: credential sanitization (bundle sanitization, symlink safety, blueprint digest checks) and Telegram injection (multiple injection probes, key-leak checks, sandbox-name hardening, PASS/FAIL/SKIP reporting).
Unit test update
test/setup-sandbox-name.test.js
Updated expectations to match new SANDBOX_NAME precedence: $1NEMOCLAW_SANDBOX_NAME env → literal nemoclaw.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant CI as CI Runner
    participant Remote as Remote VM
    participant Launch as NemoClaw Launchable
    participant Sandbox as Sandbox

    CI->>Remote: SSH to provision (send USE_LAUNCHABLE)
    alt USE_LAUNCHABLE=1
        Remote->>Launch: run launch-nemoclaw.sh (request provisioning)
        Launch-->>Remote: stream provisioning status
        loop Poll readiness
            CI->>Remote: check readiness marker/logs
        end
        CI->>Remote: rsync repo, npm ci, run tests (remote tee -> /tmp/test-output.log)
        Remote->>Sandbox: run `nemoclaw onboard` / openshell commands
        Sandbox-->>Remote: test output & logs
        Remote-->>CI: streamed output and fetched logs
    else USE_LAUNCHABLE=0
        CI->>Remote: run legacy `brev create` + `scripts/brev-setup.sh`
        Remote->>Remote: local bootstrap (vLLM gated by SKIP_VLLM), build sandbox
        Remote-->>CI: streamed bootstrap output and logs
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related issues

Poem

🐰 I hopped through logs with timestamps bright,

I skipped a vLLM when told to take flight,
I probed the bridge for sneaky command tricks,
I scrubbed the secrets with careful quick flicks,
Carrots, tests, and launchables — a joyful night!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 16.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: adding security E2E tests for command injection and credential sanitization via Brev infrastructure.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/e2e-brev.yaml:
- Line 99: The workflow's timeout-minutes (symbol: timeout-minutes) is too low
relative to the test suite budget: increase the GitHub Actions job timeout to
exceed the test's beforeAll 45-minute claim plus the two possible 10-minute test
slots and overhead so afterAll can run; update timeout-minutes to a safe value
(e.g., 90 or 120) in the workflow so runs with TEST_SUITE=all and the
beforeAll/afterAll in test/e2e/brev-e2e.test.js can complete without being
terminated.

In `@test/e2e/brev-e2e.test.js`:
- Around line 111-123: The remote pipeline is losing failure status because
pipefail isn't enabled; update runRemoteTest (and the similar block at lines
223-225) so the remote shell runs with pipefail enabled before streaming and
teeing outputs — e.g., ensure the constructed remote command uses "set -o
pipefail" or invokes bash with -o pipefail (so failures in `bash ${scriptPath} |
tee /tmp/test-output.log` and `npm ci ... | tail -5` propagate); modify the
command built for sshWithSecrets/ssh in runRemoteTest to enable pipefail so the
exit code of the pipeline reflects the first failing command.
- Around line 177-205: The polling loop that checks /tmp/launch-plugin.log and
~/.cache/nemoclaw-plugin/install-ran (using the ssh calls and variables
setupStart, setupMaxWait, setupPollInterval) currently falls through after 40
minutes allowing later steps to run; modify the code so that if neither
readiness marker ("=== Ready ===" in log nor markerCheck returning DONE) is
detected before Date.now() - setupStart >= setupMaxWait, the test fails fast by
throwing an error (or calling process.exit(1)) with a clear message including
both markers and elapsed time; place this failure immediately after the while
loop (or by breaking the loop into a condition that throws when timed out) so
rsync/npm ci/nemoclaw onboard are not executed on an incomplete VM.
- Around line 37-42: LAUNCHABLE_SETUP_SCRIPT currently points to a branch ref
and USE_LAUNCHABLE defaults to using it; change LAUNCHABLE_SETUP_SCRIPT to an
immutable source (replace the branch ref with a commit SHA in the raw GitHub URL
or vendor the script into the repo and point the constant to that vendored path)
and keep USE_LAUNCHABLE logic intact. Also update the test's setup readiness
loop (the wait loop used to detect initialization completion) to surface a clear
failure: after the existing polling loop add explicit timeout handling that
throws or fails the test when the overall wait exceeds the intended deadline
(rather than silently proceeding), so the test stops with a clear error if
initialization never completes.

In `@test/e2e/test-credential-sanitization.sh`:
- Around line 214-278: The test currently embeds local implementations of
stripCredentials and walkAndRemoveFile instead of invoking NemoClaw’s real
migration sanitizer from migration-state.ts; update the test to call the actual
sanitizer (the exported function(s) in migration-state.ts) and the repository’s
migration-aware removal logic so assertions reflect production behavior (e.g.,
gateway removal vs. partial stripping) rather than the bundled
stripCredentials/walkAndRemoveFile copies; locate and replace usages of
stripCredentials and walkAndRemoveFile in this script (and the other noted
ranges 315-367, 439-468) to import/execute the migration-state
sanitizer/exported functions and assert on its real output.
- Around line 486-590: The test is self-fulfilling because it defines local
helpers (verifyBlueprintDigest_FIXED and verifyDigest) inside node -e and never
exercises the repository's real digest-verification code; replace the inline
helpers with calls that exercise the actual verification implementation (e.g.,
require the module that exports the real verify/validate function or invoke the
CLI/entrypoint that performs manifest digest checks) and feed it crafted
manifests (empty string, undefined, wrong digest, correct digest) and blueprint
content so the test fails if the repo code is vulnerable; update the assertions
to look for the real tool's output/exit status instead of the local helper
results (reference the verifyBlueprintDigest_FIXED/verifyDigest helpers in the
diff to locate the test sections to modify).

In `@test/e2e/test-telegram-injection.sh`:
- Around line 86-109: The test currently injects messages via an SSH command
that echoes the payload (send_message_to_sandbox), which bypasses the real
bridge path and therefore misses re-evaluation issues; change the test to drive
the actual Telegram bridge invocation used in scripts/telegram-bridge.js (the
same path as runAgentInSandbox / shellQuote(message) / safeSessionId /
nemoclaw-start -m ...) so the message is passed through the bridge’s argument
handling rather than via echo/$(cat). Concretely, replace the SSH payload string
in send_message_to_sandbox with a call that runs the bridge entry (the same
command runAgentInSandbox would use, passing the session id and message as
-m/--message and using the bridge’s quoting/safe-session logic) so the tests
exercise the real message-path quoting and session handling.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c43264aa-fc09-4107-a384-03bbbcb23a8a

📥 Commits

Reviewing files that changed from the base of the PR and between f59f58e and bb8fd23.

📒 Files selected for processing (6)
  • .github/workflows/e2e-brev.yaml
  • scripts/brev-setup.sh
  • scripts/setup.sh
  • test/e2e/brev-e2e.test.js
  • test/e2e/test-credential-sanitization.sh
  • test/e2e/test-telegram-injection.sh

Comment on lines +37 to +42
const LAUNCHABLE_SETUP_SCRIPT =
"https://raw.githubusercontent.com/NVIDIA/OpenShell-Community/refs/heads/feat/brev-nemoclaw-plugin/brev/launch-nemoclaw.sh";
const NEMOCLAW_REPO_URL = "https://github.com/NVIDIA/NemoClaw.git";

// Use launchable by default; set USE_LAUNCHABLE=0 or USE_LAUNCHABLE=false to fall back to brev-setup.sh
const USE_LAUNCHABLE = !["0", "false"].includes(process.env.USE_LAUNCHABLE?.toLowerCase());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Check the test file content at the specified lines
head -n 165 test/e2e/brev-e2e.test.js | tail -n +30

Repository: NVIDIA/NemoClaw

Length of output: 5536


🏁 Script executed:

# Get full file size to understand scope
wc -l test/e2e/brev-e2e.test.js

Repository: NVIDIA/NemoClaw

Length of output: 88


🏁 Script executed:

# Search for all references to LAUNCHABLE_SETUP_SCRIPT and USE_LAUNCHABLE
rg "LAUNCHABLE_SETUP_SCRIPT|USE_LAUNCHABLE" -n

Repository: NVIDIA/NemoClaw

Length of output: 621


🏁 Script executed:

# Check for CI/CD configuration files that might reference this test
fd -t f -e yml -e yaml -e json | head -20 | xargs grep -l "brev-e2e\|e2e" 2>/dev/null

Repository: NVIDIA/NemoClaw

Length of output: 43


🏁 Script executed:

# Read the launchable path section in detail (lines 144-180)
sed -n '144,180p' test/e2e/brev-e2e.test.js

Repository: NVIDIA/NemoClaw

Length of output: 2035


🏁 Script executed:

# Check if there's any validation, pinning, or comments about the feature branch
rg "feat/brev-nemoclaw-plugin|LAUNCHABLE|launch-nemoclaw" -B 3 -A 3

Repository: NVIDIA/NemoClaw

Length of output: 2577


🏁 Script executed:

# Look for any recent commits or comments indicating this is temporary/experimental
git log --oneline -n 20 -- test/e2e/brev-e2e.test.js 2>/dev/null | head -10

Repository: NVIDIA/NemoClaw

Length of output: 142


🏁 Script executed:

# Check if there's a .github/workflows or similar CI directory
fd -t d ".github|.gitlab|.circleci" -o "workflows|pipelines" 2>/dev/null

Repository: NVIDIA/NemoClaw

Length of output: 41


🏁 Script executed:

# Read the setup wait loop (continuing from line 180)
sed -n '180,220p' test/e2e/brev-e2e.test.js

Repository: NVIDIA/NemoClaw

Length of output: 2321


🏁 Script executed:

# Check if there are any documented requirements about the feature branch or script pinning
rg "OpenShell-Community|feat/brev" -B 5 -A 5

Repository: NVIDIA/NemoClaw

Length of output: 3032


🏁 Script executed:

# Look for any tests or validation of the external script
rg "LAUNCHABLE|setup-script" -B 2 -A 2

Repository: NVIDIA/NemoClaw

Length of output: 2244


Pin the launchable bootstrap to an immutable revision.

LAUNCHABLE_SETUP_SCRIPT points at a raw GitHub URL on a feature branch (refs/heads/feat/brev-nemoclaw-plugin), and USE_LAUNCHABLE makes that path the default. A force-push on that branch can silently change what privileged CI machines execute, making the suite non-reproducible and widening the supply-chain blast radius. Pin to a commit SHA or vendor the script in this repository.

Additionally, the setup readiness loop (lines 180–220) times out silently after 40 minutes without raising an error, allowing the test to proceed with incomplete initialization. Add explicit timeout handling after the wait loop.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/brev-e2e.test.js` around lines 37 - 42, LAUNCHABLE_SETUP_SCRIPT
currently points to a branch ref and USE_LAUNCHABLE defaults to using it; change
LAUNCHABLE_SETUP_SCRIPT to an immutable source (replace the branch ref with a
commit SHA in the raw GitHub URL or vendor the script into the repo and point
the constant to that vendored path) and keep USE_LAUNCHABLE logic intact. Also
update the test's setup readiness loop (the wait loop used to detect
initialization completion) to surface a clear failure: after the existing
polling loop add explicit timeout handling that throws or fails the test when
the overall wait exceeds the intended deadline (rather than silently
proceeding), so the test stops with a clear error if initialization never
completes.

Comment on lines +214 to +278
sanitize_result=$(cd "$REPO" && node -e "
const fs = require('fs');
const path = require('path');

// --- Credential field detection (mirrors migration-state.ts) ---
const CREDENTIAL_FIELDS = new Set([
'apiKey', 'api_key', 'token', 'secret', 'password', 'resolvedKey',
]);
const CREDENTIAL_FIELD_PATTERN =
/(?:access|refresh|client|bearer|auth|api|private|public|signing|session)(?:Token|Key|Secret|Password)$/;

function isCredentialField(key) {
return CREDENTIAL_FIELDS.has(key) || CREDENTIAL_FIELD_PATTERN.test(key);
}

function stripCredentials(obj) {
if (obj === null || obj === undefined) return obj;
if (typeof obj !== 'object') return obj;
if (Array.isArray(obj)) return obj.map(stripCredentials);
const result = {};
for (const [key, value] of Object.entries(obj)) {
if (isCredentialField(key)) {
result[key] = '[STRIPPED_BY_MIGRATION]';
} else {
result[key] = stripCredentials(value);
}
}
return result;
}

function walkAndRemoveFile(dirPath, targetName) {
let entries;
try { entries = fs.readdirSync(dirPath); } catch { return; }
for (const entry of entries) {
const fullPath = path.join(dirPath, entry);
try {
const stat = fs.lstatSync(fullPath);
if (stat.isSymbolicLink()) continue;
if (stat.isDirectory()) {
walkAndRemoveFile(fullPath, targetName);
} else if (entry === targetName) {
fs.rmSync(fullPath, { force: true });
}
} catch {}
}
}

const bundleDir = '$BUNDLE_DIR';

// 1. Remove auth-profiles.json
const agentsDir = path.join(bundleDir, 'agents');
if (fs.existsSync(agentsDir)) {
walkAndRemoveFile(agentsDir, 'auth-profiles.json');
}

// 2. Strip credential fields from openclaw.json
const configPath = path.join(bundleDir, 'openclaw.json');
if (fs.existsSync(configPath)) {
const config = JSON.parse(fs.readFileSync(configPath, 'utf-8'));
const sanitized = stripCredentials(config);
fs.writeFileSync(configPath, JSON.stringify(sanitized, null, 2));
}

console.log('SANITIZED');
" 2>&1)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Exercise the real migration sanitizer instead of local copies.

These checks validate hand-written stripCredentials() / walkAndRemoveFile() snippets rather than NemoClaw’s migration implementation. The copy already drifts from nemoclaw/src/commands/migration-state.ts: production deletes gateway entirely, while the assertions here expect gateway.mode plus a stripped gateway.auth.token. This suite can stay green while the shipped sanitizer or symlink handling regresses.

Also applies to: 315-367, 439-468

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-credential-sanitization.sh` around lines 214 - 278, The test
currently embeds local implementations of stripCredentials and walkAndRemoveFile
instead of invoking NemoClaw’s real migration sanitizer from migration-state.ts;
update the test to call the actual sanitizer (the exported function(s) in
migration-state.ts) and the repository’s migration-aware removal logic so
assertions reflect production behavior (e.g., gateway removal vs. partial
stripping) rather than the bundled stripCredentials/walkAndRemoveFile copies;
locate and replace usages of stripCredentials and walkAndRemoveFile in this
script (and the other noted ranges 315-367, 439-468) to import/execute the
migration-state sanitizer/exported functions and assert on its real output.

Comment on lines +86 to +109
send_message_to_sandbox() {
local message="$1"
local session_id="${2:-e2e-injection-test}"

local ssh_config
ssh_config="$(mktemp)"
openshell sandbox ssh-config "$SANDBOX_NAME" >"$ssh_config" 2>/dev/null

# Use the same mechanism as the bridge: pass message as an argument
# via SSH. The key security property is that the message must NOT be
# interpreted as shell code on the remote side.
local result
result=$(timeout 90 ssh -F "$ssh_config" \
-o StrictHostKeyChecking=no \
-o UserKnownHostsFile=/dev/null \
-o ConnectTimeout=10 \
-o LogLevel=ERROR \
"openshell-${SANDBOX_NAME}" \
"echo 'INJECTION_PROBE_START' && echo $(printf '%q' "$message") && echo 'INJECTION_PROBE_END'" \
2>&1) || true

rm -f "$ssh_config"
echo "$result"
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Drive the real Telegram bridge code for the message-path probes.

These tests send payloads through bespoke ssh ... 'MSG=$(cat) ...' commands, not through scripts/telegram-bridge.js’s runAgentInSandbox() path (shellQuote(message), safeSessionId, nemoclaw-start ... -m ...). Because MSG=$(cat) stores the payload after the shell has already parsed the command, $(), backticks, and ${VAR} in the message are never re-evaluated here. The probes can therefore pass even if the bridge regresses.

Also applies to: 188-277, 403-444

🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 88-88: session_id appears unused. Verify use (or export if used externally).

(SC2034)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-telegram-injection.sh` around lines 86 - 109, The test
currently injects messages via an SSH command that echoes the payload
(send_message_to_sandbox), which bypasses the real bridge path and therefore
misses re-evaluation issues; change the test to drive the actual Telegram bridge
invocation used in scripts/telegram-bridge.js (the same path as
runAgentInSandbox / shellQuote(message) / safeSessionId / nemoclaw-start -m ...)
so the message is passed through the bridge’s argument handling rather than via
echo/$(cat). Concretely, replace the SSH payload string in
send_message_to_sandbox with a call that runs the bridge entry (the same command
runAgentInSandbox would use, passing the session id and message as -m/--message
and using the bridge’s quoting/safe-session logic) so the tests exercise the
real message-path quoting and session handling.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
test/e2e/test-telegram-injection.sh (2)

92-115: Dead code: send_message_to_sandbox() is defined but never called.

This function is declared with detailed documentation about mirroring the bridge's behavior, but none of the test cases (T1–T8b) actually invoke it. All tests use inline SSH commands directly. Either:

  1. Update the tests to use this helper for consistency, or
  2. Remove the dead code

Additionally, the session_id parameter on line 94 is unused (SC2034).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-telegram-injection.sh` around lines 92 - 115, The helper
function send_message_to_sandbox is dead code and its session_id parameter is
unused; either remove the entire send_message_to_sandbox function and delete the
unused session_id, or update the test cases (T1–T8b) to call
send_message_to_sandbox instead of the current inline ssh commands so the helper
is actually used; if you keep the helper, remove or use the session_id parameter
to silence SC2034 and update all places that constructed SSH invocations to call
send_message_to_sandbox with the message (and session_id if needed) for
consistency.

40-40: Consider adding -e to set options for early failure detection.

The script uses set -uo pipefail but omits -e, meaning commands can fail without halting the script. While this may be intentional (allowing all tests to run and report), failures in critical setup sections (e.g., lines 162–167) might go unnoticed if not carefully handled.

If intentional, consider documenting this choice in the header comments.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-telegram-injection.sh` at line 40, The script's top-level "set
-uo pipefail" omits "-e", so failing commands (especially in the critical setup
block referenced by the reviewer) may not abort the run; either add "-e" to the
existing "set -uo pipefail" invocation to enable early exit on errors, or
explicitly document in the script header why "-e" was intentionally omitted and
ensure the setup block (the critical initialization steps) performs explicit
exit-on-failure checks (check return codes and call exit on failure) so failures
can't be silently ignored.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/test-telegram-injection.sh`:
- Around line 383-399: The test loop is vulnerable to shell command substitution
when invalid_name contains backticks because the node -e "..." string is
double-quoted; fix by avoiding unescaped interpolation into a double-quoted
command: pass the test value safely (e.g., export or pass invalid_name as a
positional argument or via printf '%s' to construct the node invocation) so
backticks are not executed, then call validateName(...) inside the Node snippet
using that safe input; update the node -e invocation that calls validateName to
read the value from process.env or process.argv instead of embedding
$invalid_name directly.

---

Nitpick comments:
In `@test/e2e/test-telegram-injection.sh`:
- Around line 92-115: The helper function send_message_to_sandbox is dead code
and its session_id parameter is unused; either remove the entire
send_message_to_sandbox function and delete the unused session_id, or update the
test cases (T1–T8b) to call send_message_to_sandbox instead of the current
inline ssh commands so the helper is actually used; if you keep the helper,
remove or use the session_id parameter to silence SC2034 and update all places
that constructed SSH invocations to call send_message_to_sandbox with the
message (and session_id if needed) for consistency.
- Line 40: The script's top-level "set -uo pipefail" omits "-e", so failing
commands (especially in the critical setup block referenced by the reviewer) may
not abort the run; either add "-e" to the existing "set -uo pipefail" invocation
to enable early exit on errors, or explicitly document in the script header why
"-e" was intentionally omitted and ensure the setup block (the critical
initialization steps) performs explicit exit-on-failure checks (check return
codes and call exit on failure) so failures can't be silently ignored.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ca5cdcab-3f73-401c-bcb1-d85860438a73

📥 Commits

Reviewing files that changed from the base of the PR and between b548976 and 9c4ed17.

📒 Files selected for processing (2)
  • test/e2e/brev-e2e.test.js
  • test/e2e/test-telegram-injection.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/brev-e2e.test.js

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
test/e2e/test-credential-sanitization.sh (2)

487-591: ⚠️ Potential issue | 🟠 Major

C9-C11 are still self-fulfilling digest tests.

The inline verifyBlueprintDigest_FIXED() / verifyDigest() helpers never call NemoClaw’s real blueprint verification path. As written, these cases only prove the sample implementation behaves as expected, so a repo regression can still pass this suite.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-credential-sanitization.sh` around lines 487 - 591, The tests
currently exercise only local helpers (verifyBlueprintDigest_FIXED,
verifyBlueprintDigest_VULNERABLE, verifyDigest) rather than NemoClaw’s real
verification routine, so replace the inline helper calls with calls into the
actual verifier exported by the codebase (require/import the module that exposes
the NemoClaw blueprint verification function, e.g. verifyBlueprint or whatever
exported symbol the project uses), remove the stub functions, and pass real
manifest + blueprint content into that verifier; update assertions to check the
real verifier’s result (accepted vs rejected) for empty/undefined, wrong, and
correct digest cases so the test fails on regressions.

213-279: ⚠️ Potential issue | 🟠 Major

Exercise NemoClaw’s real sanitizer/removal code here.

These sections still define their own stripCredentials() / walkAndRemoveFile() inside node -e, so they only validate the pasted helpers. A regression in the shipped migration bundle logic can still leave C1-C8 and C12-C13 green. Wire these cases through the actual migration sanitizer/removal path instead.

Also applies to: 440-469, 607-690

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-credential-sanitization.sh` around lines 213 - 279, The test is
currently validating local helpers (stripCredentials/walkAndRemoveFile) instead
of exercising the real migration code; change the node -e invocation to require
and call the actual migration sanitizer/removal functions (e.g. call
sanitizeCredentialsInBundle(bundleDir) and the bundle auth-file removal function
exported by the migration module) rather than defining stripCredentials or
walkAndRemoveFile inline; locate and import the module that exports
sanitizeCredentialsInBundle (and the auth-profiles removal function), pass the
same bundleDir used by the test, await/handle its result, and print 'SANITIZED'
only after the real functions complete so the test verifies the shipped
migration logic end-to-end.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/test-credential-sanitization.sh`:
- Around line 608-610: The test incorrectly treats keyRef as a preserved
credential field; update the CREDENTIAL_FIELDS Set in
test/e2e/test-credential-sanitization.sh (the constant CREDENTIAL_FIELDS) to
omit 'keyRef' (or remove any assertions that require keyRef to be preserved) so
the test expects keyRef to be stripped like production sanitizer; also apply the
same change to the other occurrences referenced around lines 697-734 to ensure
tests no longer fail when keyRef is intentionally removed by the sanitizer.
- Around line 73-90: sandbox_exec currently swallows exit codes from both the
`openshell sandbox ssh-config` and the `ssh` probe, so make it fail closed by
capturing and checking both exit statuses: run `openshell sandbox ssh-config`
and if it fails, remove the temp file and return a non-zero status (or echo a
sentinel like "__PROBE_FAILED__"), then run `ssh` capturing its exit code
separately (do not let `|| true` hide it); if `ssh` exits non-zero or produces
no stdout treat that as a failed probe (echo sentinel or propagate the ssh exit
code) and return non-zero so upstream checks fail. Apply the same treatment to
the other identical helper at the other location (the block around lines
395-417).

---

Duplicate comments:
In `@test/e2e/test-credential-sanitization.sh`:
- Around line 487-591: The tests currently exercise only local helpers
(verifyBlueprintDigest_FIXED, verifyBlueprintDigest_VULNERABLE, verifyDigest)
rather than NemoClaw’s real verification routine, so replace the inline helper
calls with calls into the actual verifier exported by the codebase
(require/import the module that exposes the NemoClaw blueprint verification
function, e.g. verifyBlueprint or whatever exported symbol the project uses),
remove the stub functions, and pass real manifest + blueprint content into that
verifier; update assertions to check the real verifier’s result (accepted vs
rejected) for empty/undefined, wrong, and correct digest cases so the test fails
on regressions.
- Around line 213-279: The test is currently validating local helpers
(stripCredentials/walkAndRemoveFile) instead of exercising the real migration
code; change the node -e invocation to require and call the actual migration
sanitizer/removal functions (e.g. call sanitizeCredentialsInBundle(bundleDir)
and the bundle auth-file removal function exported by the migration module)
rather than defining stripCredentials or walkAndRemoveFile inline; locate and
import the module that exports sanitizeCredentialsInBundle (and the
auth-profiles removal function), pass the same bundleDir used by the test,
await/handle its result, and print 'SANITIZED' only after the real functions
complete so the test verifies the shipped migration logic end-to-end.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b2ca588e-197f-452e-801a-3615b6806569

📥 Commits

Reviewing files that changed from the base of the PR and between 9c4ed17 and b86c294.

📒 Files selected for processing (1)
  • test/e2e/test-credential-sanitization.sh

…nitization

Adds two new Brev E2E test suites targeting the vulnerabilities fixed by
PR NVIDIA#119 (Telegram bridge command injection) and PR NVIDIA#156 (credential
exposure in migration snapshots + blueprint digest bypass).

Test suites:
- test-telegram-injection.sh: 8 tests covering command substitution,
  backtick injection, quote-breakout, parameter expansion, process
  table leaks, and SANDBOX_NAME validation
- test-credential-sanitization.sh: 13 tests covering auth-profiles.json
  deletion, credential field stripping, non-credential preservation,
  symlink safety, blueprint digest verification, and pattern-based
  field detection

These tests are expected to FAIL on main (unfixed code) and PASS
once PR NVIDIA#119 and NVIDIA#156 are merged.

Refs: NVIDIA#118, NVIDIA#119, NVIDIA#156, NVIDIA#813
- Add SKIP_VLLM=1 support to brev-setup.sh
- Use SKIP_VLLM=1 in brev-e2e.test.js bootstrap
- Bump beforeAll timeout to 30 min for CPU instances
- Bump workflow timeout to 60 min for 3 test suites
- Stream SSH output to CI log during bootstrap (no more silence)
- Add timestamps to brev-setup.sh and setup.sh info/warn/fail messages
- Add background progress reporter during sandbox Docker build (heartbeat every 30s showing elapsed time, current Docker step, and last log line)
- Stream test script output to CI log via tee + capture for assertions
- Filter potential secrets from progress heartbeat output
Replace bare 'brev create' + brev-setup.sh with 'brev start' using the
OpenShell-Community launch-nemoclaw.sh setup script. This installs Docker,
OpenShell CLI, and Node.js via the launchable's proven path, then runs
'nemoclaw onboard --non-interactive' to build the sandbox (testing whether
this path is faster than our manual setup.sh).

Changes:
- Default CPU back to 4x16 (8x32 didn't help — bottleneck was I/O)
- Launchable path: brev start + setup-script URL, poll for completion,
  rsync PR branch, npm ci, nemoclaw onboard
- Legacy path preserved (USE_LAUNCHABLE=0)
- Timestamped logging throughout for timing comparison
- New use_launchable workflow input (default: true)
… mode

openshell sandbox create without a command defaults to opening an interactive
shell inside the sandbox. In CI (non-interactive SSH), this hangs forever —
the sandbox goes Ready but the command never returns. The [?2004h] terminal
escape codes in CI logs were bash enabling bracketed paste mode, waiting for
input.

Add --no-tty -- true so the command exits immediately after the sandbox is
created and Ready.
The launchable setup script installs Node.js via nvm, which sets up PATH
in ~/.nvm/nvm.sh. Non-interactive SSH doesn't source .bashrc, so npm/node
commands fail with 'command not found'. Source nvm.sh before running npm
in the launchable path and runRemoteTest.
setup.sh defaulted to 'nemoclaw' ignoring the NEMOCLAW_SANDBOX_NAME env
var set by the CI test harness (e2e-test). Now uses $1 > $NEMOCLAW_SANDBOX_NAME > nemoclaw.
…box)

The full E2E test runs install.sh --non-interactive which destroys and
rebuilds the sandbox. When TEST_SUITE=all, this kills the sandbox that
beforeAll created, causing credential-sanitization and telegram-injection
to fail with 'sandbox not running'. Only run full E2E when TEST_SUITE=full.
On forks or before the first base-image workflow run, the GHCR base image
(ghcr.io/nvidia/nemoclaw/sandbox-base:latest) doesn't exist. This causes
the Dockerfile's FROM to fail. Now setup.sh checks for the base image
and builds Dockerfile.base locally if needed.

On subsequent builds, Docker layer cache makes this near-instant.
Once the GHCR base image is available, this becomes a no-op (docker pull
succeeds and the local build is skipped).
brev-setup.sh creates the sandbox but doesn't install the host-side
nemoclaw CLI that test scripts need for 'nemoclaw <name> status'.
Add npm install + build + link step after bootstrap.
setup.sh creates the sandbox via openshell directly but never writes
~/.nemoclaw/sandboxes.json. The security test scripts check
`nemoclaw <name> status` which reads the registry, causing all E2E
runs to fail with 'Sandbox e2e-test not running'.

Write the registry entry after nemoclaw CLI install so the test
scripts can find the sandbox.
C7 greps for 'npm_' inside the sandbox and false-positives on
nemoclaw-blueprint/policies/presets/npm.yaml which contains rule
names like 'npm_yarn', not actual credentials. Filter out /policies/
paths from all three pattern checks.
Document what each test_suite option runs so maintainers can make an
informed choice from the Actions UI without reading the test scripts.
Re-enable the github.repository check so the workflow only runs on
NVIDIA/NemoClaw, not on forks.
…nv var

setup.sh now uses ${1:-${NEMOCLAW_SANDBOX_NAME:-nemoclaw}} instead of
${1:-nemoclaw}. Update the test to match and add coverage for the env
var fallback path.
…s, shell injection in test

- Bump e2e-brev workflow timeout-minutes from 60 to 90
- Add fail-fast when launchable setup exceeds 40-min wait
- Add pipefail to remote pipeline commands in runRemoteTest and npm ci
- Fix backtick shell injection in validateName test loop (use process.argv)
- Make sandbox_exec fail closed with __PROBE_FAILED__ sentinel
- Add probe failure checks in C6/C7 sandbox assertions
@jyaunches jyaunches force-pushed the feat/security-e2e-tests branch from b86c294 to 32687e3 Compare March 30, 2026 16:19
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
test/e2e/test-credential-sanitization.sh (1)

226-290: ⚠️ Potential issue | 🟠 Major

Exercise NemoClaw’s real sanitizer and digest-verification paths.

These node -e blocks still define local credential-stripping, symlink-walk, and digest-verification helpers, so the suite can pass without touching the code it is supposed to protect. The copy has already drifted from production — keyRef is omitted from CREDENTIAL_FIELDS here and C13 treats preserving it as success — which means these checks can miss real regressions and also report correct shipped behavior as broken. Please drive C1-C5/C8/C9-C13 through the actual migration-state / blueprint verification entrypoints instead of inline helpers.

Based on learnings, keyRef is intentionally included in CREDENTIAL_FIELDS in nemoclaw/src/commands/migration-state.ts because it currently holds actual secret material.

Also applies to: 455-484, 502-542, 559-606, 622-752

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/test-credential-sanitization.sh`:
- Around line 331-367: The test currently shells out to python3 to parse JSON
(variables nvidia_apikey, gateway_token, model_primary, gateway_mode), which
adds an undeclared Python dependency; replace each echo "$config_content" |
python3 -c "..." block with an echo "$config_content" | node -e "const
d=require('fs').readFileSync(0,'utf8'); try{const cfg=JSON.parse(d);
console.log(<field-expr>)}catch(e){process.exitCode=1}" pattern so Node parses
the JSON and prints the same fields (use cfg.nvidia?.apiKey ?? 'MISSING',
cfg.gateway?.auth?.token ?? 'MISSING', cfg.agents?.defaults?.model?.primary ??
'MISSING', cfg.gateway?.mode ?? 'MISSING'), preserving the existing || echo
"PARSE_ERROR" fallback behavior; update the four spots that currently set
nvidia_apikey, gateway_token, model_primary, and gateway_mode.
- Around line 75-102: The remote probe pipelines run under SSH without pipefail
so earlier stages (find/grep) can fail while the final command (head) returns 0,
producing empty output and a false "no leaks" result; update sandbox_exec to
execute the remote command under a shell with pipefail (e.g., run the provided
"$cmd" via bash -o pipefail -c ...) so any pipeline failure returns non‑zero and
causes sandbox_exec to emit the "__PROBE_FAILED__" sentinel, and apply the same
change to the other probe invocations that use pipeline commands (the other
sandbox probe blocks referenced).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 87d5e831-522e-4128-bb20-09465d31d7c6

📥 Commits

Reviewing files that changed from the base of the PR and between b86c294 and 32687e3.

📒 Files selected for processing (7)
  • .github/workflows/e2e-brev.yaml
  • scripts/brev-setup.sh
  • scripts/setup.sh
  • test/e2e/brev-e2e.test.js
  • test/e2e/test-credential-sanitization.sh
  • test/e2e/test-telegram-injection.sh
  • test/setup-sandbox-name.test.js
✅ Files skipped from review due to trivial changes (1)
  • test/e2e/brev-e2e.test.js
🚧 Files skipped from review as they are similar to previous changes (4)
  • scripts/brev-setup.sh
  • .github/workflows/e2e-brev.yaml
  • test/e2e/test-telegram-injection.sh
  • scripts/setup.sh

Comment on lines +75 to +102
sandbox_exec() {
local cmd="$1"
local ssh_config
ssh_config="$(mktemp)"
if ! openshell sandbox ssh-config "$SANDBOX_NAME" >"$ssh_config" 2>/dev/null; then
rm -f "$ssh_config"
echo "__PROBE_FAILED__"
return 1
fi

local result
local rc=0
result=$(timeout 60 ssh -F "$ssh_config" \
-o StrictHostKeyChecking=no \
-o UserKnownHostsFile=/dev/null \
-o ConnectTimeout=10 \
-o LogLevel=ERROR \
"openshell-${SANDBOX_NAME}" \
"$cmd" \
2>&1) || rc=$?

rm -f "$ssh_config"
if [ "$rc" -ne 0 ] && [ -z "$result" ]; then
echo "__PROBE_FAILED__"
return 1
fi
echo "$result"
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

C6/C7 can still pass when the remote probe errors out.

The commands sent through sandbox_exec() are pipelines with stderr discarded (find ... | head, grep ... | ... | head). If find/grep fails because a target path is missing or unreadable, head still exits 0, sandbox_exec() sees empty output, and the check reports “no leaks” even though nothing was inspected. Run these probes under remote pipefail or return an explicit sentinel when any stage of the pipeline fails.

Also applies to: 408-425, 427-433

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-credential-sanitization.sh` around lines 75 - 102, The remote
probe pipelines run under SSH without pipefail so earlier stages (find/grep) can
fail while the final command (head) returns 0, producing empty output and a
false "no leaks" result; update sandbox_exec to execute the remote command under
a shell with pipefail (e.g., run the provided "$cmd" via bash -o pipefail -c
...) so any pipeline failure returns non‑zero and causes sandbox_exec to emit
the "__PROBE_FAILED__" sentinel, and apply the same change to the other probe
invocations that use pipeline commands (the other sandbox probe blocks
referenced).

Comment on lines +331 to +367
nvidia_apikey=$(echo "$config_content" | python3 -c "
import json, sys
config = json.load(sys.stdin)
print(config.get('nvidia', {}).get('apiKey', 'MISSING'))
" 2>/dev/null || echo "PARSE_ERROR")

gateway_token=$(echo "$config_content" | python3 -c "
import json, sys
config = json.load(sys.stdin)
print(config.get('gateway', {}).get('auth', {}).get('token', 'MISSING'))
" 2>/dev/null || echo "PARSE_ERROR")

if [ "$nvidia_apikey" = "[STRIPPED_BY_MIGRATION]" ]; then
pass "C3a: nvidia.apiKey replaced with sentinel"
else
fail "C3a: nvidia.apiKey not sanitized (got: $nvidia_apikey)"
fi

if [ "$gateway_token" = "[STRIPPED_BY_MIGRATION]" ]; then
pass "C3b: gateway.auth.token replaced with sentinel"
else
fail "C3b: gateway.auth.token not sanitized (got: $gateway_token)"
fi

# C4: Non-credential fields must be preserved
info "C4: Checking non-credential field preservation..."
model_primary=$(echo "$config_content" | python3 -c "
import json, sys
config = json.load(sys.stdin)
print(config.get('agents', {}).get('defaults', {}).get('model', {}).get('primary', 'MISSING'))
" 2>/dev/null || echo "PARSE_ERROR")

gateway_mode=$(echo "$config_content" | python3 -c "
import json, sys
config = json.load(sys.stdin)
print(config.get('gateway', {}).get('mode', 'MISSING'))
" 2>/dev/null || echo "PARSE_ERROR")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

C3/C4 add an undeclared Python dependency.

These assertions shell out to python3, but Phase 0 never checks for it and the rest of the script already standardizes on Node. On images without Python 3, this suite will fail with PARSE_ERROR for test-infrastructure reasons rather than sanitization regressions. Either add a prerequisite check or do these JSON reads with node.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-credential-sanitization.sh` around lines 331 - 367, The test
currently shells out to python3 to parse JSON (variables nvidia_apikey,
gateway_token, model_primary, gateway_mode), which adds an undeclared Python
dependency; replace each echo "$config_content" | python3 -c "..." block with an
echo "$config_content" | node -e "const d=require('fs').readFileSync(0,'utf8');
try{const cfg=JSON.parse(d);
console.log(<field-expr>)}catch(e){process.exitCode=1}" pattern so Node parses
the JSON and prints the same fields (use cfg.nvidia?.apiKey ?? 'MISSING',
cfg.gateway?.auth?.token ?? 'MISSING', cfg.agents?.defaults?.model?.primary ??
'MISSING', cfg.gateway?.mode ?? 'MISSING'), preserving the existing || echo
"PARSE_ERROR" fallback behavior; update the four spots that currently set
nvidia_apikey, gateway_token, model_primary, and gateway_mode.

Copy link
Copy Markdown
Contributor

@cv cv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pushing this — the direction is good, but I don't think this is mergeable as-is.

Main issues I verified:

  1. test/e2e/test-credential-sanitization.sh reimplements stripCredentials, walkAndRemoveFile, and digest verification inline instead of invoking the real code in nemoclaw/src/commands/migration-state.ts. That makes large parts of the suite self-fulfilling rather than true end-to-end coverage.

  2. The credential suite does not appear to drive the real snapshot/restore path end-to-end, so it is not yet validating the production migration flow it claims to cover.

  3. test/e2e/test-telegram-injection.sh does not exercise scripts/telegram-bridge.js / runAgentInSandbox(). It uses ad-hoc SSH commands like MSG=$(cat) && echo ..., which bypass the actual quoting/session-handling path we need confidence in.

  4. test/e2e/brev-e2e.test.js defaults to downloading/executing a raw setup script from a feature branch in OpenShell-Community. Please pin that to an immutable commit SHA or vendor it into this repo.

I think this is worth landing once the tests are wired to the real production paths and the launchable bootstrap is pinned. Also please rebase after #1090 / current main, since this PR currently carries the setup script changes too.

Copy link
Copy Markdown
Contributor

@cv cv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talked to @jyaunches offline and we're okay with the changes I requested being done in follow-up PRs

@cv cv merged commit bd5425c into NVIDIA:main Mar 30, 2026
8 checks passed
@wscurran wscurran added Platform: Brev Support for Brev deployment security Something isn't secure CI/CD Use this label to identify issues with NemoClaw CI/CD pipeline or GitHub Actions. labels Mar 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI/CD Use this label to identify issues with NemoClaw CI/CD pipeline or GitHub Actions. Platform: Brev Support for Brev deployment security Something isn't secure

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants