test(security): add Brev E2E tests for command injection and credential sanitization#1092
test(security): add Brev E2E tests for command injection and credential sanitization#1092cv merged 26 commits intoNVIDIA:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds optional NemoClaw "launchable" provisioning (USE_LAUNCHABLE), expands CI workflow inputs (adds Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant CI as CI Runner
participant Remote as Remote VM
participant Launch as NemoClaw Launchable
participant Sandbox as Sandbox
CI->>Remote: SSH to provision (send USE_LAUNCHABLE)
alt USE_LAUNCHABLE=1
Remote->>Launch: run launch-nemoclaw.sh (request provisioning)
Launch-->>Remote: stream provisioning status
loop Poll readiness
CI->>Remote: check readiness marker/logs
end
CI->>Remote: rsync repo, npm ci, run tests (remote tee -> /tmp/test-output.log)
Remote->>Sandbox: run `nemoclaw onboard` / openshell commands
Sandbox-->>Remote: test output & logs
Remote-->>CI: streamed output and fetched logs
else USE_LAUNCHABLE=0
CI->>Remote: run legacy `brev create` + `scripts/brev-setup.sh`
Remote->>Remote: local bootstrap (vLLM gated by SKIP_VLLM), build sandbox
Remote-->>CI: streamed bootstrap output and logs
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related issues
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/e2e-brev.yaml:
- Line 99: The workflow's timeout-minutes (symbol: timeout-minutes) is too low
relative to the test suite budget: increase the GitHub Actions job timeout to
exceed the test's beforeAll 45-minute claim plus the two possible 10-minute test
slots and overhead so afterAll can run; update timeout-minutes to a safe value
(e.g., 90 or 120) in the workflow so runs with TEST_SUITE=all and the
beforeAll/afterAll in test/e2e/brev-e2e.test.js can complete without being
terminated.
In `@test/e2e/brev-e2e.test.js`:
- Around line 111-123: The remote pipeline is losing failure status because
pipefail isn't enabled; update runRemoteTest (and the similar block at lines
223-225) so the remote shell runs with pipefail enabled before streaming and
teeing outputs — e.g., ensure the constructed remote command uses "set -o
pipefail" or invokes bash with -o pipefail (so failures in `bash ${scriptPath} |
tee /tmp/test-output.log` and `npm ci ... | tail -5` propagate); modify the
command built for sshWithSecrets/ssh in runRemoteTest to enable pipefail so the
exit code of the pipeline reflects the first failing command.
- Around line 177-205: The polling loop that checks /tmp/launch-plugin.log and
~/.cache/nemoclaw-plugin/install-ran (using the ssh calls and variables
setupStart, setupMaxWait, setupPollInterval) currently falls through after 40
minutes allowing later steps to run; modify the code so that if neither
readiness marker ("=== Ready ===" in log nor markerCheck returning DONE) is
detected before Date.now() - setupStart >= setupMaxWait, the test fails fast by
throwing an error (or calling process.exit(1)) with a clear message including
both markers and elapsed time; place this failure immediately after the while
loop (or by breaking the loop into a condition that throws when timed out) so
rsync/npm ci/nemoclaw onboard are not executed on an incomplete VM.
- Around line 37-42: LAUNCHABLE_SETUP_SCRIPT currently points to a branch ref
and USE_LAUNCHABLE defaults to using it; change LAUNCHABLE_SETUP_SCRIPT to an
immutable source (replace the branch ref with a commit SHA in the raw GitHub URL
or vendor the script into the repo and point the constant to that vendored path)
and keep USE_LAUNCHABLE logic intact. Also update the test's setup readiness
loop (the wait loop used to detect initialization completion) to surface a clear
failure: after the existing polling loop add explicit timeout handling that
throws or fails the test when the overall wait exceeds the intended deadline
(rather than silently proceeding), so the test stops with a clear error if
initialization never completes.
In `@test/e2e/test-credential-sanitization.sh`:
- Around line 214-278: The test currently embeds local implementations of
stripCredentials and walkAndRemoveFile instead of invoking NemoClaw’s real
migration sanitizer from migration-state.ts; update the test to call the actual
sanitizer (the exported function(s) in migration-state.ts) and the repository’s
migration-aware removal logic so assertions reflect production behavior (e.g.,
gateway removal vs. partial stripping) rather than the bundled
stripCredentials/walkAndRemoveFile copies; locate and replace usages of
stripCredentials and walkAndRemoveFile in this script (and the other noted
ranges 315-367, 439-468) to import/execute the migration-state
sanitizer/exported functions and assert on its real output.
- Around line 486-590: The test is self-fulfilling because it defines local
helpers (verifyBlueprintDigest_FIXED and verifyDigest) inside node -e and never
exercises the repository's real digest-verification code; replace the inline
helpers with calls that exercise the actual verification implementation (e.g.,
require the module that exports the real verify/validate function or invoke the
CLI/entrypoint that performs manifest digest checks) and feed it crafted
manifests (empty string, undefined, wrong digest, correct digest) and blueprint
content so the test fails if the repo code is vulnerable; update the assertions
to look for the real tool's output/exit status instead of the local helper
results (reference the verifyBlueprintDigest_FIXED/verifyDigest helpers in the
diff to locate the test sections to modify).
In `@test/e2e/test-telegram-injection.sh`:
- Around line 86-109: The test currently injects messages via an SSH command
that echoes the payload (send_message_to_sandbox), which bypasses the real
bridge path and therefore misses re-evaluation issues; change the test to drive
the actual Telegram bridge invocation used in scripts/telegram-bridge.js (the
same path as runAgentInSandbox / shellQuote(message) / safeSessionId /
nemoclaw-start -m ...) so the message is passed through the bridge’s argument
handling rather than via echo/$(cat). Concretely, replace the SSH payload string
in send_message_to_sandbox with a call that runs the bridge entry (the same
command runAgentInSandbox would use, passing the session id and message as
-m/--message and using the bridge’s quoting/safe-session logic) so the tests
exercise the real message-path quoting and session handling.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c43264aa-fc09-4107-a384-03bbbcb23a8a
📒 Files selected for processing (6)
.github/workflows/e2e-brev.yamlscripts/brev-setup.shscripts/setup.shtest/e2e/brev-e2e.test.jstest/e2e/test-credential-sanitization.shtest/e2e/test-telegram-injection.sh
| const LAUNCHABLE_SETUP_SCRIPT = | ||
| "https://raw.githubusercontent.com/NVIDIA/OpenShell-Community/refs/heads/feat/brev-nemoclaw-plugin/brev/launch-nemoclaw.sh"; | ||
| const NEMOCLAW_REPO_URL = "https://github.com/NVIDIA/NemoClaw.git"; | ||
|
|
||
| // Use launchable by default; set USE_LAUNCHABLE=0 or USE_LAUNCHABLE=false to fall back to brev-setup.sh | ||
| const USE_LAUNCHABLE = !["0", "false"].includes(process.env.USE_LAUNCHABLE?.toLowerCase()); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Check the test file content at the specified lines
head -n 165 test/e2e/brev-e2e.test.js | tail -n +30Repository: NVIDIA/NemoClaw
Length of output: 5536
🏁 Script executed:
# Get full file size to understand scope
wc -l test/e2e/brev-e2e.test.jsRepository: NVIDIA/NemoClaw
Length of output: 88
🏁 Script executed:
# Search for all references to LAUNCHABLE_SETUP_SCRIPT and USE_LAUNCHABLE
rg "LAUNCHABLE_SETUP_SCRIPT|USE_LAUNCHABLE" -nRepository: NVIDIA/NemoClaw
Length of output: 621
🏁 Script executed:
# Check for CI/CD configuration files that might reference this test
fd -t f -e yml -e yaml -e json | head -20 | xargs grep -l "brev-e2e\|e2e" 2>/dev/nullRepository: NVIDIA/NemoClaw
Length of output: 43
🏁 Script executed:
# Read the launchable path section in detail (lines 144-180)
sed -n '144,180p' test/e2e/brev-e2e.test.jsRepository: NVIDIA/NemoClaw
Length of output: 2035
🏁 Script executed:
# Check if there's any validation, pinning, or comments about the feature branch
rg "feat/brev-nemoclaw-plugin|LAUNCHABLE|launch-nemoclaw" -B 3 -A 3Repository: NVIDIA/NemoClaw
Length of output: 2577
🏁 Script executed:
# Look for any recent commits or comments indicating this is temporary/experimental
git log --oneline -n 20 -- test/e2e/brev-e2e.test.js 2>/dev/null | head -10Repository: NVIDIA/NemoClaw
Length of output: 142
🏁 Script executed:
# Check if there's a .github/workflows or similar CI directory
fd -t d ".github|.gitlab|.circleci" -o "workflows|pipelines" 2>/dev/nullRepository: NVIDIA/NemoClaw
Length of output: 41
🏁 Script executed:
# Read the setup wait loop (continuing from line 180)
sed -n '180,220p' test/e2e/brev-e2e.test.jsRepository: NVIDIA/NemoClaw
Length of output: 2321
🏁 Script executed:
# Check if there are any documented requirements about the feature branch or script pinning
rg "OpenShell-Community|feat/brev" -B 5 -A 5Repository: NVIDIA/NemoClaw
Length of output: 3032
🏁 Script executed:
# Look for any tests or validation of the external script
rg "LAUNCHABLE|setup-script" -B 2 -A 2Repository: NVIDIA/NemoClaw
Length of output: 2244
Pin the launchable bootstrap to an immutable revision.
LAUNCHABLE_SETUP_SCRIPT points at a raw GitHub URL on a feature branch (refs/heads/feat/brev-nemoclaw-plugin), and USE_LAUNCHABLE makes that path the default. A force-push on that branch can silently change what privileged CI machines execute, making the suite non-reproducible and widening the supply-chain blast radius. Pin to a commit SHA or vendor the script in this repository.
Additionally, the setup readiness loop (lines 180–220) times out silently after 40 minutes without raising an error, allowing the test to proceed with incomplete initialization. Add explicit timeout handling after the wait loop.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/brev-e2e.test.js` around lines 37 - 42, LAUNCHABLE_SETUP_SCRIPT
currently points to a branch ref and USE_LAUNCHABLE defaults to using it; change
LAUNCHABLE_SETUP_SCRIPT to an immutable source (replace the branch ref with a
commit SHA in the raw GitHub URL or vendor the script into the repo and point
the constant to that vendored path) and keep USE_LAUNCHABLE logic intact. Also
update the test's setup readiness loop (the wait loop used to detect
initialization completion) to surface a clear failure: after the existing
polling loop add explicit timeout handling that throws or fails the test when
the overall wait exceeds the intended deadline (rather than silently
proceeding), so the test stops with a clear error if initialization never
completes.
| sanitize_result=$(cd "$REPO" && node -e " | ||
| const fs = require('fs'); | ||
| const path = require('path'); | ||
|
|
||
| // --- Credential field detection (mirrors migration-state.ts) --- | ||
| const CREDENTIAL_FIELDS = new Set([ | ||
| 'apiKey', 'api_key', 'token', 'secret', 'password', 'resolvedKey', | ||
| ]); | ||
| const CREDENTIAL_FIELD_PATTERN = | ||
| /(?:access|refresh|client|bearer|auth|api|private|public|signing|session)(?:Token|Key|Secret|Password)$/; | ||
|
|
||
| function isCredentialField(key) { | ||
| return CREDENTIAL_FIELDS.has(key) || CREDENTIAL_FIELD_PATTERN.test(key); | ||
| } | ||
|
|
||
| function stripCredentials(obj) { | ||
| if (obj === null || obj === undefined) return obj; | ||
| if (typeof obj !== 'object') return obj; | ||
| if (Array.isArray(obj)) return obj.map(stripCredentials); | ||
| const result = {}; | ||
| for (const [key, value] of Object.entries(obj)) { | ||
| if (isCredentialField(key)) { | ||
| result[key] = '[STRIPPED_BY_MIGRATION]'; | ||
| } else { | ||
| result[key] = stripCredentials(value); | ||
| } | ||
| } | ||
| return result; | ||
| } | ||
|
|
||
| function walkAndRemoveFile(dirPath, targetName) { | ||
| let entries; | ||
| try { entries = fs.readdirSync(dirPath); } catch { return; } | ||
| for (const entry of entries) { | ||
| const fullPath = path.join(dirPath, entry); | ||
| try { | ||
| const stat = fs.lstatSync(fullPath); | ||
| if (stat.isSymbolicLink()) continue; | ||
| if (stat.isDirectory()) { | ||
| walkAndRemoveFile(fullPath, targetName); | ||
| } else if (entry === targetName) { | ||
| fs.rmSync(fullPath, { force: true }); | ||
| } | ||
| } catch {} | ||
| } | ||
| } | ||
|
|
||
| const bundleDir = '$BUNDLE_DIR'; | ||
|
|
||
| // 1. Remove auth-profiles.json | ||
| const agentsDir = path.join(bundleDir, 'agents'); | ||
| if (fs.existsSync(agentsDir)) { | ||
| walkAndRemoveFile(agentsDir, 'auth-profiles.json'); | ||
| } | ||
|
|
||
| // 2. Strip credential fields from openclaw.json | ||
| const configPath = path.join(bundleDir, 'openclaw.json'); | ||
| if (fs.existsSync(configPath)) { | ||
| const config = JSON.parse(fs.readFileSync(configPath, 'utf-8')); | ||
| const sanitized = stripCredentials(config); | ||
| fs.writeFileSync(configPath, JSON.stringify(sanitized, null, 2)); | ||
| } | ||
|
|
||
| console.log('SANITIZED'); | ||
| " 2>&1) |
There was a problem hiding this comment.
Exercise the real migration sanitizer instead of local copies.
These checks validate hand-written stripCredentials() / walkAndRemoveFile() snippets rather than NemoClaw’s migration implementation. The copy already drifts from nemoclaw/src/commands/migration-state.ts: production deletes gateway entirely, while the assertions here expect gateway.mode plus a stripped gateway.auth.token. This suite can stay green while the shipped sanitizer or symlink handling regresses.
Also applies to: 315-367, 439-468
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/test-credential-sanitization.sh` around lines 214 - 278, The test
currently embeds local implementations of stripCredentials and walkAndRemoveFile
instead of invoking NemoClaw’s real migration sanitizer from migration-state.ts;
update the test to call the actual sanitizer (the exported function(s) in
migration-state.ts) and the repository’s migration-aware removal logic so
assertions reflect production behavior (e.g., gateway removal vs. partial
stripping) rather than the bundled stripCredentials/walkAndRemoveFile copies;
locate and replace usages of stripCredentials and walkAndRemoveFile in this
script (and the other noted ranges 315-367, 439-468) to import/execute the
migration-state sanitizer/exported functions and assert on its real output.
| send_message_to_sandbox() { | ||
| local message="$1" | ||
| local session_id="${2:-e2e-injection-test}" | ||
|
|
||
| local ssh_config | ||
| ssh_config="$(mktemp)" | ||
| openshell sandbox ssh-config "$SANDBOX_NAME" >"$ssh_config" 2>/dev/null | ||
|
|
||
| # Use the same mechanism as the bridge: pass message as an argument | ||
| # via SSH. The key security property is that the message must NOT be | ||
| # interpreted as shell code on the remote side. | ||
| local result | ||
| result=$(timeout 90 ssh -F "$ssh_config" \ | ||
| -o StrictHostKeyChecking=no \ | ||
| -o UserKnownHostsFile=/dev/null \ | ||
| -o ConnectTimeout=10 \ | ||
| -o LogLevel=ERROR \ | ||
| "openshell-${SANDBOX_NAME}" \ | ||
| "echo 'INJECTION_PROBE_START' && echo $(printf '%q' "$message") && echo 'INJECTION_PROBE_END'" \ | ||
| 2>&1) || true | ||
|
|
||
| rm -f "$ssh_config" | ||
| echo "$result" | ||
| } |
There was a problem hiding this comment.
Drive the real Telegram bridge code for the message-path probes.
These tests send payloads through bespoke ssh ... 'MSG=$(cat) ...' commands, not through scripts/telegram-bridge.js’s runAgentInSandbox() path (shellQuote(message), safeSessionId, nemoclaw-start ... -m ...). Because MSG=$(cat) stores the payload after the shell has already parsed the command, $(), backticks, and ${VAR} in the message are never re-evaluated here. The probes can therefore pass even if the bridge regresses.
Also applies to: 188-277, 403-444
🧰 Tools
🪛 Shellcheck (0.11.0)
[warning] 88-88: session_id appears unused. Verify use (or export if used externally).
(SC2034)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/test-telegram-injection.sh` around lines 86 - 109, The test
currently injects messages via an SSH command that echoes the payload
(send_message_to_sandbox), which bypasses the real bridge path and therefore
misses re-evaluation issues; change the test to drive the actual Telegram bridge
invocation used in scripts/telegram-bridge.js (the same path as
runAgentInSandbox / shellQuote(message) / safeSessionId / nemoclaw-start -m ...)
so the message is passed through the bridge’s argument handling rather than via
echo/$(cat). Concretely, replace the SSH payload string in
send_message_to_sandbox with a call that runs the bridge entry (the same command
runAgentInSandbox would use, passing the session id and message as -m/--message
and using the bridge’s quoting/safe-session logic) so the tests exercise the
real message-path quoting and session handling.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
test/e2e/test-telegram-injection.sh (2)
92-115: Dead code:send_message_to_sandbox()is defined but never called.This function is declared with detailed documentation about mirroring the bridge's behavior, but none of the test cases (T1–T8b) actually invoke it. All tests use inline SSH commands directly. Either:
- Update the tests to use this helper for consistency, or
- Remove the dead code
Additionally, the
session_idparameter on line 94 is unused (SC2034).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/test-telegram-injection.sh` around lines 92 - 115, The helper function send_message_to_sandbox is dead code and its session_id parameter is unused; either remove the entire send_message_to_sandbox function and delete the unused session_id, or update the test cases (T1–T8b) to call send_message_to_sandbox instead of the current inline ssh commands so the helper is actually used; if you keep the helper, remove or use the session_id parameter to silence SC2034 and update all places that constructed SSH invocations to call send_message_to_sandbox with the message (and session_id if needed) for consistency.
40-40: Consider adding-etosetoptions for early failure detection.The script uses
set -uo pipefailbut omits-e, meaning commands can fail without halting the script. While this may be intentional (allowing all tests to run and report), failures in critical setup sections (e.g., lines 162–167) might go unnoticed if not carefully handled.If intentional, consider documenting this choice in the header comments.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/test-telegram-injection.sh` at line 40, The script's top-level "set -uo pipefail" omits "-e", so failing commands (especially in the critical setup block referenced by the reviewer) may not abort the run; either add "-e" to the existing "set -uo pipefail" invocation to enable early exit on errors, or explicitly document in the script header why "-e" was intentionally omitted and ensure the setup block (the critical initialization steps) performs explicit exit-on-failure checks (check return codes and call exit on failure) so failures can't be silently ignored.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/test-telegram-injection.sh`:
- Around line 383-399: The test loop is vulnerable to shell command substitution
when invalid_name contains backticks because the node -e "..." string is
double-quoted; fix by avoiding unescaped interpolation into a double-quoted
command: pass the test value safely (e.g., export or pass invalid_name as a
positional argument or via printf '%s' to construct the node invocation) so
backticks are not executed, then call validateName(...) inside the Node snippet
using that safe input; update the node -e invocation that calls validateName to
read the value from process.env or process.argv instead of embedding
$invalid_name directly.
---
Nitpick comments:
In `@test/e2e/test-telegram-injection.sh`:
- Around line 92-115: The helper function send_message_to_sandbox is dead code
and its session_id parameter is unused; either remove the entire
send_message_to_sandbox function and delete the unused session_id, or update the
test cases (T1–T8b) to call send_message_to_sandbox instead of the current
inline ssh commands so the helper is actually used; if you keep the helper,
remove or use the session_id parameter to silence SC2034 and update all places
that constructed SSH invocations to call send_message_to_sandbox with the
message (and session_id if needed) for consistency.
- Line 40: The script's top-level "set -uo pipefail" omits "-e", so failing
commands (especially in the critical setup block referenced by the reviewer) may
not abort the run; either add "-e" to the existing "set -uo pipefail" invocation
to enable early exit on errors, or explicitly document in the script header why
"-e" was intentionally omitted and ensure the setup block (the critical
initialization steps) performs explicit exit-on-failure checks (check return
codes and call exit on failure) so failures can't be silently ignored.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ca5cdcab-3f73-401c-bcb1-d85860438a73
📒 Files selected for processing (2)
test/e2e/brev-e2e.test.jstest/e2e/test-telegram-injection.sh
🚧 Files skipped from review as they are similar to previous changes (1)
- test/e2e/brev-e2e.test.js
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (2)
test/e2e/test-credential-sanitization.sh (2)
487-591:⚠️ Potential issue | 🟠 MajorC9-C11 are still self-fulfilling digest tests.
The inline
verifyBlueprintDigest_FIXED()/verifyDigest()helpers never call NemoClaw’s real blueprint verification path. As written, these cases only prove the sample implementation behaves as expected, so a repo regression can still pass this suite.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/test-credential-sanitization.sh` around lines 487 - 591, The tests currently exercise only local helpers (verifyBlueprintDigest_FIXED, verifyBlueprintDigest_VULNERABLE, verifyDigest) rather than NemoClaw’s real verification routine, so replace the inline helper calls with calls into the actual verifier exported by the codebase (require/import the module that exposes the NemoClaw blueprint verification function, e.g. verifyBlueprint or whatever exported symbol the project uses), remove the stub functions, and pass real manifest + blueprint content into that verifier; update assertions to check the real verifier’s result (accepted vs rejected) for empty/undefined, wrong, and correct digest cases so the test fails on regressions.
213-279:⚠️ Potential issue | 🟠 MajorExercise NemoClaw’s real sanitizer/removal code here.
These sections still define their own
stripCredentials()/walkAndRemoveFile()insidenode -e, so they only validate the pasted helpers. A regression in the shipped migration bundle logic can still leave C1-C8 and C12-C13 green. Wire these cases through the actual migration sanitizer/removal path instead.Also applies to: 440-469, 607-690
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/test-credential-sanitization.sh` around lines 213 - 279, The test is currently validating local helpers (stripCredentials/walkAndRemoveFile) instead of exercising the real migration code; change the node -e invocation to require and call the actual migration sanitizer/removal functions (e.g. call sanitizeCredentialsInBundle(bundleDir) and the bundle auth-file removal function exported by the migration module) rather than defining stripCredentials or walkAndRemoveFile inline; locate and import the module that exports sanitizeCredentialsInBundle (and the auth-profiles removal function), pass the same bundleDir used by the test, await/handle its result, and print 'SANITIZED' only after the real functions complete so the test verifies the shipped migration logic end-to-end.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/test-credential-sanitization.sh`:
- Around line 608-610: The test incorrectly treats keyRef as a preserved
credential field; update the CREDENTIAL_FIELDS Set in
test/e2e/test-credential-sanitization.sh (the constant CREDENTIAL_FIELDS) to
omit 'keyRef' (or remove any assertions that require keyRef to be preserved) so
the test expects keyRef to be stripped like production sanitizer; also apply the
same change to the other occurrences referenced around lines 697-734 to ensure
tests no longer fail when keyRef is intentionally removed by the sanitizer.
- Around line 73-90: sandbox_exec currently swallows exit codes from both the
`openshell sandbox ssh-config` and the `ssh` probe, so make it fail closed by
capturing and checking both exit statuses: run `openshell sandbox ssh-config`
and if it fails, remove the temp file and return a non-zero status (or echo a
sentinel like "__PROBE_FAILED__"), then run `ssh` capturing its exit code
separately (do not let `|| true` hide it); if `ssh` exits non-zero or produces
no stdout treat that as a failed probe (echo sentinel or propagate the ssh exit
code) and return non-zero so upstream checks fail. Apply the same treatment to
the other identical helper at the other location (the block around lines
395-417).
---
Duplicate comments:
In `@test/e2e/test-credential-sanitization.sh`:
- Around line 487-591: The tests currently exercise only local helpers
(verifyBlueprintDigest_FIXED, verifyBlueprintDigest_VULNERABLE, verifyDigest)
rather than NemoClaw’s real verification routine, so replace the inline helper
calls with calls into the actual verifier exported by the codebase
(require/import the module that exposes the NemoClaw blueprint verification
function, e.g. verifyBlueprint or whatever exported symbol the project uses),
remove the stub functions, and pass real manifest + blueprint content into that
verifier; update assertions to check the real verifier’s result (accepted vs
rejected) for empty/undefined, wrong, and correct digest cases so the test fails
on regressions.
- Around line 213-279: The test is currently validating local helpers
(stripCredentials/walkAndRemoveFile) instead of exercising the real migration
code; change the node -e invocation to require and call the actual migration
sanitizer/removal functions (e.g. call sanitizeCredentialsInBundle(bundleDir)
and the bundle auth-file removal function exported by the migration module)
rather than defining stripCredentials or walkAndRemoveFile inline; locate and
import the module that exports sanitizeCredentialsInBundle (and the
auth-profiles removal function), pass the same bundleDir used by the test,
await/handle its result, and print 'SANITIZED' only after the real functions
complete so the test verifies the shipped migration logic end-to-end.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b2ca588e-197f-452e-801a-3615b6806569
📒 Files selected for processing (1)
test/e2e/test-credential-sanitization.sh
…nitization Adds two new Brev E2E test suites targeting the vulnerabilities fixed by PR NVIDIA#119 (Telegram bridge command injection) and PR NVIDIA#156 (credential exposure in migration snapshots + blueprint digest bypass). Test suites: - test-telegram-injection.sh: 8 tests covering command substitution, backtick injection, quote-breakout, parameter expansion, process table leaks, and SANDBOX_NAME validation - test-credential-sanitization.sh: 13 tests covering auth-profiles.json deletion, credential field stripping, non-credential preservation, symlink safety, blueprint digest verification, and pattern-based field detection These tests are expected to FAIL on main (unfixed code) and PASS once PR NVIDIA#119 and NVIDIA#156 are merged. Refs: NVIDIA#118, NVIDIA#119, NVIDIA#156, NVIDIA#813
- Add SKIP_VLLM=1 support to brev-setup.sh - Use SKIP_VLLM=1 in brev-e2e.test.js bootstrap - Bump beforeAll timeout to 30 min for CPU instances - Bump workflow timeout to 60 min for 3 test suites
- Stream SSH output to CI log during bootstrap (no more silence) - Add timestamps to brev-setup.sh and setup.sh info/warn/fail messages - Add background progress reporter during sandbox Docker build (heartbeat every 30s showing elapsed time, current Docker step, and last log line) - Stream test script output to CI log via tee + capture for assertions - Filter potential secrets from progress heartbeat output
Replace bare 'brev create' + brev-setup.sh with 'brev start' using the OpenShell-Community launch-nemoclaw.sh setup script. This installs Docker, OpenShell CLI, and Node.js via the launchable's proven path, then runs 'nemoclaw onboard --non-interactive' to build the sandbox (testing whether this path is faster than our manual setup.sh). Changes: - Default CPU back to 4x16 (8x32 didn't help — bottleneck was I/O) - Launchable path: brev start + setup-script URL, poll for completion, rsync PR branch, npm ci, nemoclaw onboard - Legacy path preserved (USE_LAUNCHABLE=0) - Timestamped logging throughout for timing comparison - New use_launchable workflow input (default: true)
… mode openshell sandbox create without a command defaults to opening an interactive shell inside the sandbox. In CI (non-interactive SSH), this hangs forever — the sandbox goes Ready but the command never returns. The [?2004h] terminal escape codes in CI logs were bash enabling bracketed paste mode, waiting for input. Add --no-tty -- true so the command exits immediately after the sandbox is created and Ready.
The launchable setup script installs Node.js via nvm, which sets up PATH in ~/.nvm/nvm.sh. Non-interactive SSH doesn't source .bashrc, so npm/node commands fail with 'command not found'. Source nvm.sh before running npm in the launchable path and runRemoteTest.
setup.sh defaulted to 'nemoclaw' ignoring the NEMOCLAW_SANDBOX_NAME env var set by the CI test harness (e2e-test). Now uses $1 > $NEMOCLAW_SANDBOX_NAME > nemoclaw.
…box) The full E2E test runs install.sh --non-interactive which destroys and rebuilds the sandbox. When TEST_SUITE=all, this kills the sandbox that beforeAll created, causing credential-sanitization and telegram-injection to fail with 'sandbox not running'. Only run full E2E when TEST_SUITE=full.
On forks or before the first base-image workflow run, the GHCR base image (ghcr.io/nvidia/nemoclaw/sandbox-base:latest) doesn't exist. This causes the Dockerfile's FROM to fail. Now setup.sh checks for the base image and builds Dockerfile.base locally if needed. On subsequent builds, Docker layer cache makes this near-instant. Once the GHCR base image is available, this becomes a no-op (docker pull succeeds and the local build is skipped).
brev-setup.sh creates the sandbox but doesn't install the host-side nemoclaw CLI that test scripts need for 'nemoclaw <name> status'. Add npm install + build + link step after bootstrap.
setup.sh creates the sandbox via openshell directly but never writes ~/.nemoclaw/sandboxes.json. The security test scripts check `nemoclaw <name> status` which reads the registry, causing all E2E runs to fail with 'Sandbox e2e-test not running'. Write the registry entry after nemoclaw CLI install so the test scripts can find the sandbox.
C7 greps for 'npm_' inside the sandbox and false-positives on nemoclaw-blueprint/policies/presets/npm.yaml which contains rule names like 'npm_yarn', not actual credentials. Filter out /policies/ paths from all three pattern checks.
Document what each test_suite option runs so maintainers can make an informed choice from the Actions UI without reading the test scripts.
Re-enable the github.repository check so the workflow only runs on NVIDIA/NemoClaw, not on forks.
…nv var
setup.sh now uses ${1:-${NEMOCLAW_SANDBOX_NAME:-nemoclaw}} instead of
${1:-nemoclaw}. Update the test to match and add coverage for the env
var fallback path.
…s, shell injection in test - Bump e2e-brev workflow timeout-minutes from 60 to 90 - Add fail-fast when launchable setup exceeds 40-min wait - Add pipefail to remote pipeline commands in runRemoteTest and npm ci - Fix backtick shell injection in validateName test loop (use process.argv) - Make sandbox_exec fail closed with __PROBE_FAILED__ sentinel - Add probe failure checks in C6/C7 sandbox assertions
b86c294 to
32687e3
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
test/e2e/test-credential-sanitization.sh (1)
226-290:⚠️ Potential issue | 🟠 MajorExercise NemoClaw’s real sanitizer and digest-verification paths.
These
node -eblocks still define local credential-stripping, symlink-walk, and digest-verification helpers, so the suite can pass without touching the code it is supposed to protect. The copy has already drifted from production —keyRefis omitted fromCREDENTIAL_FIELDShere and C13 treats preserving it as success — which means these checks can miss real regressions and also report correct shipped behavior as broken. Please drive C1-C5/C8/C9-C13 through the actual migration-state / blueprint verification entrypoints instead of inline helpers.Based on learnings,
keyRefis intentionally included inCREDENTIAL_FIELDSinnemoclaw/src/commands/migration-state.tsbecause it currently holds actual secret material.Also applies to: 455-484, 502-542, 559-606, 622-752
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/test-credential-sanitization.sh`:
- Around line 331-367: The test currently shells out to python3 to parse JSON
(variables nvidia_apikey, gateway_token, model_primary, gateway_mode), which
adds an undeclared Python dependency; replace each echo "$config_content" |
python3 -c "..." block with an echo "$config_content" | node -e "const
d=require('fs').readFileSync(0,'utf8'); try{const cfg=JSON.parse(d);
console.log(<field-expr>)}catch(e){process.exitCode=1}" pattern so Node parses
the JSON and prints the same fields (use cfg.nvidia?.apiKey ?? 'MISSING',
cfg.gateway?.auth?.token ?? 'MISSING', cfg.agents?.defaults?.model?.primary ??
'MISSING', cfg.gateway?.mode ?? 'MISSING'), preserving the existing || echo
"PARSE_ERROR" fallback behavior; update the four spots that currently set
nvidia_apikey, gateway_token, model_primary, and gateway_mode.
- Around line 75-102: The remote probe pipelines run under SSH without pipefail
so earlier stages (find/grep) can fail while the final command (head) returns 0,
producing empty output and a false "no leaks" result; update sandbox_exec to
execute the remote command under a shell with pipefail (e.g., run the provided
"$cmd" via bash -o pipefail -c ...) so any pipeline failure returns non‑zero and
causes sandbox_exec to emit the "__PROBE_FAILED__" sentinel, and apply the same
change to the other probe invocations that use pipeline commands (the other
sandbox probe blocks referenced).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 87d5e831-522e-4128-bb20-09465d31d7c6
📒 Files selected for processing (7)
.github/workflows/e2e-brev.yamlscripts/brev-setup.shscripts/setup.shtest/e2e/brev-e2e.test.jstest/e2e/test-credential-sanitization.shtest/e2e/test-telegram-injection.shtest/setup-sandbox-name.test.js
✅ Files skipped from review due to trivial changes (1)
- test/e2e/brev-e2e.test.js
🚧 Files skipped from review as they are similar to previous changes (4)
- scripts/brev-setup.sh
- .github/workflows/e2e-brev.yaml
- test/e2e/test-telegram-injection.sh
- scripts/setup.sh
| sandbox_exec() { | ||
| local cmd="$1" | ||
| local ssh_config | ||
| ssh_config="$(mktemp)" | ||
| if ! openshell sandbox ssh-config "$SANDBOX_NAME" >"$ssh_config" 2>/dev/null; then | ||
| rm -f "$ssh_config" | ||
| echo "__PROBE_FAILED__" | ||
| return 1 | ||
| fi | ||
|
|
||
| local result | ||
| local rc=0 | ||
| result=$(timeout 60 ssh -F "$ssh_config" \ | ||
| -o StrictHostKeyChecking=no \ | ||
| -o UserKnownHostsFile=/dev/null \ | ||
| -o ConnectTimeout=10 \ | ||
| -o LogLevel=ERROR \ | ||
| "openshell-${SANDBOX_NAME}" \ | ||
| "$cmd" \ | ||
| 2>&1) || rc=$? | ||
|
|
||
| rm -f "$ssh_config" | ||
| if [ "$rc" -ne 0 ] && [ -z "$result" ]; then | ||
| echo "__PROBE_FAILED__" | ||
| return 1 | ||
| fi | ||
| echo "$result" | ||
| } |
There was a problem hiding this comment.
C6/C7 can still pass when the remote probe errors out.
The commands sent through sandbox_exec() are pipelines with stderr discarded (find ... | head, grep ... | ... | head). If find/grep fails because a target path is missing or unreadable, head still exits 0, sandbox_exec() sees empty output, and the check reports “no leaks” even though nothing was inspected. Run these probes under remote pipefail or return an explicit sentinel when any stage of the pipeline fails.
Also applies to: 408-425, 427-433
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/test-credential-sanitization.sh` around lines 75 - 102, The remote
probe pipelines run under SSH without pipefail so earlier stages (find/grep) can
fail while the final command (head) returns 0, producing empty output and a
false "no leaks" result; update sandbox_exec to execute the remote command under
a shell with pipefail (e.g., run the provided "$cmd" via bash -o pipefail -c
...) so any pipeline failure returns non‑zero and causes sandbox_exec to emit
the "__PROBE_FAILED__" sentinel, and apply the same change to the other probe
invocations that use pipeline commands (the other sandbox probe blocks
referenced).
| nvidia_apikey=$(echo "$config_content" | python3 -c " | ||
| import json, sys | ||
| config = json.load(sys.stdin) | ||
| print(config.get('nvidia', {}).get('apiKey', 'MISSING')) | ||
| " 2>/dev/null || echo "PARSE_ERROR") | ||
|
|
||
| gateway_token=$(echo "$config_content" | python3 -c " | ||
| import json, sys | ||
| config = json.load(sys.stdin) | ||
| print(config.get('gateway', {}).get('auth', {}).get('token', 'MISSING')) | ||
| " 2>/dev/null || echo "PARSE_ERROR") | ||
|
|
||
| if [ "$nvidia_apikey" = "[STRIPPED_BY_MIGRATION]" ]; then | ||
| pass "C3a: nvidia.apiKey replaced with sentinel" | ||
| else | ||
| fail "C3a: nvidia.apiKey not sanitized (got: $nvidia_apikey)" | ||
| fi | ||
|
|
||
| if [ "$gateway_token" = "[STRIPPED_BY_MIGRATION]" ]; then | ||
| pass "C3b: gateway.auth.token replaced with sentinel" | ||
| else | ||
| fail "C3b: gateway.auth.token not sanitized (got: $gateway_token)" | ||
| fi | ||
|
|
||
| # C4: Non-credential fields must be preserved | ||
| info "C4: Checking non-credential field preservation..." | ||
| model_primary=$(echo "$config_content" | python3 -c " | ||
| import json, sys | ||
| config = json.load(sys.stdin) | ||
| print(config.get('agents', {}).get('defaults', {}).get('model', {}).get('primary', 'MISSING')) | ||
| " 2>/dev/null || echo "PARSE_ERROR") | ||
|
|
||
| gateway_mode=$(echo "$config_content" | python3 -c " | ||
| import json, sys | ||
| config = json.load(sys.stdin) | ||
| print(config.get('gateway', {}).get('mode', 'MISSING')) | ||
| " 2>/dev/null || echo "PARSE_ERROR") |
There was a problem hiding this comment.
C3/C4 add an undeclared Python dependency.
These assertions shell out to python3, but Phase 0 never checks for it and the rest of the script already standardizes on Node. On images without Python 3, this suite will fail with PARSE_ERROR for test-infrastructure reasons rather than sanitization regressions. Either add a prerequisite check or do these JSON reads with node.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/test-credential-sanitization.sh` around lines 331 - 367, The test
currently shells out to python3 to parse JSON (variables nvidia_apikey,
gateway_token, model_primary, gateway_mode), which adds an undeclared Python
dependency; replace each echo "$config_content" | python3 -c "..." block with an
echo "$config_content" | node -e "const d=require('fs').readFileSync(0,'utf8');
try{const cfg=JSON.parse(d);
console.log(<field-expr>)}catch(e){process.exitCode=1}" pattern so Node parses
the JSON and prints the same fields (use cfg.nvidia?.apiKey ?? 'MISSING',
cfg.gateway?.auth?.token ?? 'MISSING', cfg.agents?.defaults?.model?.primary ??
'MISSING', cfg.gateway?.mode ?? 'MISSING'), preserving the existing || echo
"PARSE_ERROR" fallback behavior; update the four spots that currently set
nvidia_apikey, gateway_token, model_primary, and gateway_mode.
cv
left a comment
There was a problem hiding this comment.
Thanks for pushing this — the direction is good, but I don't think this is mergeable as-is.
Main issues I verified:
-
test/e2e/test-credential-sanitization.shreimplementsstripCredentials,walkAndRemoveFile, and digest verification inline instead of invoking the real code innemoclaw/src/commands/migration-state.ts. That makes large parts of the suite self-fulfilling rather than true end-to-end coverage. -
The credential suite does not appear to drive the real snapshot/restore path end-to-end, so it is not yet validating the production migration flow it claims to cover.
-
test/e2e/test-telegram-injection.shdoes not exercisescripts/telegram-bridge.js/runAgentInSandbox(). It uses ad-hoc SSH commands likeMSG=$(cat) && echo ..., which bypass the actual quoting/session-handling path we need confidence in. -
test/e2e/brev-e2e.test.jsdefaults to downloading/executing a raw setup script from a feature branch inOpenShell-Community. Please pin that to an immutable commit SHA or vendor it into this repo.
I think this is worth landing once the tests are wired to the real production paths and the launchable bootstrap is pinned. Also please rebase after #1090 / current main, since this PR currently carries the setup script changes too.
cv
left a comment
There was a problem hiding this comment.
Talked to @jyaunches offline and we're okay with the changes I requested being done in follow-up PRs
Summary
Adds an ephemeral Brev-based E2E test infrastructure for running security test suites against a live NemoClaw sandbox. Validates the fixes from PR #584 (command injection) and PR #743 (credential sanitization) end-to-end.
Validated on fork: Run #23668176481 — 24/24 credential sanitization + 18/18 telegram injection ✅
Depends on: #1090 (setup.sh CI improvements — timestamped output,
NEMOCLAW_SANDBOX_NAME,--no-tty, base image fallback)What already existed in the repo
nightly-e2e.yaml— runstest/e2e/test-full-e2e.shnightly on GitHub-hosted runners. Tests the full install → onboard → inference journey. Runs directly on the runner (no remote instance).test/e2e/test-full-e2e.sh— the existing full E2E script testing the complete user journey.vitest.config.ts— hadcliandpluginprojects.What this adds
New files
.github/workflows/e2e-brev.yamltest/e2e/brev-e2e.test.jstest/e2e/test-credential-sanitization.shtest/e2e/test-telegram-injection.sh$(cmd)substitution (T1), backtick injection (T2), quote breakout (T3),${VAR}expansion (T4), process table leak checks (T5),SANDBOX_NAMEvalidation with shell metacharacters (T6–T7), and normal message regression (T8).Modified files
vitest.config.tse2e-brevproject, gated byBREV_API_TOKENenv var so it never runs in normalnpx vitest.Key design decisions
keep_alive: true).--cpu 4x16) — OpenShell gateway and sandbox work without GPU. Security tests don't need inference.NVIDIA_API_KEYandGITHUB_TOKENare piped to the remote VM through SSH stdin, never appearing in CLI args or process listings.setup.shcreates sandboxes viaopenshelldirectly without writing to~/.nemoclaw/sandboxes.json. The test orchestrator writes the registry entry after bootstrap sonemoclaw <name> statusworks.fullsuite isolation —fullrunsinstall.sh --non-interactivewhich destroys and recreates the sandbox. The security tests need the existing sandbox, soallruns onlycredential-sanitization+telegram-injection(notfull).What needs to be in the repo to run this
Secrets (configured by repo admin)
BREV_API_TOKENbrev login, then copy therefresh_tokenfrom~/.brev/credentials.json.NVIDIA_API_KEYInfrastructure
Nemoclaw CI/CD).--cpuflag).How a maintainer uses this
Running from the Actions UI
full— complete user journey: install → onboard → inference → CLI ops (~10 min, destroys sandbox)credential-sanitization— 24 tests for migration credential stripping, digest verification, symlink safetytelegram-injection— 18 tests for shell injection prevention across all attack vectorsall— runs both security suites (notfull, which would destroy the sandbox they need)trueto SSH in after,falseto auto-cleanuppr_numberis set, a check run and comment are posted to the PR.Calling from another workflow
SSH debugging (keep_alive: true)
When
keep_aliveis enabled, the instance stays running after tests. The workflow logs and PR comment include connection instructions:Adding new test suites
To add a new E2E test suite:
Create a test script in
test/e2e/following the existing pattern (pass()/fail()/skip()functions,section()for phases, prerequisite checks in Phase 0, summary with pass/fail counts).Add the suite to
brev-e2e.test.js— add a newit.runIf(...)block:Add the choice to the workflow — add your suite name to the
optionslist ine2e-brev.yamlundertest_suite, and add a description to the workflow header comment.Note: If your test needs the sandbox created during
beforeAll, add it to theallgroup. If it creates/destroys its own sandbox (likefull), keep it separate.Files changed
Summary by CodeRabbit
New Features
Improvements
Tests