
ci(nightly-e2e): remove misleading environment from cloud-experimental job#1085

Merged
cv merged 8 commits into NVIDIA:main from mercl-lau:e2e-test-20260325
Mar 30, 2026
Conversation

@mercl-lau
Contributor

@mercl-lau mercl-lau commented Mar 30, 2026

Summary

Remove the environment: NVIDIA_API_KEY binding from the cloud-experimental-e2e job in the nightly workflow. GitHub treats environment as a deployment target, so the binding produced a confusing Deployments page entry whose name looked like a secret. The job already receives the API key via env from the repository secret secrets.NVIDIA_API_KEY, matching the cloud-e2e job.
Also change the cloud-experimental Phase 3 install log default path to include ${UID}, so root and non-root runs do not share the same file.
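The UID-suffixed default described above can be sketched as a standard shell parameter-expansion fallback. The snippet is illustrative; the variable names mirror the PR description, not necessarily the script's exact code:

```shell
# Illustrative sketch of the new default: honor a caller-provided
# INSTALL_LOG, otherwise fall back to a per-user path so root and
# non-root runs write to different files.
uid="${UID:-$(id -u)}"   # bash sets UID; fall back to id -u for plain sh
INSTALL_LOG="${INSTALL_LOG:-/tmp/nemoclaw-e2e-cloud-experimental-install.${uid}.log}"
echo "$INSTALL_LOG"
```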

Related Issue

N/A (internal CI clarity / follow-up from team discussion).

Changes

  • .github/workflows/nightly-e2e.yaml: drop environment: NVIDIA_API_KEY from cloud-experimental-e2e.
  • test/e2e/test-e2e-cloud-experimental.sh: default INSTALL_LOG uses /tmp/nemoclaw-e2e-cloud-experimental-install.${UID}.log; comment updated accordingly.

Type of Change

  • Code change for a new feature, bug fix, or refactor.
  • Code change with doc updates.
  • Doc only. Prose changes without code sample modifications.
  • Doc only. Includes code sample changes.

Testing

  • npx prek run --all-files passes (or equivalently make check).
  • npm test passes.
  • make docs builds without warnings. (for doc-only changes)

Checklist

General

Code Changes

  • Formatters applied — npx prek run --all-files auto-fixes formatting (or make format for targeted runs).
  • Tests added or updated for new or changed behavior.
  • No secrets, API keys, or credentials committed.
  • Doc pages updated for any user-facing behavior changes (new commands, changed defaults, new features, bug fixes that contradict existing docs).

Doc Changes

N/A — no doc-only section required for this PR.

Summary by CodeRabbit

  • New Features

    • Added a Spark install smoke test and a reusable teardown/verify cleanup step for experimental e2e runs.
    • Added phase-tag controls to run or skip specific e2e phases.
  • Tests

    • Enhanced cloud-experimental e2e: configurable provider matching, ordered egress probes (PyPI/npm/Hugging Face), proxy and skip flags, and refined timeouts.
    • Docs-link probing: URL ignore lists, normalization, and clearer progress/reporting.
    • Increased test timeouts for a couple of unit tests.
  • Chores

    • CI workflow: extended job timeout, adjusted artifact naming, and unconditional final cleanup/check steps.

…ental job

Avoid spurious deployment records; secrets still come from secrets.NVIDIA_API_KEY.

Made-with: Cursor
@coderabbitai
Contributor

coderabbitai bot commented Mar 30, 2026

📝 Walkthrough

Walkthrough

E2E workflow and multiple test scripts were updated: added phase-tag gating, delegated cleanup to a new teardown script, improved docs-link skipping, expanded network-policy probes, made inference-provider checks configurable, added a Spark install smoke test, and increased a few test timeouts; workflow now skips embedded cleanup/docs and runs always-on teardown steps.

Changes

Cohort / File(s) Summary
Workflow
.github/workflows/nightly-e2e.yaml
Removed job-level environment: NVIDIA_API_KEY; increased timeout-minutes 45→90; main run sets RUN_E2E_CLOUD_EXPERIMENTAL_SKIP_FINAL_CLEANUP=1 and RUN_E2E_CLOUD_EXPERIMENTAL_SKIP_CHECK_DOCS=1; added three if: always() post-steps (check-docs.sh, skip/05-network-policy.sh, cleanup.sh --verify); changed experimental install log artifact path to /tmp/nemoclaw-e2e-cloud-experimental-install.log.
E2E controller
test/e2e/test-e2e-cloud-experimental.sh
Added e2e_cloud_experimental_phase_enabled(); introduced E2E_CLOUD_EXPERIMENTAL_ONLY_TAGS/_SKIP_TAGS gating; most phases now conditionally skip with tag-aware messages; replaced direct pre/post teardown with calls to new cleanup.sh (Phase0 no-verify, Phase6 with --verify).
Cleanup script
test/e2e/e2e-cloud-experimental/cleanup.sh
New teardown CLI: resolves SANDBOX_NAME, supports --verify; best-effort destroy via nemoclaw/openshell (non-fatal); --verify asserts sandbox/gateway absence when tools present; exits 0 on destroy or successful verify, 1 on verify failure.
Docs checker
test/e2e/e2e-cloud-experimental/check-docs.sh
Added ignore controls CHECK_DOC_LINKS_IGNORE_EXTRA and CHECK_DOC_LINKS_IGNORE_URL_REGEX; URL normalization helper and built-in ignore list; remote-phase builds filtered probe list, logs skipped URLs, and reports counts for unique probed URLs.
Network probes
test/e2e/e2e-cloud-experimental/skip/05-network-policy.sh
Replaced single whitelist loop with ordered per-domain probes (PyPI, npm, Hugging Face, blocked URL); added NEMOCLAW_E2E_CURL_NOPROXY and NEMOCLAW_E2E_SKIP_NETWORK_POLICY_HUGGINGFACE; improved timeout handling and proxy-related exit-code messaging; final output reflects whether Hugging Face ran or was skipped.
Provider-configurable skips
test/e2e/e2e-cloud-experimental/skip/01-onboard-completion.sh, test/e2e/e2e-cloud-experimental/skip/04-nemoclaw-openshell-status-parity.sh
Replaced hardcoded nvidia-nim checks with CLOUD_EXPERIMENTAL_INFERENCE_PROVIDER (default nvidia-nim); updated grep matching and failure messaging to reference configured provider.
Spark smoke test
test/e2e/test-spark-install.sh
Added new end-to-end Spark install smoke test (strict mode, phased checks: OS, Docker, optional setup, install via public/local route); streams install log with tailing, fails with diagnostic excerpt on error, verifies nemoclaw/openshell availability.
Test timeouts (JS)
test/install-preflight.test.js, test/uninstall.test.js
Added 60,000ms (60s) timeout to two it(...) tests.
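The only/skip tag gating summarized for test-e2e-cloud-experimental.sh can be sketched roughly as below. The names here (phase_enabled, ONLY_TAGS, SKIP_TAGS) are placeholders, not the script's actual identifiers:

```shell
# Hypothetical sketch of only/skip tag gating. A phase runs when it is
# listed in ONLY_TAGS (or ONLY_TAGS is empty) and not listed in SKIP_TAGS.
phase_enabled() {
  tag="$1"
  if [ -n "${ONLY_TAGS:-}" ]; then
    case ",${ONLY_TAGS}," in
      *",${tag},"*) ;;   # explicitly selected
      *) return 1 ;;     # not in the only-list: skip
    esac
  fi
  case ",${SKIP_TAGS:-}," in
    *",${tag},"*) return 1 ;;   # explicitly skipped
  esac
  return 0
}
```

Comma-wrapping both the list and the tag keeps substring matches exact (so a tag "net" cannot match "network").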

Sequence Diagram(s)

mermaid
sequenceDiagram
participant GitHubWorkflow as GitHub Workflow
participant Runner as Runner
participant E2E_Script as test-e2e-cloud-experimental.sh
participant Cleanup as cleanup.sh
participant OpenTools as openshell / nemoclaw
GitHubWorkflow->>Runner: start job (timeout 90m)
Runner->>E2E_Script: run main e2e with SKIP_FINAL_CLEANUP=1, SKIP_CHECK_DOCS=1
E2E_Script->>Runner: execute enabled phases (tag-gated)
Runner->>GitHubWorkflow: main step completes
GitHubWorkflow->>Runner: always-run post-step: check-docs.sh
GitHubWorkflow->>Runner: always-run post-step: skip/05-network-policy.sh
GitHubWorkflow->>Cleanup: always-run post-step: cleanup.sh --verify
Cleanup->>OpenTools: call openshell/nemoclaw cleanup and verification
OpenTools-->>Cleanup: report success/failure
Cleanup-->>GitHubWorkflow: exit code affects workflow status

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I nibbled logs and hopped through phases bright,

Skip flags tucked in pockets, cleanup done at night,
Docs left unprobed where gentle skips were planned,
Sandboxes swept tidy by a careful paw and hand,
Hooray — the nightly run hops home, all right. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 15.38%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title accurately describes the primary change: removing the misleading environment declaration from the cloud-experimental job in the nightly workflow.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/test-e2e-cloud-experimental.sh`:
- Line 293: The workflow still uploads the old install log filename, so update
the artifact upload step to use the new INSTALL_LOG name; locate the INSTALL_LOG
variable in test/e2e/test-e2e-cloud-experimental.sh and change the corresponding
artifact path/filename in the workflow upload step (the job that uploads the
install log artifact) to /tmp/nemoclaw-e2e-cloud-experimental-install.${UID}.log
(or reference ${{ env.INSTALL_LOG }} if the workflow imports that env) so CI
preserves the correct install log on failure.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1e52d9c7-42f3-4d45-9c1a-465a5a9bab46

📥 Commits

Reviewing files that changed from the base of the PR and between f59f58e and d2aeb85.

📒 Files selected for processing (2)
  • .github/workflows/nightly-e2e.yaml
  • test/e2e/test-e2e-cloud-experimental.sh
💤 Files with no reviewable changes (1)
  • .github/workflows/nightly-e2e.yaml

mercl-lau and others added 2 commits March 30, 2026 13:57
… log path

Upload /tmp/nemoclaw-e2e-cloud-experimental-install.log on failure; script uses
the same fixed default without UID.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
test/uninstall.test.js (1)

53-62: ⚠️ Potential issue | 🟠 Major

Add timeout protection directly to spawnSync.

Vitest's 60_000 test timeout can't interrupt spawnSync since it blocks the event loop. If the uninstall script hangs, the test suite will hang indefinitely. Apply the timeout to spawnSync itself instead.

Suggested fix
       const result = spawnSync("bash", [UNINSTALL_SCRIPT, "--yes"], {
         cwd: path.join(import.meta.dirname, ".."),
         encoding: "utf-8",
+        timeout: 60_000,
+        killSignal: "SIGKILL",
         env: {
           ...process.env,
           HOME: tmp,
           PATH: `${fakeBin}:/usr/bin:/bin`,
           SCRIPT_DIR: path.join(import.meta.dirname, ".."),

Also applies to: 72-72

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/uninstall.test.js` around lines 53 - 62, The test calls
spawnSync("bash", [UNINSTALL_SCRIPT, "--yes"], {...}) which can block forever;
add a timeout option to the spawnSync options (e.g. timeout: 60_000) so the
child will be killed if it hangs, and also include a killSignal if desired (e.g.
killSignal: "SIGKILL") to ensure termination; update both spawnSync invocations
(the one using UNINSTALL_SCRIPT and the similar call at the later occurrence) to
include the timeout (and optional killSignal) in their options object.
test/install-preflight.test.js (1)

195-206: ⚠️ Potential issue | 🟠 Major

Add timeout and killSignal to the spawnSync call to enforce the test's 60-second boundary.

Vitest's per-test timeout cannot preempt a blocking spawnSync call. If the bash installer hangs, the event loop remains blocked and the test runner timeout becomes ineffective. The child process timeout must be set explicitly on spawnSync.

Suggested fix
     const result = spawnSync("bash", [INSTALLER], {
       cwd: tmp,
       encoding: "utf-8",
+      timeout: 60_000,
+      killSignal: "SIGKILL",
       env: {
         ...process.env,
         HOME: tmp,
         PATH: `${fakeBin}:${TEST_SYSTEM_PATH}`,

Also applies to: 210-210

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/install-preflight.test.js` around lines 195 - 206, The spawnSync
invocation that runs the installer in test/install-preflight.test.js needs
explicit child process timeouts: add timeout: 60000 and killSignal: 'SIGKILL'
(or 'SIGTERM') to the options object passed to spawnSync (the call using
INSTALLER, cwd: tmp, env including HOME, PATH, NEMOCLAW_NON_INTERACTIVE,
NPM_PREFIX, GIT_LOG_PATH). Update both occurrences of the spawnSync call (the
one around the current block and the second occurrence noted) so the synchronous
child will be forcibly killed after 60 seconds instead of blocking the test
runner.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/e2e-cloud-experimental/cleanup.sh`:
- Around line 50-55: The current verification treats any nonzero exit from
"openshell sandbox get" and skipped failures from "nemoclaw list" as proof the
sandbox is gone; change both checks to positively detect a "not found" response
and otherwise fail: run "openshell sandbox get $SANDBOX_NAME" capturing
stdout/stderr and exit code, and only treat it as success when the output (or
stderr) contains a canonical not-found string (e.g. "not found", "No such
sandbox", or the provider-specific message) — on any other nonzero exit or
ambiguous output, call fail/exit with an error; likewise run "nemoclaw list" and
parse its successful output for SANDBOX_NAME (treat absence as not found) but if
the list command itself errors or returns ambiguous output, treat that as a test
failure rather than success.
- Around line 28-31: The script currently treats any argument other than the
exact string "--verify" as no-ops and silently leaves VERIFY=0; change the
argument handling to explicitly reject unexpected args: check the positional
parameters (e.g., test "$#" and "$1") and if an argument is present and not
equal to "--verify" print an error to stderr and exit non-zero, otherwise set
VERIFY=1 only when "$1" == "--verify"; update the logic around the VERIFY
variable and the if condition that reads "$1" so typos or extra args cause a
failing exit instead of silently downgrading to destroy-only.

In `@test/e2e/e2e-cloud-experimental/skip/05-network-policy.sh`:
- Around line 349-355: Section A's YAML assertions need to include explicit
checks for Hugging Face hosts because the script now enforces Hugging Face
egress by default; update the assertions that currently only verify npm/PyPI
entries to also assert the expected Hugging Face hub/CDN entries are present in
policy_output. Specifically, add assertions (guarded by SKIP_HUGGINGFACE similar
to run_whitelist_huggingface_via_cli) that policy_output contains the Hugging
Face hostnames such as huggingface.co and the CDN/LFS domains used by the hub
(e.g., cdn.huggingface.co / cdn-lfs.huggingface.co or your project's canonical
HF hostnames) so the test fails if those egress rules are not explicitly
declared. Ensure these checks live alongside the existing npm/PyPI checks and
reference the same policy_output variable used in the script.

In `@test/e2e/test-spark-install.sh`:
- Around line 111-125: Wrap the background installer invocation with a bounded
timeout (use timeout or gtimeout) instead of starting curl|bash or install.sh
directly so wait "$install_pid" cannot hang; e.g., use a configurable
INSTALL_TIMEOUT env var and run either "timeout $INSTALL_TIMEOUT curl -fsSL
\"$url\" | bash" or "timeout $INSTALL_TIMEOUT bash install.sh --non-interactive"
(preserving INSTALL_LOG redirection and backgrounding), keep the tail logic for
INSTALL_LOG/tail_pid, capture the timeout exit code into install_exit, and
ensure you kill/wait the tail like existing logic; update references to
NEMOCLAW_E2E_PUBLIC_INSTALL, url, INSTALL_LOG, install_pid, tail_pid
accordingly.
- Around line 57-58: The INSTALL_LOG default uses a fixed /tmp path which can
collide across runs; update the default in test-spark-install.sh so INSTALL_LOG
is unique per process (e.g., use mktemp or include the shell $UID/$PPID/PID in
the filename) and ensure SETUP_SCRIPT and any subsequent code that tails or
writes INSTALL_LOG continue to reference the new variable name; replace the
fixed "/tmp/nemoclaw-e2e-spark-install.log" with a dynamically generated name
(mktemp or "/tmp/nemoclaw-e2e-spark-install.${UID:-$$}.log") so concurrent or
root/non-root runs won't clobber each other.
- Around line 96-107: The setup step can hang waiting for a sudo password;
update the invocation that runs the setup script in test-spark-install.sh to use
non-interactive sudo by adding the -n flag (i.e., replace the plain sudo call
that runs "$SETUP_SCRIPT" with sudo -n) so it fails fast when no password is
available; keep the existing pass/fail handling for
NEMOCLAW_E2E_SPARK_SKIP_SETUP and the SETUP_SCRIPT variable unchanged.

---

Outside diff comments:
In `@test/install-preflight.test.js`:
- Around line 195-206: The spawnSync invocation that runs the installer in
test/install-preflight.test.js needs explicit child process timeouts: add
timeout: 60000 and killSignal: 'SIGKILL' (or 'SIGTERM') to the options object
passed to spawnSync (the call using INSTALLER, cwd: tmp, env including HOME,
PATH, NEMOCLAW_NON_INTERACTIVE, NPM_PREFIX, GIT_LOG_PATH). Update both
occurrences of the spawnSync call (the one around the current block and the
second occurrence noted) so the synchronous child will be forcibly killed after
60 seconds instead of blocking the test runner.

In `@test/uninstall.test.js`:
- Around line 53-62: The test calls spawnSync("bash", [UNINSTALL_SCRIPT,
"--yes"], {...}) which can block forever; add a timeout option to the spawnSync
options (e.g. timeout: 60_000) so the child will be killed if it hangs, and also
include a killSignal if desired (e.g. killSignal: "SIGKILL") to ensure
termination; update both spawnSync invocations (the one using UNINSTALL_SCRIPT
and the similar call at the later occurrence) to include the timeout (and
optional killSignal) in their options object.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a82b76c5-5ee8-4b3e-a01f-ccbc6a9fd7c2

📥 Commits

Reviewing files that changed from the base of the PR and between f82f048 and 89988a5.

📒 Files selected for processing (10)
  • .github/workflows/nightly-e2e.yaml
  • test/e2e/e2e-cloud-experimental/check-docs.sh
  • test/e2e/e2e-cloud-experimental/cleanup.sh
  • test/e2e/e2e-cloud-experimental/skip/01-onboard-completion.sh
  • test/e2e/e2e-cloud-experimental/skip/04-nemoclaw-openshell-status-parity.sh
  • test/e2e/e2e-cloud-experimental/skip/05-network-policy.sh
  • test/e2e/test-e2e-cloud-experimental.sh
  • test/e2e/test-spark-install.sh
  • test/install-preflight.test.js
  • test/uninstall.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/nightly-e2e.yaml

Comment on lines +28 to +31
VERIFY=0
if [ "${1:-}" = "--verify" ]; then
  VERIFY=1
fi

⚠️ Potential issue | 🟠 Major

Reject unexpected args instead of silently downgrading to destroy-only.

Anything other than exact --verify currently falls through to the success path. A typo here disables the verification gate without failing CI.

Suggested fix
 VERIFY=0
-if [ "${1:-}" = "--verify" ]; then
-  VERIFY=1
-fi
+case "${1:-}" in
+  "")
+    ;;
+  --verify)
+    VERIFY=1
+    ;;
+  *)
+    printf '%s\n' "cleanup.sh: unknown option: ${1}" >&2
+    exit 2
+    ;;
+esac
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/e2e-cloud-experimental/cleanup.sh` around lines 28 - 31, The script
currently treats any argument other than the exact string "--verify" as no-ops
and silently leaves VERIFY=0; change the argument handling to explicitly reject
unexpected args: check the positional parameters (e.g., test "$#" and "$1") and
if an argument is present and not equal to "--verify" print an error to stderr
and exit non-zero, otherwise set VERIFY=1 only when "$1" == "--verify"; update
the logic around the VERIFY variable and the if condition that reads "$1" so
typos or extra args cause a failing exit instead of silently downgrading to
destroy-only.

Comment on lines +50 to +55
if command -v openshell >/dev/null 2>&1; then
  if openshell sandbox get "$SANDBOX_NAME" >/dev/null 2>&1; then
    fail "openshell sandbox get '${SANDBOX_NAME}' still succeeds after cleanup"
    exit 1
  fi
  pass "openshell: sandbox '${SANDBOX_NAME}' no longer visible to sandbox get"

⚠️ Potential issue | 🟠 Major

Don't treat CLI failures as proof the sandbox is gone.

openshell sandbox get can fail for auth/API/runtime reasons, and nemoclaw list failures are skipped entirely. In --verify mode that can report green while the sandbox still exists. Please fail unless you can positively identify a real “not found” result.

Also applies to: 60-73

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/e2e-cloud-experimental/cleanup.sh` around lines 50 - 55, The current
verification treats any nonzero exit from "openshell sandbox get" and skipped
failures from "nemoclaw list" as proof the sandbox is gone; change both checks
to positively detect a "not found" response and otherwise fail: run "openshell
sandbox get $SANDBOX_NAME" capturing stdout/stderr and exit code, and only treat
it as success when the output (or stderr) contains a canonical not-found string
(e.g. "not found", "No such sandbox", or the provider-specific message) — on any
other nonzero exit or ambiguous output, call fail/exit with an error; likewise
run "nemoclaw list" and parse its successful output for SANDBOX_NAME (treat
absence as not found) but if the list command itself errors or returns ambiguous
output, treat that as a test failure rather than success.
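The positive not-found check the reviewer asks for could look roughly like the helper below. The helper name and the matched message strings are assumptions, since openshell's real error text is not shown here:

```shell
# Hypothetical helper: succeed only when the CLI exited nonzero AND its
# output positively says the sandbox does not exist. Any other failure
# (auth, network, timeout) is treated as inconclusive, not as "gone".
sandbox_is_gone() {
  rc="$1"; out="$2"
  [ "$rc" -ne 0 ] || return 1   # command succeeded: sandbox still exists
  printf '%s' "$out" | grep -qiE 'not found|no such sandbox'
}
```

The caller would capture `out=$(openshell sandbox get "$SANDBOX_NAME" 2>&1); rc=$?` and fail the verify step unless the helper returns success.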

Comment on lines +349 to +355
run_whitelist_pypi_via_venv
run_whitelist_npm_via_cli
if [[ "$SKIP_HUGGINGFACE" == "1" ]]; then
  printf '%s\n' "05-network-policy: SKIP huggingface whitelist (NEMOCLAW_E2E_SKIP_NETWORK_POLICY_HUGGINGFACE=1)"
else
  run_whitelist_huggingface_via_cli
fi

⚠️ Potential issue | 🟠 Major

Mirror the new Hugging Face probe in the YAML assertions.

This script now enforces Hugging Face egress by default, but section A still only proves npm/PyPI hosts are declared in policy_output. That means this can pass via an overly broad rule and still claim the declared policy is correct. Please add explicit checks for the expected Hugging Face hub/CDN entries too.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/e2e-cloud-experimental/skip/05-network-policy.sh` around lines 349 -
355, Section A's YAML assertions need to include explicit checks for Hugging
Face hosts because the script now enforces Hugging Face egress by default;
update the assertions that currently only verify npm/PyPI entries to also assert
the expected Hugging Face hub/CDN entries are present in policy_output.
Specifically, add assertions (guarded by SKIP_HUGGINGFACE similar to
run_whitelist_huggingface_via_cli) that policy_output contains the Hugging Face
hostnames such as huggingface.co and the CDN/LFS domains used by the hub (e.g.,
cdn.huggingface.co / cdn-lfs.huggingface.co or your project's canonical HF
hostnames) so the test fails if those egress rules are not explicitly declared.
Ensure these checks live alongside the existing npm/PyPI checks and reference
the same policy_output variable used in the script.
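A minimal version of the suggested assertion, assuming policy_output already holds the rendered policy text; the helper name and exact hostnames are illustrative, per the review's own hedging about the project's canonical HF hosts:

```shell
# Hypothetical sketch: assert the declared policy names each expected
# egress host explicitly, mirroring the existing npm/PyPI checks.
assert_hosts_declared() {
  policy_output="$1"; shift
  for host in "$@"; do
    if ! printf '%s\n' "$policy_output" | grep -qF "$host"; then
      echo "missing egress entry: $host"
      return 1
    fi
  done
}
```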

Comment on lines +57 to +58
SETUP_SCRIPT="$REPO/scripts/setup-spark.sh"
INSTALL_LOG="${INSTALL_LOG:-/tmp/nemoclaw-e2e-spark-install.log}"

⚠️ Potential issue | 🟡 Minor

Default INSTALL_LOG can collide across runs.

The fixed /tmp/nemoclaw-e2e-spark-install.log path is shared by root/non-root and concurrent jobs, so another run can clobber the file you're tailing. Using $UID or mktemp would avoid the same collision class this PR already fixed for the cloud-experimental installer log.

Suggested fix
-INSTALL_LOG="${INSTALL_LOG:-/tmp/nemoclaw-e2e-spark-install.log}"
+INSTALL_LOG="${INSTALL_LOG:-/tmp/nemoclaw-e2e-spark-install.${UID}.log}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-spark-install.sh` around lines 57 - 58, The INSTALL_LOG default
uses a fixed /tmp path which can collide across runs; update the default in
test-spark-install.sh so INSTALL_LOG is unique per process (e.g., use mktemp or
include the shell $UID/$PPID/PID in the filename) and ensure SETUP_SCRIPT and
any subsequent code that tails or writes INSTALL_LOG continue to reference the
new variable name; replace the fixed "/tmp/nemoclaw-e2e-spark-install.log" with
a dynamically generated name (mktemp or
"/tmp/nemoclaw-e2e-spark-install.${UID:-$$}.log") so concurrent or root/non-root
runs won't clobber each other.
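Either variant from the prompt works; here is a mktemp-based sketch (portable template with trailing X's, since BSD mktemp does not accept a suffix after them):

```shell
# Hypothetical default: a unique per-run log file. mktemp creates the
# file atomically, so concurrent and root/non-root runs cannot collide.
INSTALL_LOG="${INSTALL_LOG:-$(mktemp /tmp/nemoclaw-e2e-spark-install.XXXXXX)}"
echo "$INSTALL_LOG"
```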

Comment on lines +96 to +107
if [ "${NEMOCLAW_E2E_SPARK_SKIP_SETUP:-0}" = "1" ]; then
  info "Skipping sudo setup-spark (NEMOCLAW_E2E_SPARK_SKIP_SETUP=1)"
  pass "setup-spark skipped"
else
  info "Running: sudo bash scripts/setup-spark.sh"
  if sudo bash "$SETUP_SCRIPT"; then
    pass "setup-spark completed"
  else
    fail "setup-spark failed"
    exit 1
  fi
fi

⚠️ Potential issue | 🟠 Major

Make the setup step fail fast with sudo -n.

This smoke test is explicitly non-interactive, but plain sudo can block on a password/TTY prompt. In CI that turns into a hang instead of a clear prerequisite failure.

Suggested fix
-  info "Running: sudo bash scripts/setup-spark.sh"
-  if sudo bash "$SETUP_SCRIPT"; then
+  info "Running: sudo -n bash scripts/setup-spark.sh"
+  if sudo -n bash "$SETUP_SCRIPT"; then
     pass "setup-spark completed"
   else
-    fail "setup-spark failed"
+    fail "setup-spark failed (passwordless sudo is required for this smoke test)"
     exit 1
   fi
 fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-spark-install.sh` around lines 96 - 107, The setup step can
hang waiting for a sudo password; update the invocation that runs the setup
script in test-spark-install.sh to use non-interactive sudo by adding the -n
flag (i.e., replace the plain sudo call that runs "$SETUP_SCRIPT" with sudo -n)
so it fails fast when no password is available; keep the existing pass/fail
handling for NEMOCLAW_E2E_SPARK_SKIP_SETUP and the SETUP_SCRIPT variable
unchanged.

Comment on lines +111 to +125
if [ "${NEMOCLAW_E2E_PUBLIC_INSTALL:-0}" = "1" ]; then
  url="${NEMOCLAW_INSTALL_SCRIPT_URL:-https://www.nvidia.com/nemoclaw.sh}"
  info "Running: curl -fsSL ... | bash (url=$url)"
  curl -fsSL "$url" | bash >"$INSTALL_LOG" 2>&1 &
else
  info "Running: bash install.sh --non-interactive"
  bash install.sh --non-interactive >"$INSTALL_LOG" 2>&1 &
fi
install_pid=$!
tail -f "$INSTALL_LOG" --pid=$install_pid 2>/dev/null &
tail_pid=$!
wait "$install_pid"
install_exit=$?
kill "$tail_pid" 2>/dev/null || true
wait "$tail_pid" 2>/dev/null || true

⚠️ Potential issue | 🟠 Major

Bound the background installer with a real timeout.

wait "$install_pid" will sit until the job-level timeout if curl or install.sh stalls. Please wrap the install command in timeout/gtimeout so this phase fails with a log excerpt instead of hanging the runner.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-spark-install.sh` around lines 111 - 125, Wrap the background
installer invocation with a bounded timeout (use timeout or gtimeout) instead of
starting curl|bash or install.sh directly so wait "$install_pid" cannot hang;
e.g., use a configurable INSTALL_TIMEOUT env var and run either "timeout
$INSTALL_TIMEOUT curl -fsSL \"$url\" | bash" or "timeout $INSTALL_TIMEOUT bash
install.sh --non-interactive" (preserving INSTALL_LOG redirection and
backgrounding), keep the tail logic for INSTALL_LOG/tail_pid, capture the
timeout exit code into install_exit, and ensure you kill/wait the tail like
existing logic; update references to NEMOCLAW_E2E_PUBLIC_INSTALL, url,
INSTALL_LOG, install_pid, tail_pid accordingly.
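The bounded wait the reviewer describes can be sketched as below. INSTALL_TIMEOUT and its default are assumptions, and timeout(1) here is the GNU coreutils tool (gtimeout on macOS with coreutils installed); it exits with status 124 when the deadline expires:

```shell
# Hypothetical sketch: run a command in the background under timeout(1)
# so wait cannot block past INSTALL_TIMEOUT; the child's exit status
# (124 on expiry) is what the caller inspects instead of hanging.
INSTALL_TIMEOUT="${INSTALL_TIMEOUT:-1800}"
run_bounded() {
  timeout "$INSTALL_TIMEOUT" bash -c "$1" &
  pid=$!
  wait "$pid"
  echo "$?"
}
```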


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/e2e-cloud-experimental/skip/05-network-policy.sh`:
- Around line 185-186: The pip probe currently can hit a warmed cache; modify
the python -m pip download invocation (the command that downloads idna==3.7) to
add the --no-cache-dir flag so it cannot use ~/.cache/pip, and for the Hugging
Face probe block (the HF download commands) set HF_HOME to a fresh temporary
directory (e.g., create and export a unique temp dir via mktemp -d) immediately
before those HF download commands and clean it up afterward so HF tools use an
empty cache and must perform real egress.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6a2f3029-397d-4b90-b743-5170428240a9

📥 Commits

Reviewing files that changed from the base of the PR and between 89988a5 and b20c809.

📒 Files selected for processing (1)
  • test/e2e/e2e-cloud-experimental/skip/05-network-policy.sh

Comment on lines +185 to +186
if ! python -m pip download --no-deps --disable-pip-version-check -d "$PROBE_DL" --timeout 90 idna==3.7; then
  echo "pypi whitelist: pip download idna==3.7 from PyPI failed (egress / proxy / policy)"
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

wc -l test/e2e/e2e-cloud-experimental/skip/05-network-policy.sh

Repository: NVIDIA/NemoClaw


🏁 Script executed:

cat -n test/e2e/e2e-cloud-experimental/skip/05-network-policy.sh | sed -n '180,190p'


🏁 Script executed:

cat -n test/e2e/e2e-cloud-experimental/skip/05-network-policy.sh | sed -n '275,290p'


🏁 Script executed:

cat -n test/e2e/e2e-cloud-experimental/skip/05-network-policy.sh | sed -n '300,330p'


🏁 Script executed:

# Check if there are any environment variables or flags that might already disable caching
rg -i "no-cache|cache|HF_HOME|HF_HUB_CACHE" test/e2e/e2e-cloud-experimental/skip/05-network-policy.sh


🌐 Web query:

pip download default cache behavior --no-cache-dir

💡 Result:

By default, pip download uses pip's cache (enabled since pip 6.0) to store HTTP responses and locally built wheels, avoiding redundant downloads and builds. The cache is located at ~/.cache/pip (Linux, respects XDG_CACHE_HOME), ~/Library/Caches/pip (macOS), or %LocalAppData%\pip\Cache (Windows). Use pip cache dir to find the exact path. The --no-cache-dir option disables this caching entirely, forcing fresh downloads every time. This is useful in scenarios like Docker builds to reduce image size or ensure fresh packages, but it's not recommended generally as it slows down operations by repeating downloads and builds.


🌐 Web query:

Hugging Face HF_HOME environment variable cache location

💡 Result:

The HF_HOME environment variable specifies the base directory where huggingface_hub stores local data, including tokens and various caches. It defaults to ~/.cache/huggingface (or $XDG_CACHE_HOME/huggingface if XDG_CACHE_HOME is set and HF_HOME is not). Key cache locations under HF_HOME:

  • Hub cache (models, datasets, spaces): $HF_HOME/hub (default: ~/.cache/huggingface/hub)
  • Token: $HF_HOME/token
  • Xet cache: $HF_HOME/xet
  • Assets cache: $HF_HOME/assets

Set the variable before importing huggingface_hub, e.g. export HF_HOME=/path/to/custom/dir, or in Python via os.environ['HF_HOME'] = '/path/to/custom/dir'. Note: HF_HUB_CACHE can override just the hub cache specifically, defaulting to $HF_HOME/hub.


Disable caches for PyPI and Hugging Face probes to ensure egress testing is reliable.

The pip download command reuses the default pip cache at ~/.cache/pip, and HF tools reuse the cache at ~/.cache/huggingface. Since this script runs late in the sandbox, previous operations may have warmed these caches, allowing downloads to succeed without actual egress. Add --no-cache-dir to line 185, and set HF_HOME to a fresh temporary directory before the HF download commands (lines 305–324) to force fresh network requests.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/e2e-cloud-experimental/skip/05-network-policy.sh` around lines 185 -
186, The pip probe currently can hit a warmed cache; modify the python -m pip
download invocation (the command that downloads idna==3.7) to add the
--no-cache-dir flag so it cannot use ~/.cache/pip, and for the Hugging Face
probe block (the HF download commands) set HF_HOME to a fresh temporary
directory (e.g., create and export a unique temp dir via mktemp -d) immediately
before those HF download commands and clean it up afterward so HF tools use an
empty cache and must perform real egress.
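The cache-busting change described above can be sketched like this. It is illustrative rather than the script's shipped code: PROBE_DL and the idna==3.7 pin are taken from the surrounding script, and the actual HF download probe is elided because it requires network egress (the pip command is assembled but not executed here for the same reason).

```shell
# Sketch: make the egress probes immune to warmed local caches.
PROBE_DL="${PROBE_DL:-$(mktemp -d)}"

# --no-cache-dir keeps pip from satisfying the probe out of ~/.cache/pip,
# so a successful download would prove real egress to PyPI.
pip_probe=(python -m pip download --no-deps --disable-pip-version-check
           --no-cache-dir -d "$PROBE_DL" --timeout 90 idna==3.7)

# Point HF_HOME at a fresh, empty directory so Hugging Face tools cannot
# reuse ~/.cache/huggingface and must perform a real network fetch.
hf_tmp="$(mktemp -d)"
export HF_HOME="$hf_tmp"
# ... run the HF download probes here ...
unset HF_HOME
rm -rf "$hf_tmp"            # clean up the throwaway cache afterward
```

Scoping HF_HOME to a mktemp directory (and removing it afterward) also keeps repeated nightly runs from influencing each other through a shared cache.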

@cv cv merged commit 5ac8313 into NVIDIA:main Mar 30, 2026
8 checks passed
@wscurran wscurran added the CI/CD Use this label to identify issues with NemoClaw CI/CD pipeline or GitHub Actions. label Mar 30, 2026
3 participants