ci(nightly-e2e): remove misleading environment from cloud-experimental job #1085
cv merged 8 commits into NVIDIA:main
Conversation
…ental job: avoid spurious deployment records; secrets still come from `secrets.NVIDIA_API_KEY`. Made-with: Cursor
📝 Walkthrough

E2E workflow and multiple test scripts were updated: added phase-tag gating, delegated cleanup to a new teardown script, improved docs-link skipping, expanded network-policy probes, made inference-provider checks configurable, added a Spark install smoke test, and increased a few test timeouts; the workflow now skips embedded cleanup/docs and runs always-on teardown steps.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (warning)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/test-e2e-cloud-experimental.sh`:
- Line 293: The workflow still uploads the old install log filename, so update
the artifact upload step to use the new INSTALL_LOG name; locate the INSTALL_LOG
variable in test/e2e/test-e2e-cloud-experimental.sh and change the corresponding
artifact path/filename in the workflow upload step (the job that uploads the
install log artifact) to /tmp/nemoclaw-e2e-cloud-experimental-install.${UID}.log
(or reference ${{ env.INSTALL_LOG }} if the workflow imports that env) so CI
preserves the correct install log on failure.
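As a sketch only (the job, step, and artifact names here are illustrative, not taken from the actual workflow), the "reference `${{ env.INSTALL_LOG }}`" option could look like this:

```yaml
# Illustrative sketch: define INSTALL_LOG once at the job level so the test
# script (assuming it honors an INSTALL_LOG override) and the artifact upload
# step cannot drift apart. github.run_id stands in for a per-run suffix here.
jobs:
  cloud-experimental-e2e:
    runs-on: ubuntu-latest
    env:
      INSTALL_LOG: /tmp/nemoclaw-e2e-cloud-experimental-install.${{ github.run_id }}.log
    steps:
      - name: Run cloud-experimental e2e
        run: bash test/e2e/test-e2e-cloud-experimental.sh
      - name: Upload install log on failure
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: cloud-experimental-install-log
          path: ${{ env.INSTALL_LOG }}
```

Whether the script honors an externally supplied `INSTALL_LOG` is an assumption; if it only uses its internal `${UID}` default, the upload path has to be kept in sync with the script by hand instead.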
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1e52d9c7-42f3-4d45-9c1a-465a5a9bab46
📒 Files selected for processing (2)
- .github/workflows/nightly-e2e.yaml
- test/e2e/test-e2e-cloud-experimental.sh
💤 Files with no reviewable changes (1)
- .github/workflows/nightly-e2e.yaml
… log path: upload /tmp/nemoclaw-e2e-cloud-experimental-install.log on failure; the script uses the same fixed default without UID.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
test/uninstall.test.js (1)
53-62: ⚠️ Potential issue | 🟠 Major

Add timeout protection directly to `spawnSync`. Vitest's 60_000 test timeout can't interrupt `spawnSync` since it blocks the event loop. If the uninstall script hangs, the test suite will hang indefinitely. Apply the timeout to `spawnSync` itself instead.

Suggested fix

```diff
   const result = spawnSync("bash", [UNINSTALL_SCRIPT, "--yes"], {
     cwd: path.join(import.meta.dirname, ".."),
     encoding: "utf-8",
+    timeout: 60_000,
+    killSignal: "SIGKILL",
     env: {
       ...process.env,
       HOME: tmp,
       PATH: `${fakeBin}:/usr/bin:/bin`,
       SCRIPT_DIR: path.join(import.meta.dirname, ".."),
```

Also applies to: 72-72
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@test/uninstall.test.js` around lines 53-62: The test calls spawnSync("bash", [UNINSTALL_SCRIPT, "--yes"], {...}) which can block forever; add a timeout option to the spawnSync options (e.g. timeout: 60_000) so the child will be killed if it hangs, and also include a killSignal if desired (e.g. killSignal: "SIGKILL") to ensure termination; update both spawnSync invocations (the one using UNINSTALL_SCRIPT and the similar call at the later occurrence) to include the timeout (and optional killSignal) in their options object.

test/install-preflight.test.js (1)
195-206: ⚠️ Potential issue | 🟠 Major

Add `timeout` and `killSignal` to the `spawnSync` call to enforce the test's 60-second boundary. Vitest's per-test timeout cannot preempt a blocking `spawnSync` call. If the bash installer hangs, the event loop remains blocked and the test runner timeout becomes ineffective. The child process timeout must be set explicitly on `spawnSync`.

Suggested fix

```diff
   const result = spawnSync("bash", [INSTALLER], {
     cwd: tmp,
     encoding: "utf-8",
+    timeout: 60_000,
+    killSignal: "SIGKILL",
     env: {
       ...process.env,
       HOME: tmp,
       PATH: `${fakeBin}:${TEST_SYSTEM_PATH}`,
```

Also applies to: 210-210
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/install-preflight.test.js` around lines 195 - 206, The spawnSync invocation that runs the installer in test/install-preflight.test.js needs explicit child process timeouts: add timeout: 60000 and killSignal: 'SIGKILL' (or 'SIGTERM') to the options object passed to spawnSync (the call using INSTALLER, cwd: tmp, env including HOME, PATH, NEMOCLAW_NON_INTERACTIVE, NPM_PREFIX, GIT_LOG_PATH). Update both occurrences of the spawnSync call (the one around the current block and the second occurrence noted) so the synchronous child will be forcibly killed after 60 seconds instead of blocking the test runner.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/e2e-cloud-experimental/cleanup.sh`:
- Around line 50-55: The current verification treats any nonzero exit from
"openshell sandbox get" and skipped failures from "nemoclaw list" as proof the
sandbox is gone; change both checks to positively detect a "not found" response
and otherwise fail: run "openshell sandbox get $SANDBOX_NAME" capturing
stdout/stderr and exit code, and only treat it as success when the output (or
stderr) contains a canonical not-found string (e.g. "not found", "No such
sandbox", or the provider-specific message) — on any other nonzero exit or
ambiguous output, call fail/exit with an error; likewise run "nemoclaw list" and
parse its successful output for SANDBOX_NAME (treat absence as not found) but if
the list command itself errors or returns ambiguous output, treat that as a test
failure rather than success.
- Around line 28-31: The script currently treats any argument other than the
exact string "--verify" as no-ops and silently leaves VERIFY=0; change the
argument handling to explicitly reject unexpected args: check the positional
parameters (e.g., test "$#" and "$1") and if an argument is present and not
equal to "--verify" print an error to stderr and exit non-zero, otherwise set
VERIFY=1 only when "$1" == "--verify"; update the logic around the VERIFY
variable and the if condition that reads "$1" so typos or extra args cause a
failing exit instead of silently downgrading to destroy-only.
In `@test/e2e/e2e-cloud-experimental/skip/05-network-policy.sh`:
- Around line 349-355: Section A's YAML assertions need to include explicit
checks for Hugging Face hosts because the script now enforces Hugging Face
egress by default; update the assertions that currently only verify npm/PyPI
entries to also assert the expected Hugging Face hub/CDN entries are present in
policy_output. Specifically, add assertions (guarded by SKIP_HUGGINGFACE similar
to run_whitelist_huggingface_via_cli) that policy_output contains the Hugging
Face hostnames such as huggingface.co and the CDN/LFS domains used by the hub
(e.g., cdn.huggingface.co / cdn-lfs.huggingface.co or your project's canonical
HF hostnames) so the test fails if those egress rules are not explicitly
declared. Ensure these checks live alongside the existing npm/PyPI checks and
reference the same policy_output variable used in the script.
In `@test/e2e/test-spark-install.sh`:
- Around line 111-125: Wrap the background installer invocation with a bounded
timeout (use timeout or gtimeout) instead of starting curl|bash or install.sh
directly so wait "$install_pid" cannot hang; e.g., use a configurable
INSTALL_TIMEOUT env var and run either "timeout $INSTALL_TIMEOUT curl -fsSL
\"$url\" | bash" or "timeout $INSTALL_TIMEOUT bash install.sh --non-interactive"
(preserving INSTALL_LOG redirection and backgrounding), keep the tail logic for
INSTALL_LOG/tail_pid, capture the timeout exit code into install_exit, and
ensure you kill/wait the tail like existing logic; update references to
NEMOCLAW_E2E_PUBLIC_INSTALL, url, INSTALL_LOG, install_pid, tail_pid
accordingly.
- Around line 57-58: The INSTALL_LOG default uses a fixed /tmp path which can
collide across runs; update the default in test-spark-install.sh so INSTALL_LOG
is unique per process (e.g., use mktemp or include the shell $UID/$PPID/PID in
the filename) and ensure SETUP_SCRIPT and any subsequent code that tails or
writes INSTALL_LOG continue to reference the new variable name; replace the
fixed "/tmp/nemoclaw-e2e-spark-install.log" with a dynamically generated name
(mktemp or "/tmp/nemoclaw-e2e-spark-install.${UID:-$$}.log") so concurrent or
root/non-root runs won't clobber each other.
- Around line 96-107: The setup step can hang waiting for a sudo password;
update the invocation that runs the setup script in test-spark-install.sh to use
non-interactive sudo by adding the -n flag (i.e., replace the plain sudo call
that runs "$SETUP_SCRIPT" with sudo -n) so it fails fast when no password is
available; keep the existing pass/fail handling for
NEMOCLAW_E2E_SPARK_SKIP_SETUP and the SETUP_SCRIPT variable unchanged.
Run ID: a82b76c5-5ee8-4b3e-a01f-ccbc6a9fd7c2
📒 Files selected for processing (10)
- .github/workflows/nightly-e2e.yaml
- test/e2e/e2e-cloud-experimental/check-docs.sh
- test/e2e/e2e-cloud-experimental/cleanup.sh
- test/e2e/e2e-cloud-experimental/skip/01-onboard-completion.sh
- test/e2e/e2e-cloud-experimental/skip/04-nemoclaw-openshell-status-parity.sh
- test/e2e/e2e-cloud-experimental/skip/05-network-policy.sh
- test/e2e/test-e2e-cloud-experimental.sh
- test/e2e/test-spark-install.sh
- test/install-preflight.test.js
- test/uninstall.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
- .github/workflows/nightly-e2e.yaml
```bash
VERIFY=0
if [ "${1:-}" = "--verify" ]; then
  VERIFY=1
fi
```
Reject unexpected args instead of silently downgrading to destroy-only.
Anything other than exact --verify currently falls through to the success path. A typo here disables the verification gate without failing CI.
Suggested fix

```diff
 VERIFY=0
-if [ "${1:-}" = "--verify" ]; then
-  VERIFY=1
-fi
+case "${1:-}" in
+  "")
+    ;;
+  --verify)
+    VERIFY=1
+    ;;
+  *)
+    printf '%s\n' "cleanup.sh: unknown option: ${1}" >&2
+    exit 2
+    ;;
+esac
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```bash
VERIFY=0
case "${1:-}" in
  "")
    ;;
  --verify)
    VERIFY=1
    ;;
  *)
    printf '%s\n' "cleanup.sh: unknown option: ${1}" >&2
    exit 2
    ;;
esac
```
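The strict-parsing pattern can be exercised standalone; in this minimal sketch the function name and messages are illustrative, only the accepted flags mirror the review:

```shell
#!/usr/bin/env bash
# Demo: only "" (destroy-only) or --verify are accepted; anything else errors.
set -u

parse_args() {
  VERIFY=0
  case "${1:-}" in
    "") ;;                 # no argument: destroy-only mode
    --verify) VERIFY=1 ;;  # exact flag enables the verification gate
    *)
      printf '%s\n' "cleanup.sh: unknown option: ${1}" >&2
      return 2
      ;;
  esac
}

parse_args --verify && echo "verify=$VERIFY"
parse_args --verfy 2>/dev/null || echo "typo rejected with status $?"
```

Running it prints `verify=1` followed by `typo rejected with status 2`: a misspelled flag fails loudly instead of silently downgrading the run.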
```bash
if command -v openshell >/dev/null 2>&1; then
  if openshell sandbox get "$SANDBOX_NAME" >/dev/null 2>&1; then
    fail "openshell sandbox get '${SANDBOX_NAME}' still succeeds after cleanup"
    exit 1
  fi
  pass "openshell: sandbox '${SANDBOX_NAME}' no longer visible to sandbox get"
```
Don't treat CLI failures as proof the sandbox is gone.
openshell sandbox get can fail for auth/API/runtime reasons, and nemoclaw list failures are skipped entirely. In --verify mode that can report green while the sandbox still exists. Please fail unless you can positively identify a real “not found” result.
Also applies to: 60-73
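The positive-detection idea can be sketched as follows; `sandbox_get` is a stub standing in for `openshell sandbox get`, and the not-found strings are examples, not the provider's actual messages:

```shell
#!/usr/bin/env bash
# Only a recognized "not found" message counts as proof the sandbox is gone;
# a success or an unrecognized error both fail the verification.
set -u

sandbox_get() {  # stub for: openshell sandbox get "$1"
  printf '%s\n' "Error: sandbox '$1' not found" >&2
  return 1
}

verify_gone() {
  local name="$1"
  local out rc
  out=$(sandbox_get "$name" 2>&1)
  rc=$?
  if [ "$rc" -eq 0 ]; then
    echo "FAIL: sandbox '${name}' still exists"
    return 1
  fi
  case "$out" in
    *"not found"*|*"No such sandbox"*)
      echo "PASS: sandbox '${name}' positively not found" ;;
    *)
      echo "FAIL: ambiguous error (rc=${rc}): ${out}"
      return 1 ;;
  esac
}

verify_gone demo-sandbox
```

The key design point is that the nonzero exit code alone never reaches the PASS branch: an auth or API failure lands in the ambiguous branch and fails the check.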
```bash
run_whitelist_pypi_via_venv
run_whitelist_npm_via_cli
if [[ "$SKIP_HUGGINGFACE" == "1" ]]; then
  printf '%s\n' "05-network-policy: SKIP huggingface whitelist (NEMOCLAW_E2E_SKIP_NETWORK_POLICY_HUGGINGFACE=1)"
else
  run_whitelist_huggingface_via_cli
fi
```
Mirror the new Hugging Face probe in the YAML assertions.
This script now enforces Hugging Face egress by default, but section A still only proves npm/PyPI hosts are declared in policy_output. That means this can pass via an overly broad rule and still claim the declared policy is correct. Please add explicit checks for the expected Hugging Face hub/CDN entries too.
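A grep-based shape for those assertions might look like this; the policy snippet and hostnames are illustrative (the project's canonical HF hosts may differ):

```shell
#!/usr/bin/env bash
# Assert that each expected egress host is explicitly declared in the policy
# output, so an overly broad rule cannot make the check pass by accident.
set -u

policy_output=$(cat <<'EOF'
egress:
  - registry.npmjs.org
  - pypi.org
  - files.pythonhosted.org
  - huggingface.co
  - cdn-lfs.huggingface.co
EOF
)

missing=0
for host in registry.npmjs.org pypi.org huggingface.co cdn-lfs.huggingface.co; do
  if printf '%s\n' "$policy_output" | grep -qF -- "$host"; then
    echo "declared: $host"
  else
    echo "MISSING: $host" >&2
    missing=1
  fi
done
echo "missing=$missing"
```

In the real script the loop would be guarded by the same `SKIP_HUGGINGFACE` switch used for the CLI probe, and `policy_output` would come from the already-captured policy YAML rather than a heredoc.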
```bash
SETUP_SCRIPT="$REPO/scripts/setup-spark.sh"
INSTALL_LOG="${INSTALL_LOG:-/tmp/nemoclaw-e2e-spark-install.log}"
```
Default INSTALL_LOG can collide across runs.
The fixed /tmp/nemoclaw-e2e-spark-install.log path is shared by root/non-root and concurrent jobs, so another run can clobber the file you're tailing. Using $UID or mktemp would avoid the same collision class this PR already fixed for the cloud-experimental installer log.
Suggested fix

```diff
-INSTALL_LOG="${INSTALL_LOG:-/tmp/nemoclaw-e2e-spark-install.log}"
+INSTALL_LOG="${INSTALL_LOG:-/tmp/nemoclaw-e2e-spark-install.${UID}.log}"
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```bash
SETUP_SCRIPT="$REPO/scripts/setup-spark.sh"
INSTALL_LOG="${INSTALL_LOG:-/tmp/nemoclaw-e2e-spark-install.${UID}.log}"
```
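The per-process default can be demonstrated in isolation; the filename prefix matches the review, everything else is a sketch:

```shell
#!/usr/bin/env bash
# Unique-per-process log default: ${UID:-$$} keeps root, non-root, and
# concurrent runs from sharing (and clobbering) one fixed /tmp file.
set -u

INSTALL_LOG="${INSTALL_LOG:-/tmp/nemoclaw-e2e-spark-install.${UID:-$$}.log}"
: >"$INSTALL_LOG"                     # create/truncate our own log file
echo "install output" >>"$INSTALL_LOG"
echo "logging to: $INSTALL_LOG"
```

`mktemp` is the other option the review mentions; it guarantees uniqueness even across runs with the same UID, at the cost of a less predictable path for any artifact upload step that needs to find the log later.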
```bash
if [ "${NEMOCLAW_E2E_SPARK_SKIP_SETUP:-0}" = "1" ]; then
  info "Skipping sudo setup-spark (NEMOCLAW_E2E_SPARK_SKIP_SETUP=1)"
  pass "setup-spark skipped"
else
  info "Running: sudo bash scripts/setup-spark.sh"
  if sudo bash "$SETUP_SCRIPT"; then
    pass "setup-spark completed"
  else
    fail "setup-spark failed"
    exit 1
  fi
fi
```
Make the setup step fail fast with sudo -n.
This smoke test is explicitly non-interactive, but plain sudo can block on a password/TTY prompt. In CI that turns into a hang instead of a clear prerequisite failure.
Suggested fix
```diff
-  info "Running: sudo bash scripts/setup-spark.sh"
-  if sudo bash "$SETUP_SCRIPT"; then
+  info "Running: sudo -n bash scripts/setup-spark.sh"
+  if sudo -n bash "$SETUP_SCRIPT"; then
     pass "setup-spark completed"
   else
-    fail "setup-spark failed"
+    fail "setup-spark failed (passwordless sudo is required for this smoke test)"
     exit 1
   fi
 fi
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```bash
if [ "${NEMOCLAW_E2E_SPARK_SKIP_SETUP:-0}" = "1" ]; then
  info "Skipping sudo setup-spark (NEMOCLAW_E2E_SPARK_SKIP_SETUP=1)"
  pass "setup-spark skipped"
else
  info "Running: sudo -n bash scripts/setup-spark.sh"
  if sudo -n bash "$SETUP_SCRIPT"; then
    pass "setup-spark completed"
  else
    fail "setup-spark failed (passwordless sudo is required for this smoke test)"
    exit 1
  fi
fi
```
```bash
if [ "${NEMOCLAW_E2E_PUBLIC_INSTALL:-0}" = "1" ]; then
  url="${NEMOCLAW_INSTALL_SCRIPT_URL:-https://www.nvidia.com/nemoclaw.sh}"
  info "Running: curl -fsSL ... | bash (url=$url)"
  curl -fsSL "$url" | bash >"$INSTALL_LOG" 2>&1 &
else
  info "Running: bash install.sh --non-interactive"
  bash install.sh --non-interactive >"$INSTALL_LOG" 2>&1 &
fi
install_pid=$!
tail -f "$INSTALL_LOG" --pid=$install_pid 2>/dev/null &
tail_pid=$!
wait "$install_pid"
install_exit=$?
kill "$tail_pid" 2>/dev/null || true
wait "$tail_pid" 2>/dev/null || true
```
Bound the background installer with a real timeout.
wait "$install_pid" will sit until the job-level timeout if curl or install.sh stalls. Please wrap the install command in timeout/gtimeout so this phase fails with a log excerpt instead of hanging the runner.
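A bounded version of that launch/tail/wait sequence could look like the sketch below; the `sleep 60` stands in for a stalled installer, and the 2-second `INSTALL_TIMEOUT` is only for the demo (a real run would use something much larger). GNU tail's `--pid`, as in the original script, is assumed.

```shell
#!/usr/bin/env bash
# Wrap the backgrounded installer in coreutils `timeout` so `wait` is bounded;
# timeout reports expiry with exit code 124.
set -u

INSTALL_TIMEOUT="${INSTALL_TIMEOUT:-2}"
INSTALL_LOG=$(mktemp)

timeout "$INSTALL_TIMEOUT" bash -c 'echo starting; exec sleep 60' \
  >"$INSTALL_LOG" 2>&1 &
install_pid=$!

tail -f "$INSTALL_LOG" --pid="$install_pid" 2>/dev/null &
tail_pid=$!

install_exit=0
wait "$install_pid" || install_exit=$?
kill "$tail_pid" 2>/dev/null || true
wait "$tail_pid" 2>/dev/null || true

if [ "$install_exit" -eq 124 ]; then
  echo "install timed out after ${INSTALL_TIMEOUT}s; last log lines:"
  tail -n 5 "$INSTALL_LOG"
else
  echo "install exited with $install_exit"
fi
```

With this shape a hung `curl | bash` surfaces as a clear timeout plus a log excerpt instead of consuming the whole job-level timeout.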
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/e2e-cloud-experimental/skip/05-network-policy.sh`:
- Around line 185-186: The pip probe currently can hit a warmed cache; modify
the python -m pip download invocation (the command that downloads idna==3.7) to
add the --no-cache-dir flag so it cannot use ~/.cache/pip, and for the Hugging
Face probe block (the HF download commands) set HF_HOME to a fresh temporary
directory (e.g., create and export a unique temp dir via mktemp -d) immediately
before those HF download commands and clean it up afterward so HF tools use an
empty cache and must perform real egress.
Run ID: 6a2f3029-397d-4b90-b743-5170428240a9
📒 Files selected for processing (1)
test/e2e/e2e-cloud-experimental/skip/05-network-policy.sh
```bash
if ! python -m pip download --no-deps --disable-pip-version-check -d "$PROBE_DL" --timeout 90 idna==3.7; then
  echo "pypi whitelist: pip download idna==3.7 from PyPI failed (egress / proxy / policy)"
```
🧩 Analysis chain

🏁 Script executed:

    wc -l test/e2e/e2e-cloud-experimental/skip/05-network-policy.sh

Repository: NVIDIA/NemoClaw
Length of output: 120

🏁 Script executed:

    cat -n test/e2e/e2e-cloud-experimental/skip/05-network-policy.sh | sed -n '180,190p'

Repository: NVIDIA/NemoClaw
Length of output: 518

🏁 Script executed:

    cat -n test/e2e/e2e-cloud-experimental/skip/05-network-policy.sh | sed -n '275,290p'

Repository: NVIDIA/NemoClaw
Length of output: 631

🏁 Script executed:

    cat -n test/e2e/e2e-cloud-experimental/skip/05-network-policy.sh | sed -n '300,330p'

Repository: NVIDIA/NemoClaw
Length of output: 1456

🏁 Script executed:

    # Check if there are any environment variables or flags that might already disable caching
    rg -i "no-cache|cache|HF_HOME|HF_HUB_CACHE" test/e2e/e2e-cloud-experimental/skip/05-network-policy.sh

Repository: NVIDIA/NemoClaw
Length of output: 41
🌐 Web query:
pip download default cache behavior --no-cache-dir
💡 Result:
By default, pip download uses pip's cache (enabled since pip 6.0) to store HTTP responses and locally built wheels, avoiding redundant downloads and builds. The cache is located at ~/.cache/pip (Linux, respects XDG_CACHE_HOME), ~/Library/Caches/pip (macOS), or %LocalAppData%\pip\Cache (Windows). Use pip cache dir to find the exact path. The --no-cache-dir option disables this caching entirely, forcing fresh downloads every time. This is useful in scenarios like Docker builds to reduce image size or ensure fresh packages, but it's not recommended generally as it slows down operations by repeating downloads and builds.
Citations:
- 1: https://pip.pypa.io/en/stable/topics/caching/
- 2: https://pip.pypa.io/en/stable/topics/caching.html
- 3: https://pip.pypa.io/en/stable/cli/pip/?highlight=--no-cache-dir
🌐 Web query:
Hugging Face HF_HOME environment variable cache location
💡 Result:
The HF_HOME environment variable specifies the base directory where huggingface_hub stores local data, including tokens and various caches. It defaults to ~/.cache/huggingface (or $XDG_CACHE_HOME/huggingface if XDG_CACHE_HOME is set and HF_HOME is not). Key cache locations under HF_HOME: - Hub cache (models, datasets, spaces): $HF_HOME/hub (default: ~/.cache/huggingface/hub) - Token: $HF_HOME/token - Xet cache: $HF_HOME/xet - Assets cache: $HF_HOME/assets To use it, set the environment variable before importing huggingface_hub, e.g.: export HF_HOME=/path/to/custom/dir or in Python: import os os.environ['HF_HOME'] = '/path/to/custom/dir' Note: HF_HUB_CACHE can override just the hub cache specifically, defaulting to $HF_HOME/hub.
Citations:
- 1: https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables
- 2: https://huggingface.co/docs/huggingface_hub/main/package_reference/environment_variables
- 3: https://huggingface.co/docs/huggingface_hub/v0.25.0/guides/manage-cache
- 4: https://huggingface.co/docs/hub/main/local-cache
- 5: https://huggingface.co/docs/datasets/en/cache
Disable caches for PyPI and Hugging Face probes to ensure egress testing is reliable.
The pip download command reuses the default pip cache at ~/.cache/pip, and HF tools reuse the cache at ~/.cache/huggingface. Since this script runs late in the sandbox, previous operations may have warmed these caches, allowing downloads to succeed without actual egress. Add --no-cache-dir to line 185, and set HF_HOME to a fresh temporary directory before the HF download commands (lines 305–324) to force fresh network requests.
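The two cache-busting changes can be sketched together; the commands are printed rather than executed so the sketch stays offline, and `PROBE_DL` mirrors the variable name from the script:

```shell
#!/usr/bin/env bash
# pip: --no-cache-dir bypasses ~/.cache/pip; Hugging Face: a throwaway HF_HOME
# gives hub tooling an empty cache tree. Either way, a successful download
# implies real egress instead of a warmed cache.
set -u

PROBE_DL=$(mktemp -d)

# pip probe: note the added --no-cache-dir flag
pip_cmd="python -m pip download --no-cache-dir --no-deps --disable-pip-version-check -d $PROBE_DL --timeout 90 idna==3.7"
echo "pip probe: $pip_cmd"

# Hugging Face probe: fresh, empty cache tree for the duration of the probe
export HF_HOME="$(mktemp -d)"
echo "HF probe cache: $HF_HOME"
```

In the real script the `HF_HOME` export would wrap only the HF download block (and the temp directory would be removed afterwards), so later steps still see the normal cache.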
Summary

Remove the `environment: NVIDIA_API_KEY` binding from the `cloud-experimental-e2e` job in the nightly workflow. GitHub treats `environment` as a deployment target, which created a confusing Deployments page whose name looked like a secret. The job already receives the API key via `env` and the repository secret `secrets.NVIDIA_API_KEY`, matching the `cloud-e2e` job.

Also default the cloud-experimental Phase 3 install log path to include `${UID}` so root and non-root runs do not share the same file.

Related Issue

N/A (internal CI clarity / follow-up from team discussion).

Changes

- .github/workflows/nightly-e2e.yaml: drop `environment: NVIDIA_API_KEY` from `cloud-experimental-e2e`.
- test/e2e/test-e2e-cloud-experimental.sh: default `INSTALL_LOG` uses `/tmp/nemoclaw-e2e-cloud-experimental-install.${UID}.log`; comment updated accordingly.

Type of Change

Testing

- `npx prek run --all-files` passes (or equivalently `make check`).
- `npm test` passes.
- `make docs` builds without warnings. (for doc-only changes)

Checklist

General

Code Changes

- `npx prek run --all-files` auto-fixes formatting (or `make format` for targeted runs).

Doc Changes

N/A — no doc-only section required for this PR.
Summary by CodeRabbit
New Features
Tests
Chores