fix(cli): restore logs streaming and reboot recovery ux by kjw3 · Pull Request #1187 · NVIDIA/NemoClaw

kjw3 · 2026-03-31T14:20:03Z

Summary

restore nemoclaw <name> logs --follow by mapping the user-facing --follow flag to the OpenShell --tail flag
add compatibility guidance when the installed OpenShell CLI is too old for NemoClaw log streaming
recover local sandbox inventory from the last onboard session so sandboxes do not appear to disappear after a reboot
import additional live sandbox names from the current OpenShell gateway during list recovery
make nemoclaw <sandbox-name> connect recover missing local registry state directly instead of introducing a separate top-level reconnect command

This absorbs the useful parts of #1148 and the recovery ideas from #960, but keeps the product surface on the command users already expect to reach for after a reboot: nemoclaw <name> connect.

Issues

Fixes #1146
Fixes #1165
Addresses #990
Addresses #1154

Why this shape

#1148 is the right fix for the logs regression, but too narrow by itself
reboot recovery is more than a connect alias; NemoClaw has to rebuild local inventory before connect can succeed consistently
users will try nemoclaw <name> connect first, so recovery belongs on that path rather than behind a new top-level command
list now repairs stale or missing local sandbox inventory so the user sees recoverable state instead of an empty registry after reboot

Changes

Logs

map nemoclaw logs --follow to openshell logs --tail
keep --follow as the NemoClaw user-facing flag
print directed upgrade guidance for older or incompatible OpenShell log subcommands instead of leaving only raw parser failures
preserve normal SIGINT exit semantics for logs --follow

Reboot recovery

recover sandbox inventory from ~/.nemoclaw/onboard-session.json when local registry is missing or stale
import additional live sandbox names from the active or recovered OpenShell gateway during list recovery
preserve policy presets and partial registries during recovery instead of flattening them to an empty/default state
keep registry writes serialized through the existing advisory lock plus atomic rename path in bin/lib/registry.js

Connect UX

remove the top-level nemoclaw reconnect command
make nemoclaw <sandbox-name> connect attempt recovery before falling into the unknown-command path
reuse the existing sandboxConnect() flow after recovery instead of introducing a second connection path

Cross-platform validation

mac

branch/commit: fix/logs-follow-reconnect-reboot / 5191d0d
commands:
- /Users/kejones/Git/nvidia/NemoClaw-2/node_modules/.bin/vitest run test/cli.test.js -t "connect"
result: passed (6 connect-focused tests passed, 26 skipped)

brev-cpu (`kj-nemoclaw-cpu-test`)

branch/commit: origin/fix/logs-follow-reconnect-reboot / 5191d0d
node runtime used explicitly: /home/ubuntu/.nvm/versions/node/v22.22.2/bin/node
validation ran in a disposable worktree with a fake openshell shim and last-session recovery seed
result: passed
verified behavior:
- nemoclaw alpha connect recovered from missing local registry state
- recovery probed openshell sandbox list
- connect path then executed openshell sandbox get alpha
- connect path then executed openshell sandbox connect alpha

brev-gpu (`kj-nemoclaw-l40s-test`)

branch/commit: origin/fix/logs-follow-reconnect-reboot / 5191d0d
node runtime used explicitly: /home/ubuntu/.nvm/versions/node/v22.22.1/bin/node
validation ran in a disposable worktree with a fake openshell shim and last-session recovery seed
result: passed
verified behavior:
- nemoclaw alpha connect recovered from missing local registry state
- recovery probed openshell sandbox list
- connect path then executed openshell sandbox get alpha
- connect path then executed openshell sandbox connect alpha

spark (`spark-d8c8`)

branch/commit: origin/fix/logs-follow-reconnect-reboot / 5191d0d
node runtime used: v22.22.1
validation ran in a disposable worktree with a fake openshell shim and last-session recovery seed
result: passed
verified behavior:
- nemoclaw alpha connect recovered from missing local registry state
- recovery probed openshell sandbox list
- connect path then executed openshell sandbox get alpha
- connect path then executed openshell sandbox connect alpha

Credits

This branch pulls forward ideas from earlier good-faith work, but reworks them onto the current codebase and recovery model:

#1148 by senthilr-nv
- provided the correct narrow fix for the --follow -> --tail logs regression
#960 by WuKongAI-CMU (Peter)
- provided useful recovery and reconnect test ideas, even though the final product direction here keeps recovery on connect

Notes

did not merge #1148 directly; absorbed its fix and test intent here
did not keep the top-level reconnect command from #960; the final UX puts recovery on nemoclaw <name> connect

Signed-off-by: Kevin Jones kejones@nvidia.com

Summary by CodeRabbit

New Features
- OpenShell version compatibility check for logs with clear upgrade guidance
- Enhanced sandbox discovery that recovers entries from prior onboarding and live probes
- Automatic sandbox recovery during connect when a sandbox is missing locally
Bug Fixes
- Corrected logs streaming behavior for follow/non-follow modes and improved error handling
Tests
- Added CLI tests for logs follow behavior, registry recovery/persistence, connect recovery, and SIGINT handling

coderabbitai · 2026-03-31T14:20:12Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Async registry-recovery seeded from onboard session and optional OpenShell probe; OpenShell semver parsing and logs compatibility gate; --follow translated to --tail with modified spawn/stdio behavior for logs; pre-connect recovery added. Tests expanded for logs, recovery, and SIGINT behavior.

Changes

Cohort / File(s)	Summary
CLI core: logs & recovery `bin/nemoclaw.js`	Added async `recoverRegistryEntries()` to rebuild/seed `sandboxes.json` from `onboard-session.json` and optionally probe OpenShell; `list` now awaits async recovery; `connect` attempts pre-dispatch recovery when a named sandbox is missing. Added OpenShell semver parsing and `MIN_LOGS_OPENSHELL_VERSION` gating; rewrote `sandboxLogs()` to map `--follow` → `--tail`, change spawn/stdio handling for follow vs non-follow, detect incompatible OpenShell behavior and print upgrade/usage guidance or exit.
Tests: CLI dispatch & recovery `test/cli.test.js`	Expanded tests to assert `logs --follow` is forwarded as `--tail`, added non-follow logs coverage, added compatibility/error test for older OpenShell exiting with code 1 and printing upgrade guidance, added registry recovery tests for `list` and `connect` (seed from `onboard-session.json`, optional `openshell sandbox list` probe, skip invalid names), and a SIGINT semantics test for `logs --follow`.
Test harness imports `test/...`	Added `spawnSync` import from `node:child_process` where needed to support signal/exit assertions in tests.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as "nemoclaw CLI"
    participant Reg as "Local Registry\n(sandboxes.json)"
    participant Session as "onboard-session.json"
    participant GW as "OpenShell (openshell)"

    CLI->>Reg: Load registry
    alt registry incomplete
        CLI->>Session: Read last onboard session
        Session-->>CLI: Return last onboarded sandbox info
        CLI->>Reg: Seed missing sandbox metadata
        CLI->>GW: Optional: `openshell sandbox list`
        GW-->>CLI: Return live sandbox list
        CLI->>Reg: Upsert gateway entries & update default
        Reg-->>CLI: Persist recovered registry
    end
    CLI->>CLI: Parse `logs <name> --follow`
    CLI->>CLI: Check OpenShell version (semver)
    alt supports --tail
        CLI->>GW: Run `openshell logs <name> --tail` (stream / spawn handling)
        GW-->>CLI: Stream output / exit
    else incompatible
        CLI-->>User: Print compatibility guidance and exit(1)
    end
    CLI->>CLI: Pre-dispatch recovery for `connect <name>` if missing
    CLI->>GW: `openshell sandbox connect <name>`
    GW-->>CLI: Connection result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I stitched lost sandboxes from yesterday's file,

swapped your follow for a tail with a smile.
I probed the gateway, coaxed registries back,
handled SIGINT so your streams won't crack.
Hop in — recovered logs are ready to track.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 4.55% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	Title concisely summarizes the main changes: restoring logs streaming functionality and improving recovery UX, which aligns with the primary objectives of fixing --follow flag handling and implementing registry recovery.
Linked Issues check	✅ Passed	The PR fully addresses all coding requirements from `#1146` and `#1165`: maps --follow to --tail [`#1146`,`#1165`], maintains kubectl-style UX [`#1165`], provides compatibility guidance [`#1165`], updates tests to assert --tail translation [`#1146`], implements registry recovery for list/connect [PR objectives], and preserves SIGINT semantics [PR objectives].
Out of Scope Changes check	✅ Passed	All changes are in-scope and aligned with objectives: OpenShell version compatibility handling, --follow to --tail translation, registry recovery from onboard-session.json, connect pre-dispatch recovery, and comprehensive test coverage directly support the stated PR goals.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/logs-follow-reconnect-reboot

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

test/cli.test.js (1)

107-148: ⚠️ Potential issue | 🟡 Minor

Also assert that --follow never reaches OpenShell.

This still passes for openshell logs alpha --follow --tail, which is the exact compatibility regression this change is trying to prevent.

🧪 Suggested assertion

     const r = runWithEnv("alpha logs --follow", {
       HOME: home,
       PATH: `${localBin}:${process.env.PATH || ""}`,
     });

     expect(r.code).toBe(0);
-    expect(fs.readFileSync(markerFile, "utf8")).toContain("logs alpha --tail");
+    const recordedArgs = fs.readFileSync(markerFile, "utf8");
+    expect(recordedArgs).toContain("logs alpha --tail");
+    expect(recordedArgs).not.toContain("--follow");
   });

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/cli.test.js` around lines 107 - 148, Update the test "maps --follow to
openshell --tail" to also assert that the spawned openshell never receives the
original --follow flag: after reading markerFile (written by the fake openshell
script) add an assertion such as expect(fs.readFileSync(markerFile,
"utf8")).not.toContain("--follow") (or assert equality to the exact expected
args string like "logs alpha --tail") so runWithEnv/openshell invocation does
not include "--follow".

🧹 Nitpick comments (1)

test/cli.test.js (1)

354-528: Add a partial-registry recovery case.

These tests only start from an empty sandboxes.json. The current recovery bug shows up when one local sandbox already exists and list is supposed to merge the session sandbox or extra live names, so this suite will not catch that regression yet.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/cli.test.js` around lines 354 - 528, Add a partial-registry recovery
case by pre-populating .nemoclaw/sandboxes.json with an existing sandbox entry
and asserting the list recovery merges new session/live sandboxes instead of
overwriting; modify the two tests ("recovers a missing registry entry from the
last onboard session during list" and "imports additional live sandboxes into
the registry during list recovery") to create a sandboxes.json before running
runWithEnv("list") (e.g., write JSON with sandboxes: { "gamma": { ... } } and
defaultSandbox: "gamma"), then assert after the run that both the original
"gamma" and the recovered "alpha" (and "beta" in the second test) exist and
defaultSandbox remains correct or is updated as expected.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/nemoclaw.js`:
- Around line 117-125: The recovered sandbox entries drop persisted policy
presets because buildRecoveredSandboxEntry only copies metadata.policies and
ignores metadata.policyPresets; update buildRecoveredSandboxEntry to populate
policies from metadata.policies if present, otherwise fall back to
metadata.policyPresets (and default to []), e.g. set policies =
Array.isArray(metadata.policies) ? metadata.policies :
(Array.isArray(metadata.policyPresets) ? metadata.policyPresets : []); make the
same change in the analogous recovered-entry builder used for agents/recovered
sessions so both sources of presets are preserved.
- Around line 138-146: The current shouldRecoverRegistryEntries incorrectly
prevents recovery when the registry contains any entries even if those entries
are stale or a session/requested sandbox should be re-seeded; update the logic
in shouldRecoverRegistryEntries so that hasRecoverySeed is true when there is a
session.sandboxName or a requestedSandboxName (not just when
current.sandboxes.length > 0), and compute shouldRecover as true whenever the
registry is empty OR the requested sandbox is missing OR a session/requested
sandbox is present that isn’t represented in current.sandboxes; change the
definitions around missingRequestedSandbox, hasRecoverySeed and shouldRecover in
the shouldRecoverRegistryEntries function to reflect this (use
current.sandboxes.some(...) to detect presence and trigger recovery when that
check fails even if sandboxes.length > 0).
- Around line 854-880: The current spawnSync result handling in the logs command
block treats a SIGINT interrupt as a failure; modify the error/exit path to call
the existing exitWithSpawnResult(result) helper (used elsewhere) instead of
directly printing "Command failed" and process.exit when result.status is null
or result.signal is set, so signals are converted to proper exit codes; keep the
existing compatibility check
(printOldLogsCompatibilityGuidance(installedVersion) when the combined output
matches the old-logs patterns or version is too old) but invoke
exitWithSpawnResult(result) for non-zero/null status handling to mirror the
other call sites.

---

Outside diff comments:
In `@test/cli.test.js`:
- Around line 107-148: Update the test "maps --follow to openshell --tail" to
also assert that the spawned openshell never receives the original --follow
flag: after reading markerFile (written by the fake openshell script) add an
assertion such as expect(fs.readFileSync(markerFile,
"utf8")).not.toContain("--follow") (or assert equality to the exact expected
args string like "logs alpha --tail") so runWithEnv/openshell invocation does
not include "--follow".

---

Nitpick comments:
In `@test/cli.test.js`:
- Around line 354-528: Add a partial-registry recovery case by pre-populating
.nemoclaw/sandboxes.json with an existing sandbox entry and asserting the list
recovery merges new session/live sandboxes instead of overwriting; modify the
two tests ("recovers a missing registry entry from the last onboard session
during list" and "imports additional live sandboxes into the registry during
list recovery") to create a sandboxes.json before running runWithEnv("list")
(e.g., write JSON with sandboxes: { "gamma": { ... } } and defaultSandbox:
"gamma"), then assert after the run that both the original "gamma" and the
recovered "alpha" (and "beta" in the second test) exist and defaultSandbox
remains correct or is updated as expected.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bbf043a1-fc52-4733-96ff-8202a0be56a7

📥 Commits

Reviewing files that changed from the base of the PR and between b34c4f4 and d564ea2.

📒 Files selected for processing (2)

bin/nemoclaw.js
test/cli.test.js

bin/nemoclaw.js

ericksoa

Nice work absorbing the useful parts of #1148 and #960 — the recovery model is thoughtful and the test coverage is solid (10 new tests). A few questions inline about behavioral edges.

ericksoa · 2026-03-31T15:31:32Z

bin/nemoclaw.js

+    cwd: ROOT,
+    env: process.env,
+    encoding: "utf-8",
+    stdio: follow ? ["ignore", "inherit", "pipe"] : ["ignore", "pipe", "pipe"],


Nice catch on the --follow → --tail translation — this clearly fixes the regression.

One thing I wanted to think through with you on the stdio configuration: in follow mode, stderr is piped (["ignore", "inherit", "pipe"]), which means any errors openshell writes during a long-running streaming session won't be visible to the user until after they Ctrl-C. Could that mask a mid-stream error in a way that would be confusing? Would it be worth inheriting stderr too and handling the version-compatibility detection a different way (e.g., checking the version upfront only, which you already do)?

Good question. I kept stderr piped in follow mode for this PR because the compatibility path still needs to inspect parser failures from older OpenShell builds, and those show up on stderr before a stream is established. Since we now check the installed version up front as well, I agree the remaining tradeoff is mid-stream stderr visibility versus preserving that fallback detection. I chose to keep the existing detection path stable here and would treat inheriting stderr in follow mode as a follow-up cleanup if we want to simplify around the version gate alone.

ericksoa · 2026-03-31T15:31:32Z

bin/nemoclaw.js

+  };
+}
+
+function upsertRecoveredSandbox(name, metadata = {}) {


I notice resolveReconnectSandboxName runs validateName() on sandbox names that come from user input — nice. Looking at upsertRecoveredSandbox, though, names sourced from onboard-session.json or the gateway sandbox list output flow straight to registry.registerSandbox() without the same validation. Is there a reason to skip validation here, or would it be worth being consistent across both paths?

That is a fair consistency question. I originally skipped validation there because those names come from NemoClaw/OpenShell-managed sources rather than raw user input, and the gateway path is already constrained by OpenShell sandbox naming rules. I do agree that validating before registry writes would be a reasonable hardening step; I just did not want to expand this PR again after the UX direction changed to remove reconnect and move recovery onto connect.

ericksoa · 2026-03-31T15:31:32Z

bin/nemoclaw.js


-function listSandboxes() {
-  const { sandboxes, defaultSandbox } = registry.listSandboxes();
+async function listSandboxes() {


This is a thoughtful approach to the reboot problem — recovering the registry transparently during list so the user doesn't have to know about it.

One behavioral shift worth calling out: list was previously a pure read; now it can write to sandboxes.json as a side effect. Two questions come to mind:

If someone runs nemoclaw list while nemoclaw onboard is in progress on another terminal, could both be writing to sandboxes.json concurrently? How does the registry lock handle that?

Would it be useful to mention the write-on-read behavior in the help text or a doc note, so users who script around list aren't surprised?

I checked the registry implementation before answering this. The write path is serialized through the existing advisory lock in bin/lib/registry.js (mkdir lock dir + owner PID + stale-lock handling) and the file write itself is tmp-file-plus-rename, so list and onboard should serialize rather than clobber each other. I also updated the PR body to call out explicitly that list now repairs local inventory on read, since that side effect is worth documenting.

ericksoa · 2026-03-31T15:31:32Z

bin/nemoclaw.js

+  exitWithSpawnResult(result);
+}
+
+async function reconnect(args = []) {


The reconnect flow is clean — I like that it reuses sandboxConnect rather than inventing a new connection path.

Thinking through the call chain: reconnect calls recoverRegistryEntries() which may probe the gateway (via recoverRegistryFromLiveGateway → recoverNamedGatewayRuntime), then calls sandboxConnect() → ensureLiveSandboxOrExit() which can call recoverNamedGatewayRuntime() again. In the worst case, is that a double gateway probe? If so, is the cost low enough not to worry about, or would it be worth short-circuiting the second one if recovery already succeeded?

I agreed with the UX concern behind this thread and changed the shape after that review pass: the top-level reconnect command is gone now, and recovery moved onto nemoclaw <name> connect. So this specific double-probe path is no longer part of the PR. The updated flow is: dispatch sees <name> connect, attempts recovery for that requested sandbox, and if recovery succeeds it drops straight into the existing sandboxConnect() path.

kjw3 · 2026-03-31T17:30:04Z

Updated this PR to follow the product direction from the review discussion:

removed the top-level nemoclaw reconnect command
moved recovery onto nemoclaw <sandbox-name> connect, which now attempts local inventory recovery before the unknown-command path
kept list as the recovery entry point for rebuilding local inventory after reboot

I also refreshed the PR body to match the new shape and replaced the old validation notes with the current 5191d0d validation:

mac: connect-focused vitest pass
brev-cpu: passed using explicit Node v22.22.2 from ~/.nvm
brev-gpu: passed using explicit Node v22.22.1 from ~/.nvm
spark: passed using Node v22.22.1

For the cross-platform environment checks, I validated the recovery path directly: nemoclaw alpha connect recovers missing local registry state, probes openshell sandbox list, and then reaches openshell sandbox get/connect alpha on the updated branch.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

bin/nemoclaw.js (1)

132-139: ⚠️ Potential issue | 🟠 Major

Validate recovered sandbox names before persisting them.

Line 132 upserts names from recovery sources without the same validation used for user input paths. This can persist invalid names into sandboxes.json and create follow-on failures in command flows that assume validated names.

🛠️ Suggested hardening

+function isValidRecoveredSandboxName(name) {
+  return typeof name === "string" && /^[a-z0-9][a-z0-9-]*[a-z0-9]$/.test(name);
+}
+
 function upsertRecoveredSandbox(name, metadata = {}) {
+  if (!isValidRecoveredSandboxName(name)) {
+    return false;
+  }
   const entry = buildRecoveredSandboxEntry(name, metadata);
   if (registry.getSandbox(name)) {
     registry.updateSandbox(name, entry);
     return false;
   }
   registry.registerSandbox(entry);
   return true;
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 132 - 139, upsertRecoveredSandbox currently
persists recovered names without validation; before calling
buildRecoveredSandboxEntry or registry.registerSandbox/updateSandbox, run the
recovered name through the same sandbox name validation/sanitization used for
user-entered paths (e.g. the project's validate/sanitize function for sandbox
names) and skip or normalize any invalid names; specifically, validate the name
variable at the top of upsertRecoveredSandbox, return false or log and skip when
invalid, and only call buildRecoveredSandboxEntry, registry.updateSandbox or
registry.registerSandbox for names that pass validation.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/nemoclaw.js`:
- Around line 117-129: buildRecoveredSandboxEntry correctly includes
nimContainer in the recovered entry, but registerSandbox in bin/lib/registry.js
does not persist nimContainer, so recovered entries lose their container
linkage; update the registerSandbox(sandbox) implementation to accept and store
sandbox.nimContainer (alongside name, model, provider, gpuEnabled, policies)
when creating/persisting the sandbox record and ensure any normalization/DB
write paths (e.g., the code paths that previously handled
model/provider/policies) include nimContainer so the value from
buildRecoveredSandboxEntry is retained after registry.registerSandbox(entry) is
called.

---

Duplicate comments:
In `@bin/nemoclaw.js`:
- Around line 132-139: upsertRecoveredSandbox currently persists recovered names
without validation; before calling buildRecoveredSandboxEntry or
registry.registerSandbox/updateSandbox, run the recovered name through the same
sandbox name validation/sanitization used for user-entered paths (e.g. the
project's validate/sanitize function for sandbox names) and skip or normalize
any invalid names; specifically, validate the name variable at the top of
upsertRecoveredSandbox, return false or log and skip when invalid, and only call
buildRecoveredSandboxEntry, registry.updateSandbox or registry.registerSandbox
for names that pass validation.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6e4c2dbb-46da-4781-bbcf-703787e06543

📥 Commits

Reviewing files that changed from the base of the PR and between afbd703 and 888b19e.

📒 Files selected for processing (2)

bin/nemoclaw.js
test/cli.test.js

coderabbitai · 2026-03-31T17:35:13Z

bin/nemoclaw.js

+function buildRecoveredSandboxEntry(name, metadata = {}) {
+  return {
+    name,
+    model: metadata.model || null,
+    provider: metadata.provider || null,
+    gpuEnabled: metadata.gpuEnabled === true,
+    policies: Array.isArray(metadata.policies)
+      ? metadata.policies
+      : Array.isArray(metadata.policyPresets)
+        ? metadata.policyPresets
+        : [],
+    nimContainer: metadata.nimContainer || null,
+  };


⚠️ Potential issue | 🟠 Major

Recovered nimContainer metadata is still lost on first insert.

Line 128 includes nimContainer, but Line 138 inserts through registry.registerSandbox(entry), and bin/lib/registry.js currently does not persist nimContainer in registerSandbox(). Result: newly recovered entries lose container linkage immediately.

🧩 Cross-file fix (`bin/lib/registry.js`)

function registerSandbox(entry) { return withLock(() => { const data = load(); const normalized = normalizeEntry(entry); if (!normalized) { throw new Error(`Invalid sandbox entry: ${entry.name}`); } data.sandboxes[normalized.name] = { name: normalized.name, model: normalized.model, provider: normalized.provider, gpuEnabled: normalized.gpuEnabled, policies: normalized.policies || [], + nimContainer: normalized.nimContainer, }; if (!data.defaultSandbox) { data.defaultSandbox = entry.name; } save(data); }); }

Also applies to: 132-139

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@bin/nemoclaw.js` around lines 117 - 129, buildRecoveredSandboxEntry correctly includes nimContainer in the recovered entry, but registerSandbox in bin/lib/registry.js does not persist nimContainer, so recovered entries lose their container linkage; update the registerSandbox(sandbox) implementation to accept and store sandbox.nimContainer (alongside name, model, provider, gpuEnabled, policies) when creating/persisting the sandbox record and ensure any normalization/DB write paths (e.g., the code paths that previously handled model/provider/policies) include nimContainer so the value from buildRecoveredSandboxEntry is retained after registry.registerSandbox(entry) is called.

ericksoa

Looks good — all prior comments addressed. The recovery pipeline is well-factored, test coverage is solid (10 new tests including partial-registry cases), and the connect-based recovery is a cleaner UX than a separate reconnect command.

Two minor follow-ups for a future PR, neither blocking:

Inheriting stderr in follow mode once the version gate is the sole compat check
Validating recovered sandbox names before registry writes for consistency

Approving.

kjw3 · 2026-03-31T17:56:11Z

Thanks, Aaron. I'll merge once all CI goes green

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (1)

bin/nemoclaw.js (1)

149-163: ⚠️ Potential issue | 🟠 Major

Partial registries can still skip the live-gateway rebuild.

Line 160 only enables recovery when the registry is empty, the requested sandbox is missing, or the session sandbox is missing. If sandboxes.json already contains the session sandbox but is still missing other live sandboxes, Line 718 never reaches recoverRegistryFromLiveGateway(), so nemoclaw list stays incomplete after reboot.

🛠️ Suggested direction

-function shouldRecoverRegistryEntries(current, session, requestedSandboxName) {
+function shouldRecoverRegistryEntries(current, session, requestedSandboxName, { probeLiveGateway = false } = {}) {
   const hasSessionSandbox = Boolean(session?.sandboxName);
   const missingSessionSandbox =
     hasSessionSandbox &&
     !current.sandboxes.some((sandbox) => sandbox.name === session.sandboxName);
   const missingRequestedSandbox =
     Boolean(requestedSandboxName) &&
     !current.sandboxes.some((sandbox) => sandbox.name === requestedSandboxName);
   const hasRecoverySeed = current.sandboxes.length > 0 || hasSessionSandbox || Boolean(requestedSandboxName);
   return {
     missingRequestedSandbox,
     shouldRecover:
       hasRecoverySeed &&
-      (current.sandboxes.length === 0 || missingRequestedSandbox || missingSessionSandbox),
+      (current.sandboxes.length === 0 || missingRequestedSandbox || missingSessionSandbox || probeLiveGateway),
   };
 }

-async function recoverRegistryEntries({ requestedSandboxName = null } = {}) {
+async function recoverRegistryEntries({ requestedSandboxName = null, probeLiveGateway = false } = {}) {
   const current = registry.listSandboxes();
   const session = onboardSession.loadSession();
-  const recoveryCheck = shouldRecoverRegistryEntries(current, session, requestedSandboxName);
+  const recoveryCheck = shouldRecoverRegistryEntries(current, session, requestedSandboxName, { probeLiveGateway });
   if (!recoveryCheck.shouldRecover) {
     return { ...current, recoveredFromSession: false, recoveredFromGateway: 0 };
   }

   const seeded = seedRecoveryMetadata(current, session, requestedSandboxName);
-  const shouldProbeLiveGateway = current.sandboxes.length > 0 || Boolean(session?.sandboxName);
+  const shouldProbeLiveGateway = probeLiveGateway || current.sandboxes.length > 0 || Boolean(session?.sandboxName);

-  const recovery = await recoverRegistryEntries();
+  const recovery = await recoverRegistryEntries({ probeLiveGateway: true });

Also applies to: 226-238, 717-719

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 149 - 163, The current
shouldRecoverRegistryEntries logic misses the case of a partial registry
(registry contains some sandboxes but is missing other live sandboxes), so
change the function signature shouldRecoverRegistryEntries(current, session,
requestedSandboxName, hasMissingLiveSandboxes=false) and include
hasMissingLiveSandboxes in the shouldRecover condition (i.e., set shouldRecover
to true when hasMissingLiveSandboxes is true OR when current.sandboxes.length
=== 0 OR missingRequestedSandbox OR missingSessionSandbox). Update all call
sites (the code around the uses at the blocks referenced near the original diffs
and where recoverRegistryFromLiveGateway() is invoked) to compute/pass
hasMissingLiveSandboxes (by comparing current.sandboxes to the live-gateway list
before calling this helper) so recoverRegistryFromLiveGateway() runs for partial
registries as well.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/nemoclaw.js`:
- Around line 217-223: The recovery logic in applyRecoveredDefault currently
prefers requestedSandboxName even when a currentDefaultSandbox exists, causing
silent global default changes; change it so the default is only set when there
is no currentDefaultSandbox (i.e., only use requestedSandboxName or
session.sandboxName when currentDefaultSandbox is falsy), then call
registry.setDefault(preferredDefault) only if preferredDefault is truthy and
exists in recovered.sandboxes; update the same pattern at the other occurrence
(around lines 1052-1057) to avoid silently retargeting the global default during
connect recovery.

---

Duplicate comments:
In `@bin/nemoclaw.js`:
- Around line 149-163: The current shouldRecoverRegistryEntries logic misses the
case of a partial registry (registry contains some sandboxes but is missing
other live sandboxes), so change the function signature
shouldRecoverRegistryEntries(current, session, requestedSandboxName,
hasMissingLiveSandboxes=false) and include hasMissingLiveSandboxes in the
shouldRecover condition (i.e., set shouldRecover to true when
hasMissingLiveSandboxes is true OR when current.sandboxes.length === 0 OR
missingRequestedSandbox OR missingSessionSandbox). Update all call sites (the
code around the uses at the blocks referenced near the original diffs and where
recoverRegistryFromLiveGateway() is invoked) to compute/pass
hasMissingLiveSandboxes (by comparing current.sandboxes to the live-gateway list
before calling this helper) so recoverRegistryFromLiveGateway() runs for partial
registries as well.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: efe6fa8b-b153-4fb7-ab88-c4aac028fe75

📥 Commits

Reviewing files that changed from the base of the PR and between 888b19e and 76b210f.

📒 Files selected for processing (2)

bin/nemoclaw.js
test/cli.test.js

coderabbitai · 2026-03-31T17:59:07Z

bin/nemoclaw.js

+  metadataByName.set(
+    session.sandboxName,
+    buildRecoveredSandboxEntry(session.sandboxName, {
+      model: session.model || null,
+      provider: session.provider || null,
+      nimContainer: session.nimContainer || null,
+      policyPresets: session.policyPresets || null,
+    })
+  );
+  const sessionSandboxMissing = !current.sandboxes.some((sandbox) => sandbox.name === session.sandboxName);
+  const shouldRecoverSessionSandbox =
+    current.sandboxes.length === 0 || sessionSandboxMissing || requestedSandboxName === session.sandboxName;
+  if (shouldRecoverSessionSandbox) {
+    recoveredFromSession = upsertRecoveredSandbox(session.sandboxName, metadataByName.get(session.sandboxName));
+  }


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail REGISTRY_FILE="$(fd '^registry\.js$' bin | head -n1)" echo "Inspecting: $REGISTRY_FILE" rg -n -C6 'function normalizeEntry|function updateSandbox|gpuEnabled|policies|nimContainer' "$REGISTRY_FILE"

Repository: NVIDIA/NemoClaw

Length of output: 929

🏁 Script executed:

cat -n bin/lib/registry.js | sed -n '165,200p'

Repository: NVIDIA/NemoClaw

Length of output: 1211

🏁 Script executed:

cat -n bin/nemoclaw.js | sed -n '174,215p'

Repository: NVIDIA/NemoClaw

Length of output: 1878

🏁 Script executed:

rg -n -A15 'function buildRecoveredSandboxEntry|function upsertRecoveredSandbox' bin/nemoclaw.js

Repository: NVIDIA/NemoClaw

Length of output: 999

Session recovery can reset critical sandbox fields when updating existing sandboxes.

Line 142 in bin/lib/registry.js calls updateSandbox() with an entry created by buildRecoveredSandboxEntry(). Although updateSandbox() uses Object.assign() (a merge operation at line 172), the recovered entry is built from session snapshots that don't include all original fields. Specifically:

buildRecoveredSandboxEntry() defaults gpuEnabled to false (line 122) if missing from metadata

It defaults policies to [] (line 123–127) if missing from metadata

When seedRecoveryMetadata() processes a session at lines 174–182 and later passes it through upsertRecoveredSandbox(), these incomplete snapshots will overwrite existing sandbox fields with defaults, silently resetting gpuEnabled, policies, or nimContainer on already-registered sandboxes.

Preserve existing fields by not overwriting entries in metadataByName when they already exist, or ensure session snapshots only update fields that were actually saved in the session.

coderabbitai · 2026-03-31T17:59:07Z

bin/nemoclaw.js

+function applyRecoveredDefault(currentDefaultSandbox, requestedSandboxName, session) {
+  const recovered = registry.listSandboxes();
+  const preferredDefault = requestedSandboxName || (!currentDefaultSandbox ? session?.sandboxName || null : null);
+  if (preferredDefault && recovered.sandboxes.some((sandbox) => sandbox.name === preferredDefault)) {
+    registry.setDefault(preferredDefault);
+  }
+  return registry.listSandboxes();


⚠️ Potential issue | 🟡 Minor

connect recovery shouldn't silently retarget the default sandbox.

Line 219 prefers requestedSandboxName even when currentDefaultSandbox is already set. That means beta connect can repair the registry and also change later global commands to operate on beta, which is a surprising side effect for a recovery path.

🛠️ Suggested fix

function applyRecoveredDefault(currentDefaultSandbox, requestedSandboxName, session) { const recovered = registry.listSandboxes(); - const preferredDefault = requestedSandboxName || (!currentDefaultSandbox ? session?.sandboxName || null : null); + const preferredDefault = currentDefaultSandbox + ? null + : requestedSandboxName || session?.sandboxName || null; if (preferredDefault && recovered.sandboxes.some((sandbox) => sandbox.name === preferredDefault)) { registry.setDefault(preferredDefault); } return registry.listSandboxes(); }

Also applies to: 1052-1057

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@bin/nemoclaw.js` around lines 217 - 223, The recovery logic in applyRecoveredDefault currently prefers requestedSandboxName even when a currentDefaultSandbox exists, causing silent global default changes; change it so the default is only set when there is no currentDefaultSandbox (i.e., only use requestedSandboxName or session.sandboxName when currentDefaultSandbox is falsy), then call registry.setDefault(preferredDefault) only if preferredDefault is truthy and exists in recovered.sandboxes; update the same pattern at the other occurrence (around lines 1052-1057) to avoid silently retargeting the global default during connect recovery.

coderabbitai

🧹 Nitpick comments (1)

test/cli.test.js (1)

349-845: Consider extracting shared test fixtures to reduce duplication.

The repeated onboard-session payloads and openshell shim scaffolding make this section expensive to maintain; a few helpers (e.g., writeRegistry, writeSession, writeOpenshellShim) would reduce drift risk.

♻️ Suggested refactor sketch

+function writeRegistry(nemoclawDir, sandboxes, defaultSandbox) {
+  fs.writeFileSync(
+    path.join(nemoclawDir, "sandboxes.json"),
+    JSON.stringify({ sandboxes, defaultSandbox }),
+    { mode: 0o600 }
+  );
+}
+
+function writeSession(nemoclawDir, overrides = {}) {
+  const base = {
+    version: 1,
+    sessionId: "session-1",
+    resumable: true,
+    status: "complete",
+    mode: "interactive",
+    startedAt: "2026-03-31T00:00:00.000Z",
+    updatedAt: "2026-03-31T00:00:00.000Z",
+    sandboxName: "alpha",
+    provider: "nvidia-prod",
+    model: "nvidia/nemotron-3-super-120b-a12b",
+    policyPresets: ["pypi"],
+    steps: {},
+  };
+  fs.writeFileSync(
+    path.join(nemoclawDir, "onboard-session.json"),
+    JSON.stringify({ ...base, ...overrides }, null, 2),
+    { mode: 0o600 }
+  );
+}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/cli.test.js` around lines 349 - 845, Extract the repeated test setup
into helpers to reduce duplication: create functions like
writeRegistry(nemoclawDir, data) to write sandboxes.json,
writeSession(nemoclawDir, sessionPayload) to write onboard-session.json, and
writeOpenshellShim(localBin, scriptVariants) to create the openshell shim;
update tests that call runWithEnv to invoke these helpers (replace the inline
fs.writeFileSync blocks and repeated JSON payloads) so each test only specifies
the varying parts (e.g., sandbox names, sandbox list output lines, markerFile)
while the helper handles file creation, modes, and common fields.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@test/cli.test.js`:
- Around line 349-845: Extract the repeated test setup into helpers to reduce
duplication: create functions like writeRegistry(nemoclawDir, data) to write
sandboxes.json, writeSession(nemoclawDir, sessionPayload) to write
onboard-session.json, and writeOpenshellShim(localBin, scriptVariants) to create
the openshell shim; update tests that call runWithEnv to invoke these helpers
(replace the inline fs.writeFileSync blocks and repeated JSON payloads) so each
test only specifies the varying parts (e.g., sandbox names, sandbox list output
lines, markerFile) while the helper handles file creation, modes, and common
fields.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4bf4b20a-3871-4463-a533-c2adbc67ee8f

📥 Commits

Reviewing files that changed from the base of the PR and between 76b210f and b9cc576.

📒 Files selected for processing (1)

test/cli.test.js

fix: recover logs and reboot cli flows

d564ea2

kjw3 marked this pull request as ready for review March 31, 2026 14:36

coderabbitai bot reviewed Mar 31, 2026

View reviewed changes

bin/nemoclaw.js Show resolved Hide resolved

bin/nemoclaw.js Outdated Show resolved Hide resolved

kjw3 added 2 commits March 31, 2026 10:53

fix: preserve recovered sandbox metadata and log exits

afbd703

Merge branch 'main' into fix/logs-follow-reconnect-reboot

417fc71

ericksoa reviewed Mar 31, 2026

View reviewed changes

kjw3 marked this pull request as draft March 31, 2026 15:59

kjw3 added 2 commits March 31, 2026 12:04

fix(cli): recover named sandbox connects without reconnect

5191d0d

Merge branch 'main' into fix/logs-follow-reconnect-reboot

888b19e

kjw3 marked this pull request as ready for review March 31, 2026 17:28

coderabbitai bot reviewed Mar 31, 2026

View reviewed changes

Merge branch 'main' into fix/logs-follow-reconnect-reboot

09a1985

kjw3 requested a review from brandonpelfrey March 31, 2026 17:38

ericksoa approved these changes Mar 31, 2026

View reviewed changes

fix(cli): validate recovered sandbox names

76b210f

kjw3 removed the request for review from brandonpelfrey March 31, 2026 17:51

Merge branch 'main' into fix/logs-follow-reconnect-reboot

b9cc576

coderabbitai bot reviewed Mar 31, 2026

View reviewed changes

kjw3 added 2 commits March 31, 2026 15:11

Merge branch 'main' into fix/logs-follow-reconnect-reboot

958a8e5

test(cli): restore coverage ratchet for recovery changes

16393b8

kjw3 added 3 commits March 31, 2026 15:47

Merge branch 'main' into fix/logs-follow-reconnect-reboot

8b24a8a

test(cli): fix CommonJS credential import typing

124a1ca

Merge branch 'main' into fix/logs-follow-reconnect-reboot

28be134

kjw3 merged commit 1b8e6b9 into main Mar 31, 2026
8 checks passed

kjw3 deleted the fix/logs-follow-reconnect-reboot branch March 31, 2026 20:13

Conversation

kjw3 commented Mar 31, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Issues

Why this shape

Changes

Logs

Reboot recovery

Connect UX

Cross-platform validation

mac

brev-cpu (kj-nemoclaw-cpu-test)

brev-gpu (kj-nemoclaw-l40s-test)

spark (spark-d8c8)

Credits

Notes

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

ericksoa left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kjw3 commented Mar 31, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

ericksoa left a comment

Choose a reason for hiding this comment

Uh oh!

kjw3 commented Mar 31, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

kjw3 commented Mar 31, 2026 •

edited by coderabbitai bot

Loading

brev-cpu (`kj-nemoclaw-cpu-test`)

brev-gpu (`kj-nemoclaw-l40s-test`)

spark (`spark-d8c8`)

coderabbitai bot commented Mar 31, 2026 •

edited

Loading