Skip to content

fix(cli): restore logs streaming and reboot recovery ux#1187

Merged
kjw3 merged 13 commits intomainfrom
fix/logs-follow-reconnect-reboot
Mar 31, 2026
Merged

fix(cli): restore logs streaming and reboot recovery ux#1187
kjw3 merged 13 commits intomainfrom
fix/logs-follow-reconnect-reboot

Conversation

@kjw3
Copy link
Copy Markdown
Contributor

@kjw3 kjw3 commented Mar 31, 2026

Summary

  • restore nemoclaw <name> logs --follow by mapping the user-facing --follow flag to the OpenShell --tail flag
  • add compatibility guidance when the installed OpenShell CLI is too old for NemoClaw log streaming
  • recover local sandbox inventory from the last onboard session so sandboxes do not appear to disappear after a reboot
  • import additional live sandbox names from the current OpenShell gateway during list recovery
  • make nemoclaw <sandbox-name> connect recover missing local registry state directly instead of introducing a separate top-level reconnect command

This absorbs the useful parts of #1148 and the recovery ideas from #960, but keeps the product surface on the command users already expect to reach for after a reboot: nemoclaw <name> connect.

Issues

Fixes #1146
Fixes #1165
Addresses #990
Addresses #1154

Why this shape

  • #1148 is the right fix for the logs regression, but too narrow by itself
  • reboot recovery is more than a connect alias; NemoClaw has to rebuild local inventory before connect can succeed consistently
  • users will try nemoclaw <name> connect first, so recovery belongs on that path rather than behind a new top-level command
  • list now repairs stale or missing local sandbox inventory so the user sees recoverable state instead of an empty registry after reboot

Changes

Logs

  • map nemoclaw logs --follow to openshell logs --tail
  • keep --follow as the NemoClaw user-facing flag
  • print directed upgrade guidance for older or incompatible OpenShell log subcommands instead of leaving only raw parser failures
  • preserve normal SIGINT exit semantics for logs --follow

Reboot recovery

  • recover sandbox inventory from ~/.nemoclaw/onboard-session.json when local registry is missing or stale
  • import additional live sandbox names from the active or recovered OpenShell gateway during list recovery
  • preserve policy presets and partial registries during recovery instead of flattening them to an empty/default state
  • keep registry writes serialized through the existing advisory lock plus atomic rename path in bin/lib/registry.js

Connect UX

  • remove the top-level nemoclaw reconnect command
  • make nemoclaw <sandbox-name> connect attempt recovery before falling into the unknown-command path
  • reuse the existing sandboxConnect() flow after recovery instead of introducing a second connection path

Cross-platform validation

mac

  • branch/commit: fix/logs-follow-reconnect-reboot / 5191d0d
  • commands:
    • /Users/kejones/Git/nvidia/NemoClaw-2/node_modules/.bin/vitest run test/cli.test.js -t "connect"
  • result: passed (6 connect-focused tests passed, 26 skipped)

brev-cpu (kj-nemoclaw-cpu-test)

  • branch/commit: origin/fix/logs-follow-reconnect-reboot / 5191d0d
  • node runtime used explicitly: /home/ubuntu/.nvm/versions/node/v22.22.2/bin/node
  • validation ran in a disposable worktree with a fake openshell shim and last-session recovery seed
  • result: passed
  • verified behavior:
    • nemoclaw alpha connect recovered from missing local registry state
    • recovery probed openshell sandbox list
    • connect path then executed openshell sandbox get alpha
    • connect path then executed openshell sandbox connect alpha

brev-gpu (kj-nemoclaw-l40s-test)

  • branch/commit: origin/fix/logs-follow-reconnect-reboot / 5191d0d
  • node runtime used explicitly: /home/ubuntu/.nvm/versions/node/v22.22.1/bin/node
  • validation ran in a disposable worktree with a fake openshell shim and last-session recovery seed
  • result: passed
  • verified behavior:
    • nemoclaw alpha connect recovered from missing local registry state
    • recovery probed openshell sandbox list
    • connect path then executed openshell sandbox get alpha
    • connect path then executed openshell sandbox connect alpha

spark (spark-d8c8)

  • branch/commit: origin/fix/logs-follow-reconnect-reboot / 5191d0d
  • node runtime used: v22.22.1
  • validation ran in a disposable worktree with a fake openshell shim and last-session recovery seed
  • result: passed
  • verified behavior:
    • nemoclaw alpha connect recovered from missing local registry state
    • recovery probed openshell sandbox list
    • connect path then executed openshell sandbox get alpha
    • connect path then executed openshell sandbox connect alpha

Credits

This branch pulls forward ideas from earlier good-faith work, but reworks them onto the current codebase and recovery model:

  • #1148 by senthilr-nv
    • provided the correct narrow fix for the --follow -> --tail logs regression
  • #960 by WuKongAI-CMU (Peter)
    • provided useful recovery and reconnect test ideas, even though the final product direction here keeps recovery on connect

Notes

  • did not merge #1148 directly; absorbed its fix and test intent here
  • did not keep the top-level reconnect command from #960; the final UX puts recovery on nemoclaw <name> connect

Signed-off-by: Kevin Jones kejones@nvidia.com

Summary by CodeRabbit

  • New Features

    • OpenShell version compatibility check for logs with clear upgrade guidance
    • Enhanced sandbox discovery that recovers entries from prior onboarding and live probes
    • Automatic sandbox recovery during connect when a sandbox is missing locally
  • Bug Fixes

    • Corrected logs streaming behavior for follow/non-follow modes and improved error handling
  • Tests

    • Added CLI tests for logs follow behavior, registry recovery/persistence, connect recovery, and SIGINT handling

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 31, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Async registry-recovery seeded from onboard session and optional OpenShell probe; OpenShell semver parsing and logs compatibility gate; --follow translated to --tail with modified spawn/stdio behavior for logs; pre-connect recovery added. Tests expanded for logs, recovery, and SIGINT behavior.

Changes

Cohort / File(s) Summary
CLI core: logs & recovery
bin/nemoclaw.js
Added async recoverRegistryEntries() to rebuild/seed sandboxes.json from onboard-session.json and optionally probe OpenShell; list now awaits async recovery; connect attempts pre-dispatch recovery when a named sandbox is missing. Added OpenShell semver parsing and MIN_LOGS_OPENSHELL_VERSION gating; rewrote sandboxLogs() to map --follow--tail, change spawn/stdio handling for follow vs non-follow, detect incompatible OpenShell behavior and print upgrade/usage guidance or exit.
Tests: CLI dispatch & recovery
test/cli.test.js
Expanded tests to assert logs --follow is forwarded as --tail, added non-follow logs coverage, added compatibility/error test for older OpenShell exiting with code 1 and printing upgrade guidance, added registry recovery tests for list and connect (seed from onboard-session.json, optional openshell sandbox list probe, skip invalid names), and a SIGINT semantics test for logs --follow.
Test harness imports
test/...
Added spawnSync import from node:child_process where needed to support signal/exit assertions in tests.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as "nemoclaw CLI"
    participant Reg as "Local Registry\n(sandboxes.json)"
    participant Session as "onboard-session.json"
    participant GW as "OpenShell (openshell)"

    CLI->>Reg: Load registry
    alt registry incomplete
        CLI->>Session: Read last onboard session
        Session-->>CLI: Return last onboarded sandbox info
        CLI->>Reg: Seed missing sandbox metadata
        CLI->>GW: Optional: `openshell sandbox list`
        GW-->>CLI: Return live sandbox list
        CLI->>Reg: Upsert gateway entries & update default
        Reg-->>CLI: Persist recovered registry
    end
    CLI->>CLI: Parse `logs <name> --follow`
    CLI->>CLI: Check OpenShell version (semver)
    alt supports --tail
        CLI->>GW: Run `openshell logs <name> --tail` (stream / spawn handling)
        GW-->>CLI: Stream output / exit
    else incompatible
        CLI-->>User: Print compatibility guidance and exit(1)
    end
    CLI->>CLI: Pre-dispatch recovery for `connect <name>` if missing
    CLI->>GW: `openshell sandbox connect <name>`
    GW-->>CLI: Connection result
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I stitched lost sandboxes from yesterday's file,

swapped your follow for a tail with a smile.
I probed the gateway, coaxed registries back,
handled SIGINT so your streams won't crack.
Hop in — recovered logs are ready to track.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 4.55% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed Title concisely summarizes the main changes: restoring logs streaming functionality and improving recovery UX, which aligns with the primary objectives of fixing --follow flag handling and implementing registry recovery.
Linked Issues check ✅ Passed The PR fully addresses all coding requirements from #1146 and #1165: maps --follow to --tail [#1146,#1165], maintains kubectl-style UX [#1165], provides compatibility guidance [#1165], updates tests to assert --tail translation [#1146], implements registry recovery for list/connect [PR objectives], and preserves SIGINT semantics [PR objectives].
Out of Scope Changes check ✅ Passed All changes are in-scope and aligned with objectives: OpenShell version compatibility handling, --follow to --tail translation, registry recovery from onboard-session.json, connect pre-dispatch recovery, and comprehensive test coverage directly support the stated PR goals.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/logs-follow-reconnect-reboot

Comment @coderabbitai help to get the list of available commands and usage tips.

@kjw3 kjw3 marked this pull request as ready for review March 31, 2026 14:36
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/cli.test.js (1)

107-148: ⚠️ Potential issue | 🟡 Minor

Also assert that --follow never reaches OpenShell.

This still passes for openshell logs alpha --follow --tail, which is the exact compatibility regression this change is trying to prevent.

🧪 Suggested assertion
     const r = runWithEnv("alpha logs --follow", {
       HOME: home,
       PATH: `${localBin}:${process.env.PATH || ""}`,
     });

     expect(r.code).toBe(0);
-    expect(fs.readFileSync(markerFile, "utf8")).toContain("logs alpha --tail");
+    const recordedArgs = fs.readFileSync(markerFile, "utf8");
+    expect(recordedArgs).toContain("logs alpha --tail");
+    expect(recordedArgs).not.toContain("--follow");
   });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/cli.test.js` around lines 107 - 148, Update the test "maps --follow to
openshell --tail" to also assert that the spawned openshell never receives the
original --follow flag: after reading markerFile (written by the fake openshell
script) add an assertion such as expect(fs.readFileSync(markerFile,
"utf8")).not.toContain("--follow") (or assert equality to the exact expected
args string like "logs alpha --tail") so runWithEnv/openshell invocation does
not include "--follow".
🧹 Nitpick comments (1)
test/cli.test.js (1)

354-528: Add a partial-registry recovery case.

These tests only start from an empty sandboxes.json. The current recovery bug shows up when one local sandbox already exists and list is supposed to merge the session sandbox or extra live names, so this suite will not catch that regression yet.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/cli.test.js` around lines 354 - 528, Add a partial-registry recovery
case by pre-populating .nemoclaw/sandboxes.json with an existing sandbox entry
and asserting the list recovery merges new session/live sandboxes instead of
overwriting; modify the two tests ("recovers a missing registry entry from the
last onboard session during list" and "imports additional live sandboxes into
the registry during list recovery") to create a sandboxes.json before running
runWithEnv("list") (e.g., write JSON with sandboxes: { "gamma": { ... } } and
defaultSandbox: "gamma"), then assert after the run that both the original
"gamma" and the recovered "alpha" (and "beta" in the second test) exist and
defaultSandbox remains correct or is updated as expected.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/nemoclaw.js`:
- Around line 117-125: The recovered sandbox entries drop persisted policy
presets because buildRecoveredSandboxEntry only copies metadata.policies and
ignores metadata.policyPresets; update buildRecoveredSandboxEntry to populate
policies from metadata.policies if present, otherwise fall back to
metadata.policyPresets (and default to []), e.g. set policies =
Array.isArray(metadata.policies) ? metadata.policies :
(Array.isArray(metadata.policyPresets) ? metadata.policyPresets : []); make the
same change in the analogous recovered-entry builder used for agents/recovered
sessions so both sources of presets are preserved.
- Around line 138-146: The current shouldRecoverRegistryEntries incorrectly
prevents recovery when the registry contains any entries even if those entries
are stale or a session/requested sandbox should be re-seeded; update the logic
in shouldRecoverRegistryEntries so that hasRecoverySeed is true when there is a
session.sandboxName or a requestedSandboxName (not just when
current.sandboxes.length > 0), and compute shouldRecover as true whenever the
registry is empty OR the requested sandbox is missing OR a session/requested
sandbox is present that isn’t represented in current.sandboxes; change the
definitions around missingRequestedSandbox, hasRecoverySeed and shouldRecover in
the shouldRecoverRegistryEntries function to reflect this (use
current.sandboxes.some(...) to detect presence and trigger recovery when that
check fails even if sandboxes.length > 0).
- Around line 854-880: The current spawnSync result handling in the logs command
block treats a SIGINT interrupt as a failure; modify the error/exit path to call
the existing exitWithSpawnResult(result) helper (used elsewhere) instead of
directly printing "Command failed" and process.exit when result.status is null
or result.signal is set, so signals are converted to proper exit codes; keep the
existing compatibility check
(printOldLogsCompatibilityGuidance(installedVersion) when the combined output
matches the old-logs patterns or version is too old) but invoke
exitWithSpawnResult(result) for non-zero/null status handling to mirror the
other call sites.

---

Outside diff comments:
In `@test/cli.test.js`:
- Around line 107-148: Update the test "maps --follow to openshell --tail" to
also assert that the spawned openshell never receives the original --follow
flag: after reading markerFile (written by the fake openshell script) add an
assertion such as expect(fs.readFileSync(markerFile,
"utf8")).not.toContain("--follow") (or assert equality to the exact expected
args string like "logs alpha --tail") so runWithEnv/openshell invocation does
not include "--follow".

---

Nitpick comments:
In `@test/cli.test.js`:
- Around line 354-528: Add a partial-registry recovery case by pre-populating
.nemoclaw/sandboxes.json with an existing sandbox entry and asserting the list
recovery merges new session/live sandboxes instead of overwriting; modify the
two tests ("recovers a missing registry entry from the last onboard session
during list" and "imports additional live sandboxes into the registry during
list recovery") to create a sandboxes.json before running runWithEnv("list")
(e.g., write JSON with sandboxes: { "gamma": { ... } } and defaultSandbox:
"gamma"), then assert after the run that both the original "gamma" and the
recovered "alpha" (and "beta" in the second test) exist and defaultSandbox
remains correct or is updated as expected.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bbf043a1-fc52-4733-96ff-8202a0be56a7

📥 Commits

Reviewing files that changed from the base of the PR and between b34c4f4 and d564ea2.

📒 Files selected for processing (2)
  • bin/nemoclaw.js
  • test/cli.test.js

Copy link
Copy Markdown
Contributor

@ericksoa ericksoa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work absorbing the useful parts of #1148 and #960 — the recovery model is thoughtful and the test coverage is solid (10 new tests). A few questions inline about behavioral edges.

cwd: ROOT,
env: process.env,
encoding: "utf-8",
stdio: follow ? ["ignore", "inherit", "pipe"] : ["ignore", "pipe", "pipe"],
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch on the --follow--tail translation — this clearly fixes the regression.

One thing I wanted to think through with you on the stdio configuration: in follow mode, stderr is piped (["ignore", "inherit", "pipe"]), which means any errors openshell writes during a long-running streaming session won't be visible to the user until after they Ctrl-C. Could that mask a mid-stream error in a way that would be confusing? Would it be worth inheriting stderr too and handling the version-compatibility detection a different way (e.g., checking the version upfront only, which you already do)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I kept stderr piped in follow mode for this PR because the compatibility path still needs to inspect parser failures from older OpenShell builds, and those show up on stderr before a stream is established. Since we now check the installed version up front as well, I agree the remaining tradeoff is mid-stream stderr visibility versus preserving that fallback detection. I chose to keep the existing detection path stable here and would treat inheriting stderr in follow mode as a follow-up cleanup if we want to simplify around the version gate alone.

};
}

function upsertRecoveredSandbox(name, metadata = {}) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice resolveReconnectSandboxName runs validateName() on sandbox names that come from user input — nice. Looking at upsertRecoveredSandbox, though, names sourced from onboard-session.json or the gateway sandbox list output flow straight to registry.registerSandbox() without the same validation. Is there a reason to skip validation here, or would it be worth being consistent across both paths?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a fair consistency question. I originally skipped validation there because those names come from NemoClaw/OpenShell-managed sources rather than raw user input, and the gateway path is already constrained by OpenShell sandbox naming rules. I do agree that validating before registry writes would be a reasonable hardening step; I just did not want to expand this PR again after the UX direction changed to remove reconnect and move recovery onto connect.


function listSandboxes() {
const { sandboxes, defaultSandbox } = registry.listSandboxes();
async function listSandboxes() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a thoughtful approach to the reboot problem — recovering the registry transparently during list so the user doesn't have to know about it.

One behavioral shift worth calling out: list was previously a pure read; now it can write to sandboxes.json as a side effect. Two questions come to mind:

  1. If someone runs nemoclaw list while nemoclaw onboard is in progress on another terminal, could both be writing to sandboxes.json concurrently? How does the registry lock handle that?
  2. Would it be useful to mention the write-on-read behavior in the help text or a doc note, so users who script around list aren't surprised?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the registry implementation before answering this. The write path is serialized through the existing advisory lock in bin/lib/registry.js (mkdir lock dir + owner PID + stale-lock handling) and the file write itself is tmp-file-plus-rename, so list and onboard should serialize rather than clobber each other. I also updated the PR body to call out explicitly that list now repairs local inventory on read, since that side effect is worth documenting.

bin/nemoclaw.js Outdated
exitWithSpawnResult(result);
}

async function reconnect(args = []) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reconnect flow is clean — I like that it reuses sandboxConnect rather than inventing a new connection path.

Thinking through the call chain: reconnect calls recoverRegistryEntries() which may probe the gateway (via recoverRegistryFromLiveGatewayrecoverNamedGatewayRuntime), then calls sandboxConnect()ensureLiveSandboxOrExit() which can call recoverNamedGatewayRuntime() again. In the worst case, is that a double gateway probe? If so, is the cost low enough not to worry about, or would it be worth short-circuiting the second one if recovery already succeeded?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agreed with the UX concern behind this thread and changed the shape after that review pass: the top-level reconnect command is gone now, and recovery moved onto nemoclaw <name> connect. So this specific double-probe path is no longer part of the PR. The updated flow is: dispatch sees <name> connect, attempts recovery for that requested sandbox, and if recovery succeeds it drops straight into the existing sandboxConnect() path.

@kjw3 kjw3 marked this pull request as draft March 31, 2026 15:59
@kjw3 kjw3 marked this pull request as ready for review March 31, 2026 17:28
@kjw3
Copy link
Copy Markdown
Contributor Author

kjw3 commented Mar 31, 2026

Updated this PR to follow the product direction from the review discussion:

  • removed the top-level nemoclaw reconnect command
  • moved recovery onto nemoclaw <sandbox-name> connect, which now attempts local inventory recovery before the unknown-command path
  • kept list as the recovery entry point for rebuilding local inventory after reboot

I also refreshed the PR body to match the new shape and replaced the old validation notes with the current 5191d0d validation:

  • mac: connect-focused vitest pass
  • brev-cpu: passed using explicit Node v22.22.2 from ~/.nvm
  • brev-gpu: passed using explicit Node v22.22.1 from ~/.nvm
  • spark: passed using Node v22.22.1

For the cross-platform environment checks, I validated the recovery path directly: nemoclaw alpha connect recovers missing local registry state, probes openshell sandbox list, and then reaches openshell sandbox get/connect alpha on the updated branch.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
bin/nemoclaw.js (1)

132-139: ⚠️ Potential issue | 🟠 Major

Validate recovered sandbox names before persisting them.

Line 132 upserts names from recovery sources without the same validation used for user input paths. This can persist invalid names into sandboxes.json and create follow-on failures in command flows that assume validated names.

🛠️ Suggested hardening
+function isValidRecoveredSandboxName(name) {
+  return typeof name === "string" && /^[a-z0-9][a-z0-9-]*[a-z0-9]$/.test(name);
+}
+
 function upsertRecoveredSandbox(name, metadata = {}) {
+  if (!isValidRecoveredSandboxName(name)) {
+    return false;
+  }
   const entry = buildRecoveredSandboxEntry(name, metadata);
   if (registry.getSandbox(name)) {
     registry.updateSandbox(name, entry);
     return false;
   }
   registry.registerSandbox(entry);
   return true;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 132 - 139, upsertRecoveredSandbox currently
persists recovered names without validation; before calling
buildRecoveredSandboxEntry or registry.registerSandbox/updateSandbox, run the
recovered name through the same sandbox name validation/sanitization used for
user-entered paths (e.g. the project's validate/sanitize function for sandbox
names) and skip or normalize any invalid names; specifically, validate the name
variable at the top of upsertRecoveredSandbox, return false or log and skip when
invalid, and only call buildRecoveredSandboxEntry, registry.updateSandbox or
registry.registerSandbox for names that pass validation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/nemoclaw.js`:
- Around line 117-129: buildRecoveredSandboxEntry correctly includes
nimContainer in the recovered entry, but registerSandbox in bin/lib/registry.js
does not persist nimContainer, so recovered entries lose their container
linkage; update the registerSandbox(sandbox) implementation to accept and store
sandbox.nimContainer (alongside name, model, provider, gpuEnabled, policies)
when creating/persisting the sandbox record and ensure any normalization/DB
write paths (e.g., the code paths that previously handled
model/provider/policies) include nimContainer so the value from
buildRecoveredSandboxEntry is retained after registry.registerSandbox(entry) is
called.

---

Duplicate comments:
In `@bin/nemoclaw.js`:
- Around line 132-139: upsertRecoveredSandbox currently persists recovered names
without validation; before calling buildRecoveredSandboxEntry or
registry.registerSandbox/updateSandbox, run the recovered name through the same
sandbox name validation/sanitization used for user-entered paths (e.g. the
project's validate/sanitize function for sandbox names) and skip or normalize
any invalid names; specifically, validate the name variable at the top of
upsertRecoveredSandbox, return false or log and skip when invalid, and only call
buildRecoveredSandboxEntry, registry.updateSandbox or registry.registerSandbox
for names that pass validation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6e4c2dbb-46da-4781-bbcf-703787e06543

📥 Commits

Reviewing files that changed from the base of the PR and between afbd703 and 888b19e.

📒 Files selected for processing (2)
  • bin/nemoclaw.js
  • test/cli.test.js

Comment on lines +117 to +129
function buildRecoveredSandboxEntry(name, metadata = {}) {
return {
name,
model: metadata.model || null,
provider: metadata.provider || null,
gpuEnabled: metadata.gpuEnabled === true,
policies: Array.isArray(metadata.policies)
? metadata.policies
: Array.isArray(metadata.policyPresets)
? metadata.policyPresets
: [],
nimContainer: metadata.nimContainer || null,
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Recovered nimContainer metadata is still lost on first insert.

Line 128 includes nimContainer, but Line 138 inserts through registry.registerSandbox(entry), and bin/lib/registry.js currently does not persist nimContainer in registerSandbox(). Result: newly recovered entries lose container linkage immediately.

🧩 Cross-file fix (`bin/lib/registry.js`)
 function registerSandbox(entry) {
   return withLock(() => {
     const data = load();
     const normalized = normalizeEntry(entry);
     if (!normalized) {
       throw new Error(`Invalid sandbox entry: ${entry.name}`);
     }
     data.sandboxes[normalized.name] = {
       name: normalized.name,
       model: normalized.model,
       provider: normalized.provider,
       gpuEnabled: normalized.gpuEnabled,
       policies: normalized.policies || [],
+      nimContainer: normalized.nimContainer,
     };
     if (!data.defaultSandbox) {
       data.defaultSandbox = entry.name;
     }
     save(data);
   });
 }

Also applies to: 132-139

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 117 - 129, buildRecoveredSandboxEntry correctly
includes nimContainer in the recovered entry, but registerSandbox in
bin/lib/registry.js does not persist nimContainer, so recovered entries lose
their container linkage; update the registerSandbox(sandbox) implementation to
accept and store sandbox.nimContainer (alongside name, model, provider,
gpuEnabled, policies) when creating/persisting the sandbox record and ensure any
normalization/DB write paths (e.g., the code paths that previously handled
model/provider/policies) include nimContainer so the value from
buildRecoveredSandboxEntry is retained after registry.registerSandbox(entry) is
called.

@kjw3 kjw3 requested a review from brandonpelfrey March 31, 2026 17:38
Copy link
Copy Markdown
Contributor

@ericksoa ericksoa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good — all prior comments addressed. The recovery pipeline is well-factored, test coverage is solid (10 new tests including partial-registry cases), and the connect-based recovery is a cleaner UX than a separate reconnect command.

Two minor follow-ups for a future PR, neither blocking:

  • Inheriting stderr in follow mode once the version gate is the sole compat check
  • Validating recovered sandbox names before registry writes for consistency

Approving.

@kjw3 kjw3 removed the request for review from brandonpelfrey March 31, 2026 17:51
@kjw3
Copy link
Copy Markdown
Contributor Author

kjw3 commented Mar 31, 2026

Thanks, Aaron. I'll merge once all CI goes green

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
bin/nemoclaw.js (1)

149-163: ⚠️ Potential issue | 🟠 Major

Partial registries can still skip the live-gateway rebuild.

Line 160 only enables recovery when the registry is empty, the requested sandbox is missing, or the session sandbox is missing. If sandboxes.json already contains the session sandbox but is still missing other live sandboxes, Line 718 never reaches recoverRegistryFromLiveGateway(), so nemoclaw list stays incomplete after reboot.

🛠️ Suggested direction
-function shouldRecoverRegistryEntries(current, session, requestedSandboxName) {
+function shouldRecoverRegistryEntries(current, session, requestedSandboxName, { probeLiveGateway = false } = {}) {
   const hasSessionSandbox = Boolean(session?.sandboxName);
   const missingSessionSandbox =
     hasSessionSandbox &&
     !current.sandboxes.some((sandbox) => sandbox.name === session.sandboxName);
   const missingRequestedSandbox =
     Boolean(requestedSandboxName) &&
     !current.sandboxes.some((sandbox) => sandbox.name === requestedSandboxName);
   const hasRecoverySeed = current.sandboxes.length > 0 || hasSessionSandbox || Boolean(requestedSandboxName);
   return {
     missingRequestedSandbox,
     shouldRecover:
       hasRecoverySeed &&
-      (current.sandboxes.length === 0 || missingRequestedSandbox || missingSessionSandbox),
+      (current.sandboxes.length === 0 || missingRequestedSandbox || missingSessionSandbox || probeLiveGateway),
   };
 }

-async function recoverRegistryEntries({ requestedSandboxName = null } = {}) {
+async function recoverRegistryEntries({ requestedSandboxName = null, probeLiveGateway = false } = {}) {
   const current = registry.listSandboxes();
   const session = onboardSession.loadSession();
-  const recoveryCheck = shouldRecoverRegistryEntries(current, session, requestedSandboxName);
+  const recoveryCheck = shouldRecoverRegistryEntries(current, session, requestedSandboxName, { probeLiveGateway });
   if (!recoveryCheck.shouldRecover) {
     return { ...current, recoveredFromSession: false, recoveredFromGateway: 0 };
   }

   const seeded = seedRecoveryMetadata(current, session, requestedSandboxName);
-  const shouldProbeLiveGateway = current.sandboxes.length > 0 || Boolean(session?.sandboxName);
+  const shouldProbeLiveGateway = probeLiveGateway || current.sandboxes.length > 0 || Boolean(session?.sandboxName);
-  const recovery = await recoverRegistryEntries();
+  const recovery = await recoverRegistryEntries({ probeLiveGateway: true });

Also applies to: 226-238, 717-719

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 149 - 163, The current
shouldRecoverRegistryEntries logic misses the case of a partial registry
(registry contains some sandboxes but is missing other live sandboxes), so
change the function signature shouldRecoverRegistryEntries(current, session,
requestedSandboxName, hasMissingLiveSandboxes=false) and include
hasMissingLiveSandboxes in the shouldRecover condition (i.e., set shouldRecover
to true when hasMissingLiveSandboxes is true OR when current.sandboxes.length
=== 0 OR missingRequestedSandbox OR missingSessionSandbox). Update all call
sites (the code around the uses at the blocks referenced near the original diffs
and where recoverRegistryFromLiveGateway() is invoked) to compute/pass
hasMissingLiveSandboxes (by comparing current.sandboxes to the live-gateway list
before calling this helper) so recoverRegistryFromLiveGateway() runs for partial
registries as well.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/nemoclaw.js`:
- Around line 217-223: The recovery logic in applyRecoveredDefault currently
prefers requestedSandboxName even when a currentDefaultSandbox exists, causing
silent global default changes; change it so the default is only set when there
is no currentDefaultSandbox (i.e., only use requestedSandboxName or
session.sandboxName when currentDefaultSandbox is falsy), then call
registry.setDefault(preferredDefault) only if preferredDefault is truthy and
exists in recovered.sandboxes; update the same pattern at the other occurrence
(around lines 1052-1057) to avoid silently retargeting the global default during
connect recovery.

---

Duplicate comments:
In `@bin/nemoclaw.js`:
- Around line 149-163: The current shouldRecoverRegistryEntries logic misses the
case of a partial registry (registry contains some sandboxes but is missing
other live sandboxes), so change the function signature
shouldRecoverRegistryEntries(current, session, requestedSandboxName,
hasMissingLiveSandboxes=false) and include hasMissingLiveSandboxes in the
shouldRecover condition (i.e., set shouldRecover to true when
hasMissingLiveSandboxes is true OR when current.sandboxes.length === 0 OR
missingRequestedSandbox OR missingSessionSandbox). Update all call sites (the
code around the uses at the blocks referenced near the original diffs and where
recoverRegistryFromLiveGateway() is invoked) to compute/pass
hasMissingLiveSandboxes (by comparing current.sandboxes to the live-gateway list
before calling this helper) so recoverRegistryFromLiveGateway() runs for partial
registries as well.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: efe6fa8b-b153-4fb7-ab88-c4aac028fe75

📥 Commits

Reviewing files that changed from the base of the PR and between 888b19e and 76b210f.

📒 Files selected for processing (2)
  • bin/nemoclaw.js
  • test/cli.test.js

Comment on lines +174 to +188
metadataByName.set(
session.sandboxName,
buildRecoveredSandboxEntry(session.sandboxName, {
model: session.model || null,
provider: session.provider || null,
nimContainer: session.nimContainer || null,
policyPresets: session.policyPresets || null,
})
);
const sessionSandboxMissing = !current.sandboxes.some((sandbox) => sandbox.name === session.sandboxName);
const shouldRecoverSessionSandbox =
current.sandboxes.length === 0 || sessionSandboxMissing || requestedSandboxName === session.sandboxName;
if (shouldRecoverSessionSandbox) {
recoveredFromSession = upsertRecoveredSandbox(session.sandboxName, metadataByName.get(session.sandboxName));
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

REGISTRY_FILE="$(fd '^registry\.js$' bin | head -n1)"
echo "Inspecting: $REGISTRY_FILE"
rg -n -C6 'function normalizeEntry|function updateSandbox|gpuEnabled|policies|nimContainer' "$REGISTRY_FILE"

Repository: NVIDIA/NemoClaw

Length of output: 929


🏁 Script executed:

cat -n bin/lib/registry.js | sed -n '165,200p'

Repository: NVIDIA/NemoClaw

Length of output: 1211


🏁 Script executed:

cat -n bin/nemoclaw.js | sed -n '174,215p'

Repository: NVIDIA/NemoClaw

Length of output: 1878


🏁 Script executed:

rg -n -A15 'function buildRecoveredSandboxEntry|function upsertRecoveredSandbox' bin/nemoclaw.js

Repository: NVIDIA/NemoClaw

Length of output: 999


Session recovery can reset critical sandbox fields when updating existing sandboxes.

Line 142 in bin/lib/registry.js calls updateSandbox() with an entry created by buildRecoveredSandboxEntry(). Although updateSandbox() uses Object.assign() (a merge operation at line 172), the recovered entry is built from session snapshots that don't include all original fields. Specifically:

  • buildRecoveredSandboxEntry() defaults gpuEnabled to false (line 122) if missing from metadata
  • It defaults policies to [] (line 123–127) if missing from metadata

When seedRecoveryMetadata() processes a session at lines 174–182 and later passes it through upsertRecoveredSandbox(), these incomplete snapshots will overwrite existing sandbox fields with defaults, silently resetting gpuEnabled, policies, or nimContainer on already-registered sandboxes.

Preserve existing fields by not overwriting entries in metadataByName when they already exist, or ensure session snapshots only update fields that were actually saved in the session.

Comment on lines +217 to +223
function applyRecoveredDefault(currentDefaultSandbox, requestedSandboxName, session) {
const recovered = registry.listSandboxes();
const preferredDefault = requestedSandboxName || (!currentDefaultSandbox ? session?.sandboxName || null : null);
if (preferredDefault && recovered.sandboxes.some((sandbox) => sandbox.name === preferredDefault)) {
registry.setDefault(preferredDefault);
}
return registry.listSandboxes();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

connect recovery shouldn't silently retarget the default sandbox.

Line 219 prefers requestedSandboxName even when currentDefaultSandbox is already set. That means beta connect can repair the registry and also change later global commands to operate on beta, which is a surprising side effect for a recovery path.

🛠️ Suggested fix
 function applyRecoveredDefault(currentDefaultSandbox, requestedSandboxName, session) {
   const recovered = registry.listSandboxes();
-  const preferredDefault = requestedSandboxName || (!currentDefaultSandbox ? session?.sandboxName || null : null);
+  const preferredDefault = currentDefaultSandbox
+    ? null
+    : requestedSandboxName || session?.sandboxName || null;
   if (preferredDefault && recovered.sandboxes.some((sandbox) => sandbox.name === preferredDefault)) {
     registry.setDefault(preferredDefault);
   }
   return registry.listSandboxes();
 }

Also applies to: 1052-1057

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 217 - 223, The recovery logic in
applyRecoveredDefault currently prefers requestedSandboxName even when a
currentDefaultSandbox exists, causing silent global default changes; change it
so the default is only set when there is no currentDefaultSandbox (i.e., only
use requestedSandboxName or session.sandboxName when currentDefaultSandbox is
falsy), then call registry.setDefault(preferredDefault) only if preferredDefault
is truthy and exists in recovered.sandboxes; update the same pattern at the
other occurrence (around lines 1052-1057) to avoid silently retargeting the
global default during connect recovery.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
test/cli.test.js (1)

349-845: Consider extracting shared test fixtures to reduce duplication.

The repeated onboard-session payloads and openshell shim scaffolding make this section expensive to maintain; a few helpers (e.g., writeRegistry, writeSession, writeOpenshellShim) would reduce drift risk.

♻️ Suggested refactor sketch
+function writeRegistry(nemoclawDir, sandboxes, defaultSandbox) {
+  fs.writeFileSync(
+    path.join(nemoclawDir, "sandboxes.json"),
+    JSON.stringify({ sandboxes, defaultSandbox }),
+    { mode: 0o600 }
+  );
+}
+
+function writeSession(nemoclawDir, overrides = {}) {
+  const base = {
+    version: 1,
+    sessionId: "session-1",
+    resumable: true,
+    status: "complete",
+    mode: "interactive",
+    startedAt: "2026-03-31T00:00:00.000Z",
+    updatedAt: "2026-03-31T00:00:00.000Z",
+    sandboxName: "alpha",
+    provider: "nvidia-prod",
+    model: "nvidia/nemotron-3-super-120b-a12b",
+    policyPresets: ["pypi"],
+    steps: {},
+  };
+  fs.writeFileSync(
+    path.join(nemoclawDir, "onboard-session.json"),
+    JSON.stringify({ ...base, ...overrides }, null, 2),
+    { mode: 0o600 }
+  );
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/cli.test.js` around lines 349 - 845, Extract the repeated test setup
into helpers to reduce duplication: create functions like
writeRegistry(nemoclawDir, data) to write sandboxes.json,
writeSession(nemoclawDir, sessionPayload) to write onboard-session.json, and
writeOpenshellShim(localBin, scriptVariants) to create the openshell shim;
update tests that call runWithEnv to invoke these helpers (replace the inline
fs.writeFileSync blocks and repeated JSON payloads) so each test only specifies
the varying parts (e.g., sandbox names, sandbox list output lines, markerFile)
while the helper handles file creation, modes, and common fields.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@test/cli.test.js`:
- Around line 349-845: Extract the repeated test setup into helpers to reduce
duplication: create functions like writeRegistry(nemoclawDir, data) to write
sandboxes.json, writeSession(nemoclawDir, sessionPayload) to write
onboard-session.json, and writeOpenshellShim(localBin, scriptVariants) to create
the openshell shim; update tests that call runWithEnv to invoke these helpers
(replace the inline fs.writeFileSync blocks and repeated JSON payloads) so each
test only specifies the varying parts (e.g., sandbox names, sandbox list output
lines, markerFile) while the helper handles file creation, modes, and common
fields.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4bf4b20a-3871-4463-a533-c2adbc67ee8f

📥 Commits

Reviewing files that changed from the base of the PR and between 76b210f and b9cc576.

📒 Files selected for processing (1)
  • test/cli.test.js

@kjw3 kjw3 merged commit 1b8e6b9 into main Mar 31, 2026
8 checks passed
@kjw3 kjw3 deleted the fix/logs-follow-reconnect-reboot branch March 31, 2026 20:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants