fix: clear stale SSH host keys after gateway restart by dknos · Pull Request #1087 · NVIDIA/NemoClaw

dknos · 2026-03-30T07:45:32Z

Summary

- Purges stale openshell-* entries from ~/.ssh/known_hosts when the gateway is destroyed during nemoclaw onboard, so freshly regenerated mTLS certificates aren't rejected by leftover host key fingerprints.
- Switches debug.sh sandbox SSH connections to use ephemeral known_hosts files with StrictHostKeyChecking=accept-new instead of the persistent system file.

Root cause

After a Docker container restart (or Mac reboot), the gateway regenerates its mTLS certificates. However, ~/.ssh/known_hosts still contains the old host key fingerprint for openshell-<sandbox>. SSH rejects the new certificate with "handshake verification failed" — the only workaround was destroying and recreating the entire sandbox.

Test plan

- Reboot Docker / restart the container
- Run `nemoclaw <name> status`: should reconnect without errors
- Run `nemoclaw onboard` on a fresh setup: verify SSH connections succeed
- Run `nemoclaw debug`: verify sandbox introspection works

Fixes #768

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Automatically remove stale SSH host keys before launching a new sandbox gateway to reduce connection failures from leftover entries.
- Improve sandbox diagnostics SSH behavior by using a temporary known-hosts file, cleaning it up on exit, and switching to stricter host key verification for safer connections.

After a Docker container restart, the gateway regenerates its mTLS certificates but stale host key entries in ~/.ssh/known_hosts cause SSH handshake verification failures.

This fix:
- Purges old openshell host key entries from known_hosts when the gateway is destroyed during onboard
- Uses ephemeral known_hosts files for sandbox SSH connections in debug.sh to avoid accumulating stale entries

Fixes NVIDIA#768

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai · 2026-03-30T07:50:34Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: eab4c422-7f9c-4082-9628-cc487aba1420

📥 Commits

Reviewing files that changed from the base of the PR and between e28ce48 and 9a26302.

📒 Files selected for processing (1)

bin/lib/onboard.js

✅ Files skipped from review due to trivial changes (1)

bin/lib/onboard.js

📝 Walkthrough

Walkthrough

Removes stale SSH host key entries tied to gateway names before building gateway args, and updates the sandbox debug script to use a temporary known_hosts file with UserKnownHostsFile and StrictHostKeyChecking=accept-new, plus cleanup on exit.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Gateway SSH host key cleanup `bin/lib/onboard.js` | Before constructing gateway args, runs `ssh-keygen -R openshell-${GATEWAY_NAME}` (output/errors suppressed) and, if `~/.ssh/known_hosts` exists, filters out lines whose hostname field contains `openshell-`, writing cleaned contents back (read/write errors suppressed). |
| Sandbox debug SSH behavior `scripts/debug.sh` | Creates a temp known-hosts file passed via `UserKnownHostsFile=$SANDBOX_SSH_KNOWN`, switches `StrictHostKeyChecking=no` → `StrictHostKeyChecking=accept-new`, updates SSH options arrays to include the known-hosts file, and adds a trap to remove the temp file on exit. |
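
The `scripts/debug.sh` row can be pictured with a minimal sketch. It is not the PR's code: it only builds and prints the option list instead of connecting, `openshell-demo` is a hypothetical host alias, and the `-F "$SANDBOX_SSH_CONFIG"` option from the real script is left out so the snippet runs on its own. `accept-new` needs OpenSSH 7.6 or newer.

```bash
#!/usr/bin/env bash
set -euo pipefail

TMPDIR_BASE="${TMPDIR:-/tmp}"

# Per-run known_hosts file: host keys accepted during this run never
# reach the persistent ~/.ssh/known_hosts.
SANDBOX_SSH_KNOWN=$(mktemp "${TMPDIR_BASE}/nemoclaw-ssh-known-XXXXXX")

cleanup() {
  if [ -n "${SANDBOX_SSH_KNOWN:-}" ]; then
    rm -f "$SANDBOX_SSH_KNOWN"
  fi
}
trap cleanup EXIT

# accept-new: trust a host key on first contact, refuse a key that changed.
SANDBOX_SSH_OPTS=(
  -o StrictHostKeyChecking=accept-new
  -o "UserKnownHostsFile=$SANDBOX_SSH_KNOWN"
  -o ConnectTimeout=10
)

# Hypothetical host alias, for illustration only. Print the command
# rather than opening a connection.
printf 'ssh'
printf ' %q' "${SANDBOX_SSH_OPTS[@]}" "openshell-demo"
printf '\n'
```

Because the known-hosts file starts empty on every run, a regenerated gateway key is simply accepted as new, while a key that changes in the middle of a run would still be refused.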

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 I hopped through ~/.ssh at night,
I scratched out stale keys by moonlight,
A temp-file crinkled, tidy and new,
"Accept-new" I whispered — the gateway flew,
Now tunnels hum and I nibble a carrot. 🥕

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly describes the main change: clearing stale SSH host keys after gateway restart, which aligns with the primary fix in the pull request. |
| Linked Issues check | ✅ Passed | The PR implements the core requirements from issue `#768`: removes stale SSH host key entries, uses ephemeral known_hosts in debug.sh, and eliminates the need for sandbox destroy/recreate after restarts. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to the SSH host key synchronization issue: gateway restart SSH key cleanup and debug.sh ephemeral known_hosts implementation address the root cause. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

bin/lib/onboard.js (1)

1289-1291: Narrow known_hosts pruning to host-field matches only.

Current filtering removes any line containing openshell-, which can delete unrelated entries. Match only hostnames in the first known_hosts field.

♻️ Safer filter

-      const cleaned = kh.split("\n").filter(l => !l.includes("openshell-")).join("\n");
+      const cleaned = kh
+        .split("\n")
+        .filter((line) => {
+          const trimmed = line.trim();
+          if (!trimmed || trimmed.startsWith("#")) return true;
+          const hostField = trimmed.split(/\s+/)[0];
+          return !hostField
+            .split(",")
+            .some((host) => host.startsWith("openshell-"));
+        })
+        .join("\n");
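
The same host-field rule can be tried outside Node with a small shell sketch. This is an illustration, not part of the PR: `prune_openshell`, the sample hosts, and the shortened key strings are all hypothetical. Like the JavaScript filter above, it looks only at the first whitespace-separated field, so it will not match hashed entries (those beginning with `|1|`).

```bash
#!/usr/bin/env bash
set -euo pipefail

# prune_openshell FILE: print FILE without the entries whose host field
# (first column, comma-separated) names a host starting with "openshell-".
prune_openshell() {
  awk '
    NF == 0 || $1 ~ /^#/ { print; next }    # keep blank lines and comments
    {
      n = split($1, hosts, ",")
      for (i = 1; i <= n; i++)
        if (hosts[i] ~ /^openshell-/) next  # drop this entry
      print
    }
  ' "$1"
}

# Demo input with placeholder key material (not real keys).
sample=$(mktemp)
cat > "$sample" <<'EOF'
# demo known_hosts (hypothetical values)
openshell-demo ssh-ed25519 AAAAC3NzaDemoKey1
github.com,openshell-other ssh-ed25519 AAAAC3NzaDemoKey2
example.org ssh-ed25519 AAAAC3NzaDemoKey3 backup-of-openshell-box
EOF

cleaned=$(prune_openshell "$sample")
rm -f "$sample"
printf '%s\n' "$cleaned"
```

The `example.org` line survives even though its trailing comment contains `openshell-`, which is exactly the case a plain substring match would delete by mistake.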

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1289 - 1291, The current pruning removes any
line containing "openshell-" anywhere; change the filter so it only inspects the
first whitespace-separated field (the host field) of each known_hosts line:
split each line on /\s+/ to get hostField = parts[0], split hostField by ','
into hostnames, and remove the line only if at least one hostname in that first
field matches the openshell pattern (e.g., startsWith "openshell-" or matches
/^openshell-/). Update the filter used to compute cleaned (currently l =>
!l.includes("openshell-")) to perform this host-field check and keep the
existing write-back behavior for knownHostsPath when cleaned !== kh.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 1280-1284: The try/catch block calling execFileSync("ssh-keygen",
["-R", `openshell-${SANDBOX_NAME}`], ...) references an undefined identifier
SANDBOX_NAME so the removal never runs and the ReferenceError is swallowed;
replace SANDBOX_NAME with the module's actual sandbox name variable (e.g., the
constant or parameter used elsewhere such as sandboxName, SANDBOX, or similar)
in that block (locate the identifier used for the gateway/sandbox elsewhere in
this file) and update the catch to at least log the error via the existing
logger (or console) instead of an empty catch so failures are visible.

In `@scripts/debug.sh`:
- Around line 259-260: Temporary known_hosts file created via
SANDBOX_SSH_KNOWN=$(mktemp ...) is never removed; add a cleanup that removes
SANDBOX_SSH_KNOWN on exit/failure. Implement a cleanup function (e.g.,
cleanup_sandbox_known) that checks and rm -f "$SANDBOX_SSH_KNOWN" and register
it with trap 'cleanup_sandbox_known' EXIT (and optionally INT TERM) so the file
created for SANDBOX_SSH_OPTS is always removed when the script exits or is
interrupted.

---

Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 1289-1291: The current pruning removes any line containing
"openshell-" anywhere; change the filter so it only inspects the first
whitespace-separated field (the host field) of each known_hosts line: split each
line on /\s+/ to get hostField = parts[0], split hostField by ',' into
hostnames, and remove the line only if at least one hostname in that first field
matches the openshell pattern (e.g., startsWith "openshell-" or matches
/^openshell-/). Update the filter used to compute cleaned (currently l =>
!l.includes("openshell-")) to perform this host-field check and keep the
existing write-back behavior for knownHostsPath when cleaned !== kh.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9c5a0ac4-4fce-49ba-b643-cb13c5769a20

📥 Commits

Reviewing files that changed from the base of the PR and between f59f58e and 51a1564.

📒 Files selected for processing (2)

bin/lib/onboard.js
scripts/debug.sh

bin/lib/onboard.js

coderabbitai

♻️ Duplicate comments (1)

scripts/debug.sh (1)

287-288: ⚠️ Potential issue | 🟡 Minor

Temporary known_hosts file is still not cleaned up.

Line 287 creates SANDBOX_SSH_KNOWN, but cleanup() never removes it, so temp files are left behind after each run.

🧹 Proposed fix

 SANDBOX_SSH_CONFIG=""
+SANDBOX_SSH_KNOWN=""
 cleanup() {
   rm -rf "$COLLECT_DIR"
   if [ -n "$SANDBOX_SSH_CONFIG" ]; then
     rm -f "$SANDBOX_SSH_CONFIG"
   fi
+  if [ -n "$SANDBOX_SSH_KNOWN" ]; then
+    rm -f "$SANDBOX_SSH_KNOWN"
+  fi
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@scripts/debug.sh` around lines 287 - 288, The temp file created in
SANDBOX_SSH_KNOWN is never removed; update the cleanup() function to
unlink/remove the SANDBOX_SSH_KNOWN file if it exists (and unset the variable)
and ensure the existing trap calls cleanup on EXIT so the temp known_hosts is
deleted on script exit; reference SANDBOX_SSH_KNOWN and cleanup() when making
this change.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@scripts/debug.sh`:
- Around line 287-288: The temp file created in SANDBOX_SSH_KNOWN is never
removed; update the cleanup() function to unlink/remove the SANDBOX_SSH_KNOWN
file if it exists (and unset the variable) and ensure the existing trap calls
cleanup on EXIT so the temp known_hosts is deleted on script exit; reference
SANDBOX_SSH_KNOWN and cleanup() when making this change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f129851d-4293-48c4-a156-a17b2911c8dc

📥 Commits

Reviewing files that changed from the base of the PR and between 51a1564 and d30480b.

📒 Files selected for processing (2)

bin/lib/onboard.js
scripts/debug.sh

🚧 Files skipped from review as they are similar to previous changes (1)

bin/lib/onboard.js

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/debug.sh`:
- Around line 287-289: The new trap on EXIT overwrites the existing trap and
prevents cleanup() from running; remove the separate trap command and instead
add logic to the cleanup() function to remove the temporary SANDBOX_SSH_KNOWN
file (created as SANDBOX_SSH_KNOWN=$(mktemp ...)) and any
SANDBOX_SSH_CONFIG/COLLECT_DIR artifacts under TMPDIR_BASE so the single trap
'cleanup EXIT' (from earlier) handles all teardown; update cleanup() to test for
and rm -f "$SANDBOX_SSH_KNOWN" and ensure SANDBOX_SSH_KNOWN is set in scope when
cleanup() runs.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 668e4297-e87d-4a38-84c5-bd3b8867b207

📥 Commits

Reviewing files that changed from the base of the PR and between d30480b and e28ce48.

📒 Files selected for processing (2)

bin/lib/onboard.js
scripts/debug.sh

🚧 Files skipped from review as they are similar to previous changes (1)

bin/lib/onboard.js

coderabbitai · 2026-03-30T11:00:57Z

scripts/debug.sh

+    SANDBOX_SSH_KNOWN=$(mktemp "${TMPDIR_BASE}/nemoclaw-ssh-known-XXXXXX")
+    trap 'rm -f "$SANDBOX_SSH_KNOWN"' EXIT
+    SANDBOX_SSH_OPTS=(-F "$SANDBOX_SSH_CONFIG" -o StrictHostKeyChecking=accept-new -o "UserKnownHostsFile=$SANDBOX_SSH_KNOWN" -o ConnectTimeout=10)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n== trap declarations in scripts/debug.sh ==\n'
nl -ba scripts/debug.sh | sed -n '84,95p;284,290p'

printf '\n== bash trap behavior repro ==\n'
bash -lc '
cleanup(){ echo cleanup; }
trap cleanup EXIT
trap '\''echo sandbox_known'\'' EXIT
'

Repository: NVIDIA/NemoClaw

Length of output: 163

🏁 Script executed:

#!/bin/bash
# First, check file size to ensure we can read it safely
wc -l scripts/debug.sh

# Read the cleanup function and first trap declaration
echo "=== Lines 84-95 (cleanup function area) ==="
sed -n '84,95p' scripts/debug.sh

# Read the SANDBOX_SSH_KNOWN and second trap declaration
echo ""
echo "=== Lines 284-290 (SANDBOX_SSH_KNOWN area) ==="
sed -n '284,290p' scripts/debug.sh

# Test bash trap behavior with available tools
echo ""
echo "=== Bash trap behavior test ==="
bash << 'BASHEOF'
cleanup(){ echo "cleanup called"; }
trap cleanup EXIT
trap 'echo sandbox_known' EXIT
echo "Exiting..."
BASHEOF

Repository: NVIDIA/NemoClaw

Length of output: 955

Don't replace the existing EXIT trap here.

Line 288 overwrites trap cleanup EXIT (line 94), preventing the cleanup function from running on exit. This leaves COLLECT_DIR and SANDBOX_SSH_CONFIG behind in ${TMPDIR_BASE} on every successful run.

Fold the cleanup of SANDBOX_SSH_KNOWN into the cleanup() function instead of creating a separate trap:

Fix

 TMPDIR_BASE="${TMPDIR:-/tmp}"
 COLLECT_DIR=$(mktemp -d "${TMPDIR_BASE}/nemoclaw-debug-XXXXXX")
 SANDBOX_SSH_CONFIG=""
+SANDBOX_SSH_KNOWN=""
 cleanup() {
   rm -rf "$COLLECT_DIR"
   if [ -n "$SANDBOX_SSH_CONFIG" ]; then
     rm -f "$SANDBOX_SSH_CONFIG"
   fi
+  if [ -n "$SANDBOX_SSH_KNOWN" ]; then
+    rm -f "$SANDBOX_SSH_KNOWN"
+  fi
 }
 trap cleanup EXIT

Then at line 287-288, remove the second trap:

     SANDBOX_SSH_KNOWN=$(mktemp "${TMPDIR_BASE}/nemoclaw-ssh-known-XXXXXX")
-    trap 'rm -f "$SANDBOX_SSH_KNOWN"' EXIT
     SANDBOX_SSH_OPTS=(-F "$SANDBOX_SSH_CONFIG" -o StrictHostKeyChecking=accept-new -o "UserKnownHostsFile=$SANDBOX_SSH_KNOWN" -o ConnectTimeout=10)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@scripts/debug.sh` around lines 287 - 289, The new trap on EXIT overwrites the existing trap and prevents cleanup() from running; remove the separate trap command and instead add logic to the cleanup() function to remove the temporary SANDBOX_SSH_KNOWN file (created as SANDBOX_SSH_KNOWN=$(mktemp ...)) and any SANDBOX_SSH_CONFIG/COLLECT_DIR artifacts under TMPDIR_BASE so the single trap 'cleanup EXIT' (from earlier) handles all teardown; update cleanup() to test for and rm -f "$SANDBOX_SSH_KNOWN" and ensure SANDBOX_SSH_KNOWN is set in scope when cleanup() runs.
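
The trap behavior behind this comment can be reproduced in isolation. The sketch below is a standalone illustration, not code from `scripts/debug.sh`: the echoed strings are placeholders, and each case runs in its own `bash -c` child so its EXIT trap fires independently.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Case 1: a second `trap ... EXIT` replaces the first, so cleanup()
# never runs and its temp artifacts would be left behind.
overwritten=$(bash -c '
  cleanup() { echo "cleanup called"; }
  trap cleanup EXIT
  trap "echo sandbox_known removed" EXIT
  echo "exiting"
')

# Case 2: a single trap whose cleanup() also removes the known-hosts file.
folded=$(bash -c '
  SANDBOX_SSH_KNOWN=""
  cleanup() {
    echo "cleanup called"
    if [ -n "$SANDBOX_SSH_KNOWN" ]; then
      rm -f "$SANDBOX_SSH_KNOWN" && echo "sandbox_known removed"
    fi
  }
  trap cleanup EXIT
  SANDBOX_SSH_KNOWN=$(mktemp)
  echo "exiting"
')

printf 'overwritten:\n%s\n\nfolded:\n%s\n' "$overwritten" "$folded"
```

In the first case "cleanup called" never appears. In the second, one handler covers every temporary file, which is the shape the suggested fix gives `cleanup()`.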

cv · 2026-03-30T19:05:40Z

@dknos please ping me once you've resolved the CodeRabbit issues, I'd like to get this in soon!

wscurran · 2026-03-30T19:07:58Z

✨ Thanks for submitting this PR with a detailed summary, it addresses a bug with stale SSH host keys and proposes a fix to clear them after gateway restart, which could improve the stability and security of NemoClaw.

coderabbitai bot reviewed Mar 30, 2026

Merge branch 'main' into fix/ssh-handshake-768

d30480b

coderabbitai bot reviewed Mar 30, 2026

fix: address CodeRabbit review — scope var, temp cleanup, host-field …

e28ce48

…filter Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai bot reviewed Mar 30, 2026

WuKongAI-CMU mentioned this pull request Mar 30, 2026

feat(cli): add reconnect command #960

Open

Merge branch 'main' into fix/ssh-handshake-768

9a26302

wscurran added labels Mar 30, 2026: bug (Something isn't working), OpenShell (Support for OpenShell, a safe, private runtime for autonomous AI agents), fix

dknos wants to merge 4 commits into NVIDIA:main from dknos:fix/ssh-handshake-768

4 participants