Skip to content

fix: clear stale SSH host keys after gateway restart#1087

Open
dknos wants to merge 4 commits intoNVIDIA:mainfrom
dknos:fix/ssh-handshake-768
Open

fix: clear stale SSH host keys after gateway restart#1087
dknos wants to merge 4 commits intoNVIDIA:mainfrom
dknos:fix/ssh-handshake-768

Conversation

@dknos
Copy link
Copy Markdown

@dknos dknos commented Mar 30, 2026

Summary

  • Purges stale openshell-* entries from ~/.ssh/known_hosts when the gateway is destroyed during nemoclaw onboard, so freshly regenerated mTLS certificates aren't rejected by leftover host key fingerprints
  • Switches debug.sh sandbox SSH connections to use ephemeral known_hosts files with StrictHostKeyChecking=accept-new instead of the persistent system file

Root cause

After a Docker container restart (or Mac reboot), the gateway regenerates its mTLS certificates. However, ~/.ssh/known_hosts still contains the old host key fingerprint for openshell-<sandbox>. SSH rejects the new certificate with "handshake verification failed" — the only workaround was destroying and recreating the entire sandbox.

Test plan

  • Reboot Docker / restart the container
  • Run nemoclaw <name> status — should reconnect without errors
  • Run nemoclaw onboard on a fresh setup — verify SSH connections succeed
  • Run nemoclaw debug — verify sandbox introspection works

Fixes #768

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Automatically remove stale SSH host keys before launching a new sandbox gateway to reduce connection failures from leftover entries.
    • Improve sandbox diagnostics SSH behavior by using a temporary known-hosts file, cleaning it up on exit, and switching to stricter host key verification for safer connections.

After a Docker container restart, the gateway regenerates its mTLS
certificates but stale host key entries in ~/.ssh/known_hosts cause
SSH handshake verification failures. This fix:

- Purges old openshell host key entries from known_hosts when the
  gateway is destroyed during onboard
- Uses ephemeral known_hosts files for sandbox SSH connections in
  debug.sh to avoid accumulating stale entries

Fixes NVIDIA#768

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 30, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: eab4c422-7f9c-4082-9628-cc487aba1420

📥 Commits

Reviewing files that changed from the base of the PR and between e28ce48 and 9a26302.

📒 Files selected for processing (1)
  • bin/lib/onboard.js
✅ Files skipped from review due to trivial changes (1)
  • bin/lib/onboard.js

📝 Walkthrough

Walkthrough

Removes stale SSH host key entries tied to gateway names before building gateway args, and updates the sandbox debug script to use a temporary known_hosts file with UserKnownHostsFile and StrictHostKeyChecking=accept-new, plus cleanup on exit.

Changes

Cohort / File(s) Summary
Gateway SSH host key cleanup
bin/lib/onboard.js
Before constructing gateway args, runs ssh-keygen -R openshell-${GATEWAY_NAME} (output/errors suppressed) and, if ~/.ssh/known_hosts exists, filters out lines whose hostname field contains openshell-, writing cleaned contents back (read/write errors suppressed).
Sandbox debug SSH behavior
scripts/debug.sh
Creates a temp known-hosts file passed via UserKnownHostsFile=$SANDBOX_SSH_KNOWN, switches StrictHostKeyChecking=noStrictHostKeyChecking=accept-new, updates SSH options arrays to include the known-hosts file, and adds a trap to remove the temp file on exit.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 I hopped through ~/.ssh at night,
I scratched out stale keys by moonlight,
A temp-file crinkled, tidy and new,
"Accept-new" I whispered — the gateway flew,
Now tunnels hum and I nibble a carrot. 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly describes the main change: clearing stale SSH host keys after gateway restart, which aligns with the primary fix in the pull request.
Linked Issues check ✅ Passed The PR implements the core requirements from issue #768: removes stale SSH host key entries, uses ephemeral known_hosts in debug.sh, and eliminates the need for sandbox destroy/recreate after restarts.
Out of Scope Changes check ✅ Passed All changes are directly scoped to the SSH host key synchronization issue: gateway restart SSH key cleanup and debug.sh ephemeral known_hosts implementation address the root cause.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
bin/lib/onboard.js (1)

1289-1291: Narrow known_hosts pruning to host-field matches only.

Current filtering removes any line containing openshell-, which can delete unrelated entries. Match only hostnames in the first known_hosts field.

♻️ Safer filter
-      const cleaned = kh.split("\n").filter(l => !l.includes("openshell-")).join("\n");
+      const cleaned = kh
+        .split("\n")
+        .filter((line) => {
+          const trimmed = line.trim();
+          if (!trimmed || trimmed.startsWith("#")) return true;
+          const hostField = trimmed.split(/\s+/)[0];
+          return !hostField
+            .split(",")
+            .some((host) => host.startsWith("openshell-"));
+        })
+        .join("\n");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 1289 - 1291, The current pruning removes any
line containing "openshell-" anywhere; change the filter so it only inspects the
first whitespace-separated field (the host field) of each known_hosts line:
split each line on /\s+/ to get hostField = parts[0], split hostField by ','
into hostnames, and remove the line only if at least one hostname in that first
field matches the openshell pattern (e.g., startsWith "openshell-" or matches
/^openshell-/). Update the filter used to compute cleaned (currently l =>
!l.includes("openshell-")) to perform this host-field check and keep the
existing write-back behavior for knownHostsPath when cleaned !== kh.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 1280-1284: The try/catch block calling execFileSync("ssh-keygen",
["-R", `openshell-${SANDBOX_NAME}`], ...) references an undefined identifier
SANDBOX_NAME so the removal never runs and the ReferenceError is swallowed;
replace SANDBOX_NAME with the module's actual sandbox name variable (e.g., the
constant or parameter used elsewhere such as sandboxName, SANDBOX, or similar)
in that block (locate the identifier used for the gateway/sandbox elsewhere in
this file) and update the catch to at least log the error via the existing
logger (or console) instead of an empty catch so failures are visible.

In `@scripts/debug.sh`:
- Around line 259-260: Temporary known_hosts file created via
SANDBOX_SSH_KNOWN=$(mktemp ...) is never removed; add a cleanup that removes
SANDBOX_SSH_KNOWN on exit/failure. Implement a cleanup function (e.g.,
cleanup_sandbox_known) that checks and rm -f "$SANDBOX_SSH_KNOWN" and register
it with trap 'cleanup_sandbox_known' EXIT (and optionally INT TERM) so the file
created for SANDBOX_SSH_OPTS is always removed when the script exits or is
interrupted.

---

Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 1289-1291: The current pruning removes any line containing
"openshell-" anywhere; change the filter so it only inspects the first
whitespace-separated field (the host field) of each known_hosts line: split each
line on /\s+/ to get hostField = parts[0], split hostField by ',' into
hostnames, and remove the line only if at least one hostname in that first field
matches the openshell pattern (e.g., startsWith "openshell-" or matches
/^openshell-/). Update the filter used to compute cleaned (currently l =>
!l.includes("openshell-")) to perform this host-field check and keep the
existing write-back behavior for knownHostsPath when cleaned !== kh.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9c5a0ac4-4fce-49ba-b643-cb13c5769a20

📥 Commits

Reviewing files that changed from the base of the PR and between f59f58e and 51a1564.

📒 Files selected for processing (2)
  • bin/lib/onboard.js
  • scripts/debug.sh

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
scripts/debug.sh (1)

287-288: ⚠️ Potential issue | 🟡 Minor

Temporary known_hosts file is still not cleaned up.

Line 287 creates SANDBOX_SSH_KNOWN, but cleanup() never removes it, so temp files are left behind after each run.

🧹 Proposed fix
 SANDBOX_SSH_CONFIG=""
+SANDBOX_SSH_KNOWN=""
 cleanup() {
   rm -rf "$COLLECT_DIR"
   if [ -n "$SANDBOX_SSH_CONFIG" ]; then
     rm -f "$SANDBOX_SSH_CONFIG"
   fi
+  if [ -n "$SANDBOX_SSH_KNOWN" ]; then
+    rm -f "$SANDBOX_SSH_KNOWN"
+  fi
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/debug.sh` around lines 287 - 288, The temp file created in
SANDBOX_SSH_KNOWN is never removed; update the cleanup() function to
unlink/remove the SANDBOX_SSH_KNOWN file if it exists (and unset the variable)
and ensure the existing trap calls cleanup on EXIT so the temp known_hosts is
deleted on script exit; reference SANDBOX_SSH_KNOWN and cleanup() when making
this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@scripts/debug.sh`:
- Around line 287-288: The temp file created in SANDBOX_SSH_KNOWN is never
removed; update the cleanup() function to unlink/remove the SANDBOX_SSH_KNOWN
file if it exists (and unset the variable) and ensure the existing trap calls
cleanup on EXIT so the temp known_hosts is deleted on script exit; reference
SANDBOX_SSH_KNOWN and cleanup() when making this change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f129851d-4293-48c4-a156-a17b2911c8dc

📥 Commits

Reviewing files that changed from the base of the PR and between 51a1564 and d30480b.

📒 Files selected for processing (2)
  • bin/lib/onboard.js
  • scripts/debug.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • bin/lib/onboard.js

…filter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/debug.sh`:
- Around line 287-289: The new trap on EXIT overwrites the existing trap and
prevents cleanup() from running; remove the separate trap command and instead
add logic to the cleanup() function to remove the temporary SANDBOX_SSH_KNOWN
file (created as SANDBOX_SSH_KNOWN=$(mktemp ...)) and any
SANDBOX_SSH_CONFIG/COLLECT_DIR artifacts under TMPDIR_BASE so the single trap
'cleanup EXIT' (from earlier) handles all teardown; update cleanup() to test for
and rm -f "$SANDBOX_SSH_KNOWN" and ensure SANDBOX_SSH_KNOWN is set in scope when
cleanup() runs.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 668e4297-e87d-4a38-84c5-bd3b8867b207

📥 Commits

Reviewing files that changed from the base of the PR and between d30480b and e28ce48.

📒 Files selected for processing (2)
  • bin/lib/onboard.js
  • scripts/debug.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • bin/lib/onboard.js

Comment on lines +287 to +289
SANDBOX_SSH_KNOWN=$(mktemp "${TMPDIR_BASE}/nemoclaw-ssh-known-XXXXXX")
trap 'rm -f "$SANDBOX_SSH_KNOWN"' EXIT
SANDBOX_SSH_OPTS=(-F "$SANDBOX_SSH_CONFIG" -o StrictHostKeyChecking=accept-new -o "UserKnownHostsFile=$SANDBOX_SSH_KNOWN" -o ConnectTimeout=10)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n== trap declarations in scripts/debug.sh ==\n'
nl -ba scripts/debug.sh | sed -n '84,95p;284,290p'

printf '\n== bash trap behavior repro ==\n'
bash -lc '
cleanup(){ echo cleanup; }
trap cleanup EXIT
trap '\''echo sandbox_known'\'' EXIT
'

Repository: NVIDIA/NemoClaw

Length of output: 163


🏁 Script executed:

#!/bin/bash

# First, check file size to ensure we can read it safely
wc -l scripts/debug.sh

# Read the cleanup function and first trap declaration
echo "=== Lines 84-95 (cleanup function area) ==="
sed -n '84,95p' scripts/debug.sh

# Read the SANDBOX_SSH_KNOWN and second trap declaration
echo ""
echo "=== Lines 284-290 (SANDBOX_SSH_KNOWN area) ==="
sed -n '284,290p' scripts/debug.sh

# Test bash trap behavior with available tools
echo ""
echo "=== Bash trap behavior test ==="
bash << 'BASHEOF'
cleanup(){ echo "cleanup called"; }
trap cleanup EXIT
trap 'echo sandbox_known' EXIT
echo "Exiting..."
BASHEOF

Repository: NVIDIA/NemoClaw

Length of output: 955


Don't replace the existing EXIT trap here.

Line 288 overwrites trap cleanup EXIT (line 94), preventing the cleanup function from running on exit. This leaves COLLECT_DIR and SANDBOX_SSH_CONFIG behind in ${TMPDIR_BASE} on every successful run.

Fold the cleanup of SANDBOX_SSH_KNOWN into the cleanup() function instead of creating a separate trap:

Fix
 TMPDIR_BASE="${TMPDIR:-/tmp}"
 COLLECT_DIR=$(mktemp -d "${TMPDIR_BASE}/nemoclaw-debug-XXXXXX")
 SANDBOX_SSH_CONFIG=""
+SANDBOX_SSH_KNOWN=""
 cleanup() {
   rm -rf "$COLLECT_DIR"
   if [ -n "$SANDBOX_SSH_CONFIG" ]; then
     rm -f "$SANDBOX_SSH_CONFIG"
   fi
+  if [ -n "$SANDBOX_SSH_KNOWN" ]; then
+    rm -f "$SANDBOX_SSH_KNOWN"
+  fi
 }
 trap cleanup EXIT

Then at line 287-288, remove the second trap:

     SANDBOX_SSH_KNOWN=$(mktemp "${TMPDIR_BASE}/nemoclaw-ssh-known-XXXXXX")
-    trap 'rm -f "$SANDBOX_SSH_KNOWN"' EXIT
     SANDBOX_SSH_OPTS=(-F "$SANDBOX_SSH_CONFIG" -o StrictHostKeyChecking=accept-new -o "UserKnownHostsFile=$SANDBOX_SSH_KNOWN" -o ConnectTimeout=10)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/debug.sh` around lines 287 - 289, The new trap on EXIT overwrites the
existing trap and prevents cleanup() from running; remove the separate trap
command and instead add logic to the cleanup() function to remove the temporary
SANDBOX_SSH_KNOWN file (created as SANDBOX_SSH_KNOWN=$(mktemp ...)) and any
SANDBOX_SSH_CONFIG/COLLECT_DIR artifacts under TMPDIR_BASE so the single trap
'cleanup EXIT' (from earlier) handles all teardown; update cleanup() to test for
and rm -f "$SANDBOX_SSH_KNOWN" and ensure SANDBOX_SSH_KNOWN is set in scope when
cleanup() runs.

@cv
Copy link
Copy Markdown
Contributor

cv commented Mar 30, 2026

@dknos please ping me once you've resolved the CodeRabbit issues, I'd like to get this in soon!

@wscurran wscurran added bug Something isn't working OpenShell Support for OpenShell, a safe, private runtime for autonomous AI agents fix labels Mar 30, 2026
@wscurran
Copy link
Copy Markdown
Contributor

✨ Thanks for submitting this PR with a detailed summary, it addresses a bug with stale SSH host keys and proposes a fix to clear them after gateway restart, which could improve the stability and security of NemoClaw.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working fix OpenShell Support for OpenShell, a safe, private runtime for autonomous AI agents

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[macOS] SSH handshake verification fails after Docker container restart — sandbox requires destroy/recreate

4 participants