fix: clear stale SSH host keys after gateway restart#1087
fix: clear stale SSH host keys after gateway restart#1087dknos wants to merge 4 commits intoNVIDIA:mainfrom
Conversation
After a Docker container restart, the gateway regenerates its mTLS certificates but stale host key entries in ~/.ssh/known_hosts cause SSH handshake verification failures. This fix: - Purges old openshell host key entries from known_hosts when the gateway is destroyed during onboard - Uses ephemeral known_hosts files for sandbox SSH connections in debug.sh to avoid accumulating stale entries Fixes NVIDIA#768 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthroughRemoves stale SSH host key entries tied to gateway names before building gateway args, and updates the sandbox debug script to use a temporary known_hosts file with Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
bin/lib/onboard.js (1)
1289-1291: Narrow known_hosts pruning to host-field matches only.Current filtering removes any line containing
openshell-, which can delete unrelated entries. Match only hostnames in the first known_hosts field.♻️ Safer filter
- const cleaned = kh.split("\n").filter(l => !l.includes("openshell-")).join("\n"); + const cleaned = kh + .split("\n") + .filter((line) => { + const trimmed = line.trim(); + if (!trimmed || trimmed.startsWith("#")) return true; + const hostField = trimmed.split(/\s+/)[0]; + return !hostField + .split(",") + .some((host) => host.startsWith("openshell-")); + }) + .join("\n");🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 1289 - 1291, The current pruning removes any line containing "openshell-" anywhere; change the filter so it only inspects the first whitespace-separated field (the host field) of each known_hosts line: split each line on /\s+/ to get hostField = parts[0], split hostField by ',' into hostnames, and remove the line only if at least one hostname in that first field matches the openshell pattern (e.g., startsWith "openshell-" or matches /^openshell-/). Update the filter used to compute cleaned (currently l => !l.includes("openshell-")) to perform this host-field check and keep the existing write-back behavior for knownHostsPath when cleaned !== kh.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard.js`:
- Around line 1280-1284: The try/catch block calling execFileSync("ssh-keygen",
["-R", `openshell-${SANDBOX_NAME}`], ...) references an undefined identifier
SANDBOX_NAME so the removal never runs and the ReferenceError is swallowed;
replace SANDBOX_NAME with the module's actual sandbox name variable (e.g., the
constant or parameter used elsewhere such as sandboxName, SANDBOX, or similar)
in that block (locate the identifier used for the gateway/sandbox elsewhere in
this file) and update the catch to at least log the error via the existing
logger (or console) instead of an empty catch so failures are visible.
In `@scripts/debug.sh`:
- Around line 259-260: Temporary known_hosts file created via
SANDBOX_SSH_KNOWN=$(mktemp ...) is never removed; add a cleanup that removes
SANDBOX_SSH_KNOWN on exit/failure. Implement a cleanup function (e.g.,
cleanup_sandbox_known) that checks and rm -f "$SANDBOX_SSH_KNOWN" and register
it with trap 'cleanup_sandbox_known' EXIT (and optionally INT TERM) so the file
created for SANDBOX_SSH_OPTS is always removed when the script exits or is
interrupted.
---
Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 1289-1291: The current pruning removes any line containing
"openshell-" anywhere; change the filter so it only inspects the first
whitespace-separated field (the host field) of each known_hosts line: split each
line on /\s+/ to get hostField = parts[0], split hostField by ',' into
hostnames, and remove the line only if at least one hostname in that first field
matches the openshell pattern (e.g., startsWith "openshell-" or matches
/^openshell-/). Update the filter used to compute cleaned (currently l =>
!l.includes("openshell-")) to perform this host-field check and keep the
existing write-back behavior for knownHostsPath when cleaned !== kh.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9c5a0ac4-4fce-49ba-b643-cb13c5769a20
📒 Files selected for processing (2)
bin/lib/onboard.jsscripts/debug.sh
There was a problem hiding this comment.
♻️ Duplicate comments (1)
scripts/debug.sh (1)
287-288:⚠️ Potential issue | 🟡 MinorTemporary known_hosts file is still not cleaned up.
Line 287 creates
SANDBOX_SSH_KNOWN, butcleanup()never removes it, so temp files are left behind after each run.🧹 Proposed fix
SANDBOX_SSH_CONFIG="" +SANDBOX_SSH_KNOWN="" cleanup() { rm -rf "$COLLECT_DIR" if [ -n "$SANDBOX_SSH_CONFIG" ]; then rm -f "$SANDBOX_SSH_CONFIG" fi + if [ -n "$SANDBOX_SSH_KNOWN" ]; then + rm -f "$SANDBOX_SSH_KNOWN" + fi }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/debug.sh` around lines 287 - 288, The temp file created in SANDBOX_SSH_KNOWN is never removed; update the cleanup() function to unlink/remove the SANDBOX_SSH_KNOWN file if it exists (and unset the variable) and ensure the existing trap calls cleanup on EXIT so the temp known_hosts is deleted on script exit; reference SANDBOX_SSH_KNOWN and cleanup() when making this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@scripts/debug.sh`:
- Around line 287-288: The temp file created in SANDBOX_SSH_KNOWN is never
removed; update the cleanup() function to unlink/remove the SANDBOX_SSH_KNOWN
file if it exists (and unset the variable) and ensure the existing trap calls
cleanup on EXIT so the temp known_hosts is deleted on script exit; reference
SANDBOX_SSH_KNOWN and cleanup() when making this change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f129851d-4293-48c4-a156-a17b2911c8dc
📒 Files selected for processing (2)
bin/lib/onboard.jsscripts/debug.sh
🚧 Files skipped from review as they are similar to previous changes (1)
- bin/lib/onboard.js
…filter Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/debug.sh`:
- Around line 287-289: The new trap on EXIT overwrites the existing trap and
prevents cleanup() from running; remove the separate trap command and instead
add logic to the cleanup() function to remove the temporary SANDBOX_SSH_KNOWN
file (created as SANDBOX_SSH_KNOWN=$(mktemp ...)) and any
SANDBOX_SSH_CONFIG/COLLECT_DIR artifacts under TMPDIR_BASE so the single trap
'cleanup EXIT' (from earlier) handles all teardown; update cleanup() to test for
and rm -f "$SANDBOX_SSH_KNOWN" and ensure SANDBOX_SSH_KNOWN is set in scope when
cleanup() runs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 668e4297-e87d-4a38-84c5-bd3b8867b207
📒 Files selected for processing (2)
bin/lib/onboard.jsscripts/debug.sh
🚧 Files skipped from review as they are similar to previous changes (1)
- bin/lib/onboard.js
| SANDBOX_SSH_KNOWN=$(mktemp "${TMPDIR_BASE}/nemoclaw-ssh-known-XXXXXX") | ||
| trap 'rm -f "$SANDBOX_SSH_KNOWN"' EXIT | ||
| SANDBOX_SSH_OPTS=(-F "$SANDBOX_SSH_CONFIG" -o StrictHostKeyChecking=accept-new -o "UserKnownHostsFile=$SANDBOX_SSH_KNOWN" -o ConnectTimeout=10) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf '\n== trap declarations in scripts/debug.sh ==\n'
nl -ba scripts/debug.sh | sed -n '84,95p;284,290p'
printf '\n== bash trap behavior repro ==\n'
bash -lc '
cleanup(){ echo cleanup; }
trap cleanup EXIT
trap '\''echo sandbox_known'\'' EXIT
'Repository: NVIDIA/NemoClaw
Length of output: 163
🏁 Script executed:
#!/bin/bash
# First, check file size to ensure we can read it safely
wc -l scripts/debug.sh
# Read the cleanup function and first trap declaration
echo "=== Lines 84-95 (cleanup function area) ==="
sed -n '84,95p' scripts/debug.sh
# Read the SANDBOX_SSH_KNOWN and second trap declaration
echo ""
echo "=== Lines 284-290 (SANDBOX_SSH_KNOWN area) ==="
sed -n '284,290p' scripts/debug.sh
# Test bash trap behavior with available tools
echo ""
echo "=== Bash trap behavior test ==="
bash << 'BASHEOF'
cleanup(){ echo "cleanup called"; }
trap cleanup EXIT
trap 'echo sandbox_known' EXIT
echo "Exiting..."
BASHEOFRepository: NVIDIA/NemoClaw
Length of output: 955
Don't replace the existing EXIT trap here.
Line 288 overwrites trap cleanup EXIT (line 94), preventing the cleanup function from running on exit. This leaves COLLECT_DIR and SANDBOX_SSH_CONFIG behind in ${TMPDIR_BASE} on every successful run.
Fold the cleanup of SANDBOX_SSH_KNOWN into the cleanup() function instead of creating a separate trap:
Fix
TMPDIR_BASE="${TMPDIR:-/tmp}"
COLLECT_DIR=$(mktemp -d "${TMPDIR_BASE}/nemoclaw-debug-XXXXXX")
SANDBOX_SSH_CONFIG=""
+SANDBOX_SSH_KNOWN=""
cleanup() {
rm -rf "$COLLECT_DIR"
if [ -n "$SANDBOX_SSH_CONFIG" ]; then
rm -f "$SANDBOX_SSH_CONFIG"
fi
+ if [ -n "$SANDBOX_SSH_KNOWN" ]; then
+ rm -f "$SANDBOX_SSH_KNOWN"
+ fi
}
trap cleanup EXITThen at line 287-288, remove the second trap:
SANDBOX_SSH_KNOWN=$(mktemp "${TMPDIR_BASE}/nemoclaw-ssh-known-XXXXXX")
- trap 'rm -f "$SANDBOX_SSH_KNOWN"' EXIT
SANDBOX_SSH_OPTS=(-F "$SANDBOX_SSH_CONFIG" -o StrictHostKeyChecking=accept-new -o "UserKnownHostsFile=$SANDBOX_SSH_KNOWN" -o ConnectTimeout=10)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/debug.sh` around lines 287 - 289, The new trap on EXIT overwrites the
existing trap and prevents cleanup() from running; remove the separate trap
command and instead add logic to the cleanup() function to remove the temporary
SANDBOX_SSH_KNOWN file (created as SANDBOX_SSH_KNOWN=$(mktemp ...)) and any
SANDBOX_SSH_CONFIG/COLLECT_DIR artifacts under TMPDIR_BASE so the single trap
'cleanup EXIT' (from earlier) handles all teardown; update cleanup() to test for
and rm -f "$SANDBOX_SSH_KNOWN" and ensure SANDBOX_SSH_KNOWN is set in scope when
cleanup() runs.
|
@dknos please ping me once you've resolved the CodeRabbit issues, I'd like to get this in soon! |
|
✨ Thanks for submitting this PR with a detailed summary, it addresses a bug with stale SSH host keys and proposes a fix to clear them after gateway restart, which could improve the stability and security of NemoClaw. |
Summary
openshell-*entries from~/.ssh/known_hostswhen the gateway is destroyed duringnemoclaw onboard, so freshly regenerated mTLS certificates aren't rejected by leftover host key fingerprintsdebug.shsandbox SSH connections to use ephemeralknown_hostsfiles withStrictHostKeyChecking=accept-newinstead of the persistent system fileRoot cause
After a Docker container restart (or Mac reboot), the gateway regenerates its mTLS certificates. However,
~/.ssh/known_hostsstill contains the old host key fingerprint foropenshell-<sandbox>. SSH rejects the new certificate with "handshake verification failed" — the only workaround was destroying and recreating the entire sandbox.Test plan
nemoclaw <name> status— should reconnect without errorsnemoclaw onboardon a fresh setup — verify SSH connections succeednemoclaw debug— verify sandbox introspection worksFixes #768
🤖 Generated with Claude Code
Summary by CodeRabbit