feat: add Kubernetes testing infrastructure #227
Add k8s-testing/ directory with scripts and manifests for testing NemoClaw on Kubernetes with Dynamo vLLM inference.
Includes:
- test-installer.sh: Public installer test (requires unattended install support)
- setup.sh: Manual setup from source for development
- Pod manifests for Docker-in-Docker execution
Architecture: OpenShell runs k3s inside Docker, so we use DinD pods to provide the Docker daemon on Kubernetes.
Signed-off-by: rwipfelnv
OpenShell's nested k3s cluster cannot resolve Kubernetes DNS names, so inference requests fail with 502 Bad Gateway.
This adds:
- socat TCP proxy setup in setup.sh to forward localhost:8000 to the K8s vLLM service endpoint
- Provider configuration using host.openshell.internal:8000, which resolves to the workspace container from inside k3s
- Documentation explaining the network architecture and workaround
- Updated env var names to match PR NVIDIA#318 (NEMOCLAW_NON_INTERACTIVE)
- cgroup v2 compatibility fix for Docker daemon
- Removed memory limits that caused OOM
Tested: Inference requests from sandboxes now route correctly through the socat proxy to the Dynamo vLLM endpoint.
Depends on: NVIDIA#318 (non-interactive mode), NVIDIA#365 (Dynamo provider)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
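As a rough illustration of the socat bridge this commit describes, the sketch below only builds and prints a forwarding command rather than running it. The upstream host name is a placeholder, not a value from this PR, and the `fork` and `reuseaddr` options are an assumption about how the listener would typically be configured.

```shell
#!/usr/bin/env bash
# Sketch only: prints a socat forwarding command; nothing is executed.

# Build a socat invocation that listens on a local TCP port and forwards
# every connection to an upstream host:port.
#   fork      - serve each incoming connection in its own child process
#   reuseaddr - allow the listener to be restarted without waiting on TIME_WAIT
socat_forward_cmd() {
  local listen_port="$1" upstream="$2"
  printf 'socat TCP-LISTEN:%s,fork,reuseaddr TCP:%s' "$listen_port" "$upstream"
}

# Placeholder upstream; replace with the real inference service address.
socat_forward_cmd 8000 "vllm-frontend.example.svc.cluster.local:8000"
echo
```

Sandboxes inside the nested k3s cluster would then reach the listener through `host.openshell.internal:8000`, as the commit message explains.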
Complete K8s deployment solution for NemoClaw:
- nemoklaw.yaml: Pod manifest with DinD, init containers, hostPath storage
- install.sh: Interactive installer with preflight checks
- Rename k8s-testing -> k8s, move old files to dev/
Key learnings:
- hostPath storage (/mnt/k8s-disks) avoids ephemeral storage eviction
- Init containers for docker config, openshell CLI, NemoClaw build
- Workspace container installs apt packages at runtime (can't share via volumes)
- socat proxy bridges K8s DNS to nested k3s (host.openshell.internal)
Tested successfully with Dynamo vLLM backend on EKS.
Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>
NemoKlaw Testing Results
✅ Successfully tested NemoKlaw (NemoClaw on Kubernetes) with Dynamo vLLM backend on EKS.
What's Working
Key Fix: hostPath Storage
The main blocker was ephemeral storage eviction during sandbox image builds. Fixed by using hostPath storage on the NVMe RAID array (
Files Added/Updated
Sample Output
Next Steps
jacobtomlinson
left a comment
This is nice to see, thanks. I left some comments.
We should also work with @jayavenkatesh19 to get this running in GitHub Actions. What is the hardware requirement for this? I expect we will need a GPU CI runner.
Address PR feedback:
- Rename NemoKlaw -> NemoClaw (avoid confusing naming)
- Rename nemoklaw.yaml -> nemoclaw-k8s.yaml
- Fix hardcoded endpoint to use generic example
- Remove log file from repo
- Document known limitations (HTTPS proxy issue)
- Update README with accurate status of what works/doesn't work
Signed-off-by: rwipfelnv
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The aggregated frontend service is the correct endpoint for Dynamo vLLM inference. Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>
K8s Testing Results - NemoClaw on Kubernetes
Tested NemoClaw deployment on Kubernetes with Dynamo vLLM inference endpoint using the manifests in this PR.
Onboard Logs
Testing Dynamo Inference
From inside the sandbox, the Dynamo endpoint is accessible via
# List available models
wget -qO- http://host.openshell.internal:8000/v1/models
# {"object":"list","data":[{"id":"meta-llama/Llama-3.1-8B-Instruct","object":"model","created":1774352462,"owned_by":"nvidia"}]}
# Test chat completions
wget -qO- --post-data='{"model":"meta-llama/Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"Say hello"}],"max_tokens":20}' \
--header='Content-Type: application/json' \
http://host.openshell.internal:8000/v1/chat/completions
# {"choices":[{"message":{"content":"Hello. How can I assist you today?","role":"assistant"}}],"model":"meta-llama/Llama-3.1-8B-Instruct"}
How the K8s Deployment Works
Deployment Steps
# Deploy Dynamo (if not already running)
kubectl apply -f dynamo-manifests/
# Deploy NemoClaw
kubectl apply -f k8s/nemoclaw-k8s.yaml
# Check logs
kubectl logs nemoclaw -n nemoclaw -c workspace -f
✅ K8s deployment with Dynamo inference working correctly.
Note: Requires PR #365 for Dynamo provider support in
- Add workspace shell access command
- Add sandbox status/logs/list commands
- Add chat completion test example
- Rename section from "What Can You Do?" to "Using NemoClaw"
Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>
📝 Walkthrough
Adds an experimental Kubernetes deployment and documentation for running NemoClaw with a privileged Docker-in-Docker pod, socat proxying to an external inference endpoint, and an automated installer workflow for GPU inference within a k3s sandboxed environment.
Changes
Sequence Diagram(s)
sequenceDiagram
participant K8s as Kubernetes API
participant Init as init container
participant DinD as docker:dind sidecar
participant Workspace as workspace container
participant Socat as socat proxy
participant Inference as Dynamo/vLLM endpoint
K8s->>Init: create pod & run init
Init->>DinD: write /etc/docker/daemon.json (host cgroupns)
K8s->>DinD: start dockerd (DinD)
DinD-->>Workspace: expose docker socket (/var/run/docker.sock)
Workspace->>DinD: poll `docker info` until ready
Workspace->>Socat: start socat proxy listening :8000
Socat->>Inference: forward traffic to $DYNAMO_HOST
Workspace->>Inference: run NemoClaw installer (connect to NEMOCLAW_ENDPOINT_URL)
Workspace->>Workspace: keep running (sleep infinity) after onboarding
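The readiness step in the sequence above (the workspace polling `docker info` until the DinD daemon answers) can be sketched as a small retry helper. This is an illustrative sketch, not the code from the manifest; the function name `wait_until` and the attempt and delay values are assumptions.

```shell
#!/usr/bin/env bash
# Retry a command until it succeeds or the attempt budget runs out.
# Usage: wait_until <max_attempts> <delay_seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
wait_until() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Discard the command's output; only its exit status matters here.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 1
}

# Example (commented out because it needs a reachable Docker daemon):
# wait_until 60 2 docker info || { echo "Docker daemon not ready" >&2; exit 1; }
```

Bounding the number of attempts keeps a pod with a broken daemon from blocking forever and leaves a clear error in the workspace logs.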
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 10
♻️ Duplicate comments (1)
k8s/dev/setup.sh (1)
14-16: ⚠️ Potential issue | 🟡 Minor
Default endpoint uses personal namespace.
The default `VLLM_ENDPOINT` references `robert.svc.cluster.local`, which appears to be a user-specific namespace. This should use a more generic default or placeholder.
Proposed fix
 # vLLM endpoint (Dynamo)
-VLLM_ENDPOINT="${VLLM_ENDPOINT:-http://vllm-agg-frontend.robert.svc.cluster.local:8000/v1}"
+VLLM_ENDPOINT="${VLLM_ENDPOINT:-http://vllm-agg-frontend.dynamo.svc.cluster.local:8000/v1}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@k8s/dev/setup.sh` around lines 14 - 16, The default VLLM_ENDPOINT in setup.sh uses a personal namespace ("robert.svc.cluster.local"); update the VLLM_ENDPOINT default to a generic or placeholder host (e.g., localhost, vllm.svc.cluster.local, or a clearly documented placeholder) so it doesn't reference a user-specific namespace; modify the VLLM_ENDPOINT variable definition (VLLM_ENDPOINT="${VLLM_ENDPOINT:-...}") to the chosen generic default and add a brief inline comment if needed to indicate it should be overridden in deployment.
🧹 Nitpick comments (4)
k8s/dev/setup.sh (1)
7-8: Unused variable `OPENSHELL_DIR`.
`OPENSHELL_DIR` is defined but never used in the script. Consider removing it or documenting its intended purpose.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@k8s/dev/setup.sh` around lines 7 - 8, The script defines OPENSHELL_DIR (OPENSHELL_DIR="${OPENSHELL_DIR:-$NEMOCLAW_DIR/../OpenShell}") but never uses it; either remove this unused variable declaration or implement its intended usage where needed. Locate the declaration near the existing NEMOCLAW_DIR and SCRIPT_DIR logic, then either delete the OPENSHELL_DIR line to avoid dead variables or add the appropriate references to OPENSHELL_DIR where the OpenShell path is required (e.g., export it, use it for path resolution, or document its purpose).
k8s/install.sh (1)
84-85: Deprecated `kubectl version --short` flag.
The `--short` flag for `kubectl version` is deprecated. The fallback uses JSON parsing, which is the correct approach, but the primary command will emit a deprecation warning.
Proposed fix
-info "kubectl found: $(kubectl version --client --short 2>/dev/null || kubectl version --client -o json | grep gitVersion | head -1)"
+info "kubectl found: $(kubectl version --client -o json 2>/dev/null | jq -r '.clientVersion.gitVersion' 2>/dev/null || kubectl version --client 2>&1 | head -1)"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@k8s/install.sh` around lines 84 - 85, Replace the deprecated `kubectl version --client --short` usage in the info call with the JSON-based client version parsing: run `kubectl version --client -o json`, extract the client gitVersion (e.g., parsing `.clientVersion.gitVersion`), and then pass that extracted version into the info message; update the invocation around the info function so it no longer uses `--short` and always falls back to the JSON parsing logic currently used.
k8s/dev/openshell-gateway.yaml (1)
6-7: Consider using a dedicated namespace instead of `default`.
Deploying to the `default` namespace may conflict with other workloads and makes cleanup harder. The other manifests use the `nemoclaw` namespace. For consistency and isolation, consider using the same namespace or making it configurable.
Proposed change
 metadata:
   name: openshell-gateway
-  namespace: default
+  namespace: nemoclaw
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@k8s/dev/openshell-gateway.yaml` around lines 6 - 7, Update the Deployment/Service manifest for openshell-gateway to avoid using the default namespace: replace the hard-coded namespace value "default" with the dedicated namespace used by other manifests (e.g., "nemoclaw") or switch it to a configurable placeholder so deployments can target the proper namespace; locate the resource named "openshell-gateway" and change its namespace field accordingly and ensure any RBAC/ServiceAccount references (if present) align with the new namespace.
k8s/dev/nemoclaw-installer-test.yaml (1)
89-95: Using `emptyDir` for Docker storage may cause ephemeral storage exhaustion.
The main manifest (`k8s/nemoclaw-k8s.yaml`) explicitly uses `hostPath` with a note that `emptyDir` causes ephemeral storage eviction during sandbox image builds. This test manifest uses `emptyDir`, which could make tests flaky on clusters with limited ephemeral storage.
Consider whether this is intentional for simpler test cleanup, or if it should align with the main manifest's approach.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@k8s/dev/nemoclaw-installer-test.yaml` around lines 89 - 95, The test manifest uses emptyDir for the volumes "docker-storage", "docker-socket", and "docker-config", which can exhaust ephemeral node storage during sandbox image builds; update these volume definitions to match the main manifest by switching from emptyDir to hostPath (or another persistent volume type) for "docker-storage" (and consider hostPath for the others) so builds won't be evicted, or if ephemeral storage is intentional add a comment and make the choice configurable for CI; locate and modify the volume entries named docker-storage, docker-socket, and docker-config in the test manifest to implement this change.
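For the `kubectl version` nitpick above, one possible shape for the JSON-based extraction is sketched below. It avoids the `jq` dependency used in the proposed fix by matching the `gitVersion` field with `sed`; the function name `client_git_version` is hypothetical, and the sketch assumes the JSON contains a single `gitVersion` field, as it should when only `--client` output is requested.

```shell
#!/usr/bin/env bash
# Read kubectl's JSON version output on stdin and print the gitVersion value.
# Works on both pretty-printed and single-line JSON.
client_git_version() {
  sed -n 's/.*"gitVersion"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Example (commented out because it needs kubectl on PATH):
# echo "kubectl found: $(kubectl version --client -o json 2>/dev/null | client_git_version)"
```

A real JSON parser such as `jq` is more robust; the `sed` form is only a fallback for environments where `jq` is not installed.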
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@k8s/dev/nemoclaw-installer-test.yaml`:
- Around line 60-63: Replace the mismatched env var names
NEMOCLAW_DYNAMO_ENDPOINT and NEMOCLAW_DYNAMO_MODEL with the expected names
NEMOCLAW_ENDPOINT_URL and NEMOCLAW_MODEL so the onboard consumer reads them;
update the values only (keep the same endpoint and model strings) and ensure any
references in the deployment spec or container env section that currently use
NEMOCLAW_DYNAMO_* now point to NEMOCLAW_ENDPOINT_URL and NEMOCLAW_MODEL to match
the loader that reads those exact keys.
In `@k8s/dev/setup.sh`:
- Around line 181-185: Update the example curl in setup.sh to use HTTP rather
than HTTPS to match k8s/README.md: replace the request URL in the curl example
that currently targets "https://inference.local/v1/chat/completions" with
"http://inference.local/v1/chat/completions" (the rest of the invocation using
$VLLM_MODEL and the JSON payload stays the same) so the inline sandbox test
reflects the documented working protocol.
- Around line 119-131: The echo and variable expansion inside the bash -c string
are broken because $VLLM_HOST_PORT is inside single quotes; change the inner
echo to use double quotes (or otherwise ensure $VLLM_HOST_PORT is not
single-quoted) so the variable expands, and verify the nohup socat
TCP:$VLLM_HOST_PORT invocation is also evaluated in the same context; update the
echo line ("socat proxy started: localhost:8000 -> $VLLM_HOST_PORT") and any
other single-quoted occurrences that should expand the POD_NAME, NAMESPACE or
VLLM_HOST_PORT variables to use double quotes or proper escaping.
In `@k8s/dev/test-installer.sh`:
- Around line 41-42: DYNAMO_MODEL is being defined but never used; either pass
it into the generated YAML or remove it. If you want to keep it, update the code
that prepares the manifest (the sed/substitution logic that currently looks at
NEMOCLAW_DYNAMO_ENDPOINT) to also substitute DYNAMO_MODEL into the template
(reference DYNAMO_MODEL and the sed/replace step), or if the model value should
come from the YAML default, remove the DYNAMO_MODEL variable entirely. Also
change the default DYNAMO_ENDPOINT from the user-specific "robert" namespace to
a neutral default or document that it must be overridden (reference
NEMOCLAW_DYNAMO_ENDPOINT).
In `@k8s/install.sh`:
- Around line 319-323: The script is cloning a personal-fork branch (git clone
... --branch rwipfelnv/dynamo-support https://github.com/rwipfelnv/NemoClaw.git)
— replace the fork/branch reference with the official upstream repository and
appropriate branch (e.g., change the URL to
https://github.com/NemoClaw/NemoClaw.git and set --branch to the official branch
name such as main or the upstream dynamo-support branch) so that the git clone
in the block that uses "git clone --depth 1 --branch rwipfelnv/dynamo-support
https://github.com/rwipfelnv/NemoClaw.git nemoclaw-src" pulls from the official
repo before merging.
- Around line 229-236: The env var names in this manifest are inconsistent with
the onboarding/verification script: change the incorrectly named
NEMOCLAW_DYNAMO_ENDPOINT and NEMOCLAW_DYNAMO_MODEL to the canonical env var
names used by the onboarding code (match the expected names in
nemoclaw-k8s.yaml/verification script), keeping NEMOCLAW_PROVIDER="dynamo" and
NEMOCLAW_SANDBOX_NAME unchanged; update any references to these symbols so the
onboarding function reads the correct variables (verify by comparing the exact
env var identifiers in the verification manifest).
- Around line 298-306: The HOST_PORT extraction fails for endpoints starting
with https://; update the HOST_PORT computation (the code that parses ENDPOINT
into HOST_PORT) to strip both http:// and https:// prefixes and then remove the
trailing /v1, e.g. use a single sed that handles optional "s" (s|^https\?://||)
or use shell parameter expansion (strip protocol then remove /v1) so HOST_PORT
is correct for both HTTP and HTTPS endpoints; modify the lines that set ENDPOINT
and HOST_PORT accordingly.
In `@k8s/nemoclaw-k8s.yaml`:
- Around line 148-153: The init container currently clones the personal
fork/branch "rwipfelnv/dynamo-support" from "rwipfelnv/NemoClaw.git"; update the
git clone command to point to the official NVIDIA repository and a stable branch
(e.g., change "git clone --depth 1 --branch rwipfelnv/dynamo-support
https://github.com/rwipfelnv/NemoClaw.git" to use
"https://github.com/NVIDIA/NemoClaw.git" and a production branch like "main" or
a specific release tag) so the commands in the block (git clone, npm install,
npm run build) use the upstream source instead of a personal fork.
In `@k8s/README.md`:
- Around line 98-100: The fenced code blocks in README.md for the shell prompt
and the architecture diagram are missing language identifiers; update the
opening backtick fences for the examples around the shell prompt (the block
containing "sandbox@my-assistant:~$") and the ASCII architecture diagram (the
block starting with the box top "┌────────────────…") to include a language tag
such as text or plaintext (e.g., change ``` to ```text) so static analysis
recognizes them; ensure you change both the block around the prompt and the
larger diagram block (lines indicated in the review) to use the same tag.
- Around line 239-253: Replace the emoji status markers in the markdown tables
with text alternatives to match the "No emoji in technical prose" guideline:
change ✅ to "Working" (or "Yes") and ❌ to "Blocked" (or "No") in the status
column entries for rows such as "Unattended onboard", "Sandbox creation",
"Connect to sandbox", "HTTP inference", "Socat proxy bridge", and the "What
Doesn't Work Yet" rows "openclaw tui", "openclaw agent", "HTTPS inference";
ensure the table alignment and pipe separators remain valid and update any
inline notes that reference emojis to use the same textual statuses.
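The `HOST_PORT` finding for `k8s/install.sh` above asks for parsing that handles both `http://` and `https://` endpoints. A minimal sketch using shell parameter expansion follows; the function name `endpoint_host_port` and the sample endpoint are illustrative and not taken from the script.

```shell
#!/usr/bin/env bash
# Reduce an endpoint URL such as http://host:8000/v1 to host:8000.
# Handles both http:// and https:// schemes and an optional trailing slash.
endpoint_host_port() {
  local rest="$1"
  rest="${rest#http://}"    # strip the scheme if it is http
  rest="${rest#https://}"   # strip the scheme if it is https
  rest="${rest%/}"          # drop a single trailing slash, if present
  rest="${rest%/v1}"        # drop the /v1 API suffix, if present
  printf '%s' "$rest"
}

# Placeholder endpoint for illustration only.
HOST_PORT="$(endpoint_host_port "https://vllm-frontend.example.svc.cluster.local:8000/v1")"
echo "$HOST_PORT"
```

Parameter expansion keeps the parsing inside the shell, so it does not depend on which `sed` dialect the container image ships.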
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 495b5934-f061-4f76-b4fa-0e5ac8381a18
📒 Files selected for processing (7)
k8s/README.md
k8s/dev/nemoclaw-installer-test.yaml
k8s/dev/openshell-gateway.yaml
k8s/dev/setup.sh
k8s/dev/test-installer.sh
k8s/install.sh
k8s/nemoclaw-k8s.yaml
Actionable comments posted: 2
♻️ Duplicate comments (1)
k8s/README.md (1)
127-148: ⚠️ Potential issue | 🟡 Minor
Add a language identifier to the architecture fenced block.
The code fence starting at Line 127 is missing a language tag (`text`/`plaintext`), which trips markdown linting.
Proposed fix
-``` +```text ┌─────────────────────────────────────────────────────────────────┐ ... -``` +```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@k8s/README.md` around lines 127 - 148, The fenced ASCII-art block beginning with the triple backticks (``` ) that contains the "Kubernetes Cluster" diagram is missing a language tag; update its opening fence to include a language identifier (e.g., change ``` to ```text) so the block is treated as plain text by Markdown linters and renderers—look for the block that starts with the box drawing "┌─────────────────────────────────────────────────────────────────┐" and modify its opening fence accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@k8s/nemoclaw-k8s.yaml`:
- Around line 69-71: Replace the inline pipe-exec of the remote installer ("curl
-fsSL https://nvidia.com/nemoclaw.sh | bash") with a download-then-verify flow:
curl or wget the installer into a local file (e.g., nemoclaw.sh), obtain and
verify a pinned checksum or GPG signature for that exact release (or fetch a
versioned artifact URL instead of the root URL), only after successful
verification set secure permissions and execute the local file (bash
./nemoclaw.sh); ensure failure paths abort the job and log the verification
error.
In `@k8s/README.md`:
- Around line 92-97: The README's curl examples use https://inference.local but
the deployment in this PR routes through socat over plain HTTP; update the two
examples (the model list curl and the chat completions curl showing model
"meta-llama/Llama-3.1-8B-Instruct") and the later example at line 107 to use the
configured HTTP endpoint (http://host.openshell.internal:8000/v1) instead of
https://inference.local so the documented commands match the running proxy and
will succeed.
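The download-then-verify flow requested for `k8s/nemoclaw-k8s.yaml` above could look roughly like the sketch below. It assumes GNU `sha256sum` is available in the workspace image, and `EXPECTED_SHA256` is a placeholder: this conversation does not show a published checksum for the installer. The helper name `verify_sha256` is hypothetical.

```shell
#!/usr/bin/env bash
# Return 0 only if the file's SHA-256 digest equals the expected hex string.
verify_sha256() {
  local file="$1" expected="$2"
  local actual
  # sha256sum prints "<digest>  <filename>"; keep only the digest column.
  actual="$(sha256sum "$file" | awk '{print $1}')"
  [ "$actual" = "$expected" ]
}

# Intended use (commented out: needs network access and a real pinned checksum):
# curl -fsSL -o nemoclaw.sh https://nvidia.com/nemoclaw.sh
# verify_sha256 nemoclaw.sh "$EXPECTED_SHA256" || { echo "checksum mismatch, aborting" >&2; exit 1; }
# bash ./nemoclaw.sh
```

Verifying before execution means a tampered or truncated download fails the step with a clear message instead of running partially.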
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4c5bce22-7b14-4615-8e74-e9df88aa1560
📒 Files selected for processing (2)
k8s/README.md
k8s/nemoclaw-k8s.yaml
- Use official NemoClaw installer (`curl | bash`) instead of git clone/build
- Switch to `custom` provider from PR NVIDIA#648 (supersedes dynamo-specific provider)
- Remove k8s/dev/ directory (no longer needed for testing)
- Use emptyDir volumes for portability across clusters
- Add /etc/hosts workaround for endpoint validation during onboarding
- Update README with verification steps for local inference
Tested end-to-end with Dynamo vLLM backend.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1e7e5e7 to
01f944a
Actionable comments posted: 1
♻️ Duplicate comments (3)
k8s/README.md (2)
127-127: ⚠️ Potential issue | 🟡 Minor
Add a language identifier to the architecture fenced block.
The fenced code block starting on Line 127 should include a language tag (e.g., `text`) to satisfy markdown lint rules.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@k8s/README.md` at line 127, The architecture fenced code block in k8s/README.md is missing a language identifier; update the opening fence from ``` to a tagged fence such as ```text (or another appropriate language) so the block satisfies markdown lint rules and renders with correct syntax highlighting.
92-97: ⚠️ Potential issue | 🟠 Major
Fix endpoint/protocol mismatch in inference examples.
Lines 93, 95, and 107 use `https://inference.local`, but the documented deployment path uses the HTTP socat bridge (`http://host.openshell.internal:8000/v1` / localhost proxy). These commands are likely to fail as written.
Also applies to: 107-107
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@k8s/README.md` around lines 92 - 97, The curl examples in the README use the wrong protocol/host (https://inference.local) which doesn't match the documented socat bridge; update each example (the two curl commands shown for listing models and chat completions and any other occurrences such as the one at line 107) to use the documented HTTP bridge host and port (http://host.openshell.internal:8000/v1 or the equivalent localhost proxy) so the paths and protocol match the deployed inference endpoint.
k8s/nemoclaw-k8s.yaml (1)
69-71: ⚠️ Potential issue | 🟠 Major
Avoid `curl | bash` for installer execution; add integrity verification.
Line 71 still executes remote code directly. Downloading a versioned artifact and verifying checksum/signature before execution is safer and reproducible.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@k8s/nemoclaw-k8s.yaml` around lines 69 - 71, The line that pipes remote script into bash ("curl -fsSL https://nvidia.com/nemoclaw.sh | bash") should be replaced by a safe, verifiable install flow: download a versioned artifact to a local file (e.g., download https://nvidia.com/nemoclaw-<version>.sh), fetch the corresponding checksum or signature, verify the file integrity (sha256sum or gpg verify) before execution, and only then run the local installer script; update the script invocation around the echo "[4/4] Running NemoClaw installer..." and replace the direct pipe with this download -> verify -> execute sequence using the unique installer invocation string "nemoclaw.sh" to locate and change the code.
🧹 Nitpick comments (1)
k8s/nemoclaw-k8s.yaml (1)
38-123: Set explicit container security contexts (even if DinD remains privileged).
`workspace` and `init-docker-config` rely on defaults. Please explicitly set least-privilege controls (`allowPrivilegeEscalation`, `capabilities`, `seccompProfile`) to improve security posture and satisfy baseline policy checks.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@k8s/nemoclaw-k8s.yaml` around lines 38 - 123, Add explicit securityContext blocks for the workspace and init-docker-config containers: for the workspace container (name: workspace) add a securityContext that documents DinD needs (e.g., privileged: true if required), set allowPrivilegeEscalation explicitly, drop all capabilities (capabilities: { drop: ["ALL"] }) and set seccompProfile to RuntimeDefault; for the init container (name: init-docker-config) add a least-privilege securityContext with allowPrivilegeEscalation: false, capabilities drop ALL, seccompProfile: RuntimeDefault, and enforce readOnlyRootFilesystem: true and runAsNonRoot: true where compatible. Ensure these securityContext entries are added under the corresponding container specs (workspace and init-docker-config) so policy checks pass while preserving required DinD permissions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@k8s/nemoclaw-k8s.yaml`:
- Around line 117-122: Replace the ephemeral emptyDir volumes used for DinD
layers with a host-backed or persistent volume: change the volume definitions
for docker-storage (and consider docker-socket/docker-config if they must
persist) from emptyDir to either a hostPath (pointing to a host directory) or
reference a PersistentVolumeClaim, and update any related volumeMounts and pod
spec to use that host-backed volume; ensure the chosen PersistentVolume/PVC or
hostPath has appropriate permissions and retention so DinD image layers are not
evicted during builds.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6142f3a7-46ea-461e-b9f6-9c3ad360eeba
📒 Files selected for processing (2)
k8s/README.md
k8s/nemoclaw-k8s.yaml
- Remove multi-document YAML (move namespace creation to README)
- Add language specifier to fenced code block (```text)
- Add blank lines before lists per markdownlint rules
Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
- Add explicit experimental warning at top of README - Clarify this is for trying NemoClaw on k8s, not production - Document privileged pod and DinD requirements upfront - Add resource requirements to prerequisites Signed-off-by: Robert Wipfel <rwipfel@nvidia.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Actionable comments posted: 2
🧹 Nitpick comments (1)
k8s/nemoclaw-k8s.yaml (1)
15-15: Pin container images by digest for reproducibility.
Using floating tags (docker:24-dind, node:22, busybox) can silently change behavior between runs. Prefer immutable digest pins, especially for repeatable test infra.
📌 Example pinning pattern
- image: docker:24-dind
+ image: docker:24-dind@sha256:<docker-dind-digest>
...
- image: node:22
+ image: node:22@sha256:<node-digest>
...
- image: busybox
+ image: busybox@sha256:<busybox-digest>
Also applies to: 36-36, 105-105
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@k8s/nemoclaw-k8s.yaml` at line 15, The image fields currently use floating tags (e.g., the entries with "image: docker:24-dind", "image: node:22", and "image: busybox") which should be replaced with immutable digest-pinned references; update each image value to the corresponding registry digest form (e.g., docker:24-dind@sha256:<digest>) for the occurrences referenced in the diff (the image lines at 15, 36, and 105) so the manifest always pulls an exact, reproducible image; obtain the correct sha256 digest from the image registry (Docker Hub or your registry) and replace the tag-only values with the full digest-pinned strings.
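One way to obtain the digests mentioned above is to pull each tag and read back the digest Docker recorded for it. The sketch below is an assumption-laden illustration, not part of the PR: it presumes a local Docker daemon with registry access, and `digest_from_repo_digest`, `pin_ref`, and `resolve_and_pin` are hypothetical helper names.

```shell
# digest_from_repo_digest NAME@DIGEST
# Extract the "sha256:..." part from a RepoDigests-style string.
digest_from_repo_digest() {
  printf '%s\n' "${1##*@}"
}

# pin_ref IMAGE_TAG DIGEST
# Print IMAGE_TAG@DIGEST after checking that DIGEST looks like a sha256 digest.
pin_ref() {
  if ! printf '%s\n' "$2" | grep -Eq '^sha256:[0-9a-f]{64}$'; then
    echo "not a sha256 digest: $2" >&2
    return 1
  fi
  printf '%s@%s\n' "$1" "$2"
}

# resolve_and_pin IMAGE_TAG
# Pull the tag, then read the digest the daemon stored for it.
# Needs a running Docker daemon and network access to the registry.
resolve_and_pin() {
  docker pull --quiet "$1" >/dev/null || return 1
  repo_digest=$(docker inspect --format '{{index .RepoDigests 0}}' "$1") || return 1
  pin_ref "$1" "$(digest_from_repo_digest "$repo_digest")"
}
```

Calling `resolve_and_pin docker:24-dind` would print a reference in the `docker:24-dind@sha256:<digest>` form used in the pinning pattern; the same call applies to `node:22` and `busybox`. The digest is whatever the registry serves at pull time, so it should be re-resolved deliberately when upgrading.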
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@k8s/nemoclaw-k8s.yaml`:
- Around line 11-12: Add automountServiceAccountToken: false at the Pod spec
level to disable automatic mounting of the service account token; update the
top-level spec (the same block that contains containers) to include
automountServiceAccountToken: false so the pod running Docker-in-Docker and the
NemoClaw installer will not receive a mounted service-account token.
- Around line 50-53: The socat bridge is started in the background (the `socat
TCP-LISTEN:8000,... &` line) but not validated; modify this block to capture its
PID ($!), wait briefly and verify it is listening or still running (e.g., loop
with `kill -0 $SOCAT_PID` or test port 8000 with `nc`/`ss`), and if it has
exited or the port is not open within a short timeout log an error and exit
non‑zero so onboarding fails fast; do not rely on `set -e` for background
processes and keep the existing hosts entry (`echo "127.0.0.1
host.openshell.internal" >> /etc/hosts`) logic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7befcf59-fe03-4fb4-8018-3570724a1e91
📒 Files selected for processing (2)
- k8s/README.md
- k8s/nemoclaw-k8s.yaml
✅ Files skipped from review due to trivial changes (1)
- k8s/README.md
spec:
  containers:
🧩 Analysis chain
🏁 Script executed:
cat -n k8s/nemoclaw-k8s.yaml
Repository: NVIDIA/NemoClaw
Length of output: 4492
Disable automatic service-account token mount for this pod.
The pod does not call the Kubernetes API—it only runs Docker-in-Docker and the NemoClaw installer. Mounting a token by default increases blast radius if compromised. Add automountServiceAccountToken: false at the spec level.
🔒 Proposed hardening
  spec:
+   automountServiceAccountToken: false
    containers:
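To confirm the hardening took effect, one can check from inside the running pod that no token file exists at the standard service-account mount path. This is a minimal sketch; `sa_token_mounted` is a hypothetical helper, and its optional root argument exists only so the check can be exercised outside a pod.

```shell
# sa_token_mounted [ROOT]
# Succeeds if a service-account token file is present under ROOT (default: /).
# /var/run/secrets/kubernetes.io/serviceaccount is the standard mount path.
sa_token_mounted() {
  [ -e "${1:-}/var/run/secrets/kubernetes.io/serviceaccount/token" ]
}

# report_sa_token [ROOT]
# Print a one-line verdict; returns non-zero when a token is still mounted.
report_sa_token() {
  if sa_token_mounted "${1:-}"; then
    echo "service-account token is mounted"
    return 1
  fi
  echo "no service-account token mounted"
}
```

Assuming the pod and namespace are both named `nemoclaw` (as in the Trivy finding) and the container is `workspace`, the check could be run by pasting the functions into a shell opened with `kubectl exec -it -n nemoclaw nemoclaw -c workspace -- sh` and calling `report_sa_token`.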
🧰 Tools
🪛 Trivy (0.69.3)
[error] 11-119: Default security context configured
pod nemoclaw in nemoclaw namespace is using the default security context, which allows root privileges
Rule: KSV-0118
(IaC/Kubernetes)
socat TCP-LISTEN:8000,fork,reuseaddr TCP:$DYNAMO_HOST &
# Add hosts entry so validation can reach socat via host.openshell.internal
echo "127.0.0.1 host.openshell.internal" >> /etc/hosts
sleep 1
🧩 Analysis chain
🏁 Script executed:
find . -name "nemoclaw-k8s.yaml" -type f
Repository: NVIDIA/NemoClaw
Length of output: 82
🏁 Script executed:
cat -n ./k8s/nemoclaw-k8s.yaml | head -80
Repository: NVIDIA/NemoClaw
Length of output: 3131
🏁 Script executed:
cat -n ./k8s/nemoclaw-k8s.yaml | sed -n '40,70p'
Repository: NVIDIA/NemoClaw
Length of output: 1416
Fail fast if the socat bridge does not start.
At line 50, socat is backgrounded but its health is never validated. If it exits immediately, onboarding continues and fails later with less actionable errors. Note that set -e only applies to foreground commands, not background processes.
🛠️ Proposed reliability check
- socat TCP-LISTEN:8000,fork,reuseaddr TCP:$DYNAMO_HOST &
+ : "${DYNAMO_HOST:?DYNAMO_HOST must be set as host:port}"
+ socat TCP-LISTEN:8000,fork,reuseaddr TCP:$DYNAMO_HOST &
+ SOCAT_PID=$!
  # Add hosts entry so validation can reach socat via host.openshell.internal
  echo "127.0.0.1 host.openshell.internal" >> /etc/hosts
  sleep 1
+ kill -0 "$SOCAT_PID" 2>/dev/null || { echo "socat failed to start"; exit 1; }
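A single `kill -0` after a fixed sleep only shows that the process is still alive, not that the port is accepting connections. A polling variant might look like the sketch below, which is an illustration rather than the change proposed in this review: `wait_for_bridge` is a hypothetical helper, and the readiness probe is passed in as a command because the tools available in the workspace image (`nc`, `ss`) are an assumption.

```shell
# wait_for_bridge PID TIMEOUT_SECS PROBE_CMD [ARGS...]
# Poll once per second until PROBE_CMD succeeds, the process PID exits, or
# TIMEOUT_SECS attempts have been made. Runs in a subshell so its variables
# and exit calls do not leak into the caller.
wait_for_bridge() (
  pid=$1
  timeout=$2
  shift 2
  i=0
  while [ "$i" -lt "$timeout" ]; do
    # If the background process is gone, the bridge will never come up.
    if ! kill -0 "$pid" 2>/dev/null; then
      echo "bridge process $pid exited" >&2
      exit 1
    fi
    # The probe decides readiness, for example: nc -z 127.0.0.1 8000
    if "$@" >/dev/null 2>&1; then
      exit 0
    fi
    sleep 1
    i=$((i + 1))
  done
  echo "bridge not ready after ${timeout}s" >&2
  exit 1
)
```

In the manifest this could follow the `socat ... &` line as `wait_for_bridge "$!" 10 nc -z 127.0.0.1 8000 || exit 1`, assuming `nc` with `-z` support is installed in the workspace container. The explicit `|| exit 1` matters because `set -e` does not react to a background process dying.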
* feat: add Kubernetes testing infrastructure

  Add k8s-testing/ directory with scripts and manifests for testing NemoClaw on Kubernetes with Dynamo vLLM inference. Includes:
  - test-installer.sh: Public installer test (requires unattended install support)
  - setup.sh: Manual setup from source for development
  - Pod manifests for Docker-in-Docker execution

  Architecture: OpenShell runs k3s inside Docker, so we use DinD pods to provide the Docker daemon on Kubernetes.

  Signed-off-by: rwipfelnv

* fix: add socat proxy for K8s DNS isolation workaround

  OpenShell's nested k3s cluster cannot resolve Kubernetes DNS names, so inference requests fail with 502 Bad Gateway. This adds:
  - socat TCP proxy setup in setup.sh to forward localhost:8000 to the K8s vLLM service endpoint
  - Provider configuration using host.openshell.internal:8000 which resolves to the workspace container from inside k3s
  - Documentation explaining the network architecture and workaround
  - Updated env var names to match PR NVIDIA#318 (NEMOCLAW_NON_INTERACTIVE)
  - cgroup v2 compatibility fix for Docker daemon
  - Removed memory limits that caused OOM

  Tested: Inference requests from sandboxes now route correctly through the socat proxy to the Dynamo vLLM endpoint.

  Depends on: NVIDIA#318 (non-interactive mode), NVIDIA#365 (Dynamo provider)

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: NemoKlaw - NemoClaw on Kubernetes with Dynamo support

  Complete K8s deployment solution for NemoClaw:
  - nemoklaw.yaml: Pod manifest with DinD, init containers, hostPath storage
  - install.sh: Interactive installer with preflight checks
  - Rename k8s-testing -> k8s, move old files to dev/

  Key learnings:
  - hostPath storage (/mnt/k8s-disks) avoids ephemeral storage eviction
  - Init containers for docker config, openshell CLI, NemoClaw build
  - Workspace container installs apt packages at runtime (can't share via volumes)
  - socat proxy bridges K8s DNS to nested k3s (host.openshell.internal)

  Tested successfully with Dynamo vLLM backend on EKS.

  Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>

* fix: rename NemoKlaw to NemoClaw and document known limitations

  Address PR feedback:
  - Rename NemoKlaw -> NemoClaw (avoid confusing naming)
  - Rename nemoklaw.yaml -> nemoclaw-k8s.yaml
  - Fix hardcoded endpoint to use generic example
  - Remove log file from repo
  - Document known limitations (HTTPS proxy issue)
  - Update README with accurate status of what works/doesn't work

  Signed-off-by: rwipfelnv
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: update DYNAMO_HOST to vllm-agg-frontend

  The aggregated frontend service is the correct endpoint for Dynamo vLLM inference.

  Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>

* docs: add Using NemoClaw section with CLI commands

  - Add workspace shell access command
  - Add sandbox status/logs/list commands
  - Add chat completion test example
  - Rename section from "What Can You Do?" to "Using NemoClaw"

  Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>

* refactor(k8s): simplify deployment to use official installer

  - Use official NemoClaw installer (`curl | bash`) instead of git clone/build
  - Switch to `custom` provider from PR NVIDIA#648 (supersedes dynamo-specific provider)
  - Remove k8s/dev/ directory (no longer needed for testing)
  - Use emptyDir volumes for portability across clusters
  - Add /etc/hosts workaround for endpoint validation during onboarding
  - Update README with verification steps for local inference

  Tested end-to-end with Dynamo vLLM backend.

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(k8s): resolve lint errors in yaml and markdown

  - Remove multi-document YAML (move namespace creation to README)
  - Add language specifier to fenced code block (```text)
  - Add blank lines before lists per markdownlint rules

  Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>
  Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

* docs(k8s): add experimental warning and clarify requirements

  - Add explicit experimental warning at top of README
  - Clarify this is for trying NemoClaw on k8s, not production
  - Document privileged pod and DinD requirements upfront
  - Add resource requirements to prerequisites

  Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>
  Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

---------

Signed-off-by: rwipfelnv
Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Co-authored-by: KJ <kejones@nvidia.com>
Summary
- k8s-testing/ directory with scripts and manifests for testing NemoClaw on Kubernetes
Files
- test-installer.sh
- nemoclaw-installer-test.yaml
- setup.sh
- openshell-gateway.yaml
Dependencies
- test-installer.sh to work
Test plan
- NEMOCLAW_DYNAMO_ENDPOINT env var
🤖 Generated with Claude Code
Summary by CodeRabbit
Documentation
Chores