
feat: add Kubernetes testing infrastructure #227

Merged
kjw3 merged 12 commits into NVIDIA:main from rwipfelnv:rwipfelnv/k8s-testing
Mar 30, 2026

Contributor

@rwipfelnv rwipfelnv commented Mar 17, 2026

Summary

  • Add k8s-testing/ directory with scripts and manifests for testing NemoClaw on Kubernetes
  • Supports Dynamo vLLM as inference backend
  • Uses Docker-in-Docker (DinD) to provide Docker daemon on K8s (OpenShell runs k3s inside Docker)

Files

| File | Description |
| --- | --- |
| test-installer.sh | Public installer test script (recommended) |
| nemoclaw-installer-test.yaml | K8s pod manifest for public installer test |
| setup.sh | Manual setup from source (development) |
| openshell-gateway.yaml | K8s pod manifest for manual setup |

Dependencies

Test plan

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Documentation

    • Added an experimental Kubernetes deployment guide with quick start, configuration instructions, operational commands, architecture diagram, and troubleshooting steps for running NemoClaw with GPU inference.
  • Chores

    • Added a Kubernetes manifest to deploy a containerized NemoClaw workspace with GPU inference support, configurable environment variables, and runtime access/diagnostics for onboarding and inference testing.

Add k8s-testing/ directory with scripts and manifests for testing NemoClaw
on Kubernetes with Dynamo vLLM inference.

Includes:
- test-installer.sh: Public installer test (requires unattended install support)
- setup.sh: Manual setup from source for development
- Pod manifests for Docker-in-Docker execution

Architecture: OpenShell runs k3s inside Docker, so we use DinD pods
to provide the Docker daemon on Kubernetes.

Signed-off-by: rwipfelnv
@wscurran added the Docker, K8s, and enhancement labels Mar 18, 2026
rwipfelnv and others added 2 commits March 19, 2026 00:32
OpenShell's nested k3s cluster cannot resolve Kubernetes DNS names,
so inference requests fail with 502 Bad Gateway. This adds:

- socat TCP proxy setup in setup.sh to forward localhost:8000 to the
  K8s vLLM service endpoint
- Provider configuration using host.openshell.internal:8000 which
  resolves to the workspace container from inside k3s
- Documentation explaining the network architecture and workaround
- Updated env var names to match PR NVIDIA#318 (NEMOCLAW_NON_INTERACTIVE)
- cgroup v2 compatibility fix for Docker daemon
- Removed memory limits that caused OOM

Tested: Inference requests from sandboxes now route correctly through
the socat proxy to the Dynamo vLLM endpoint.

Depends on: NVIDIA#318 (non-interactive mode), NVIDIA#365 (Dynamo provider)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Complete K8s deployment solution for NemoClaw:
- nemoklaw.yaml: Pod manifest with DinD, init containers, hostPath storage
- install.sh: Interactive installer with preflight checks
- Rename k8s-testing -> k8s, move old files to dev/

Key learnings:
- hostPath storage (/mnt/k8s-disks) avoids ephemeral storage eviction
- Init containers for docker config, openshell CLI, NemoClaw build
- Workspace container installs apt packages at runtime (can't share via volumes)
- socat proxy bridges K8s DNS to nested k3s (host.openshell.internal)

Tested successfully with Dynamo vLLM backend on EKS.

Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>
@rwipfelnv
Contributor Author

NemoKlaw Testing Results ✅

Successfully tested NemoKlaw (NemoClaw on Kubernetes) with Dynamo vLLM backend on EKS.

What's Working

  • Unattended onboard: Full automated setup from kubectl apply to working sandbox
  • Dynamo integration: Inference via meta-llama/Llama-3.1-8B-Instruct through socat proxy
  • Docker-in-Docker: Sandbox image builds inside k3s inside DinD
  • Init containers: Clean separation of build-time vs runtime steps

Key Fix: hostPath Storage

The main blocker was ephemeral storage eviction during sandbox image builds. Fixed by using hostPath storage on the NVMe RAID array (/mnt/k8s-disks) instead of emptyDir.

Files Added/Updated

  • k8s/nemoklaw.yaml - Complete pod manifest with init containers
  • k8s/install.sh - Interactive installer with preflight checks
  • k8s/README.md - NemoKlaw documentation and quick start
  • k8s/nemoklaw-onboard.log - Sample successful onboard output

Sample Output

[1/5] Installing packages...
[2/5] Starting socat proxy...
[3/5] Waiting for Docker daemon...
[4/5] Running NemoClaw onboard...

  [1/7] Preflight checks ✓
  [2/7] Starting OpenShell gateway ✓
  [3/7] Creating sandbox ✓ (built 22-step Dockerfile)
  [4/7] Configuring inference (Dynamo) ✓
  [5/7] Setting up inference provider ✓
  [6/7] Setting up OpenClaw inside sandbox ✓
  [7/7] Policy presets ✓

[5/5] Onboard complete. Container staying alive.

Next Steps

Member

@jacobtomlinson jacobtomlinson left a comment


This is nice to see thanks. I left some comments.

We should also work with @jayavenkatesh19 to get this running in GitHub Actions. What is the hardware requirement for this? I expect we will need a GPU CI runner.

rwipfelnv and others added 2 commits March 19, 2026 14:31
Address PR feedback:
- Rename NemoKlaw -> NemoClaw (avoid confusing naming)
- Rename nemoklaw.yaml -> nemoclaw-k8s.yaml
- Fix hardcoded endpoint to use generic example
- Remove log file from repo
- Document known limitations (HTTPS proxy issue)
- Update README with accurate status of what works/doesn't work

Signed-off-by: rwipfelnv
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The aggregated frontend service is the correct endpoint for
Dynamo vLLM inference.

Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>
@rwipfelnv
Contributor Author

K8s Testing Results - NemoClaw on Kubernetes

Tested NemoClaw deployment on Kubernetes with Dynamo vLLM inference endpoint using the manifests in this PR.

Onboard Logs

[1/5] Installing packages...
[2/5] Starting socat proxy...
[3/5] Waiting for Docker daemon...
Docker ready
[4/5] Running NemoClaw onboard...

  NemoClaw Onboarding
  (non-interactive mode)
  ===================

  [1/7] Preflight checks
  ──────────────────────────────────────────────────
  ✓ Docker is running
  ✓ Container runtime: docker
  ✓ openshell CLI: openshell 0.0.14
  ✓ Port 8080 available (OpenShell gateway)
  ✓ Port 18789 available (NemoClaw dashboard)
  ⓘ No GPU detected — will use cloud inference

  [2/7] Starting OpenShell gateway
  ──────────────────────────────────────────────────
  ✓ Gateway ready

  [3/7] Creating sandbox
  ──────────────────────────────────────────────────
  ✓ Sandbox 'my-assistant' created

  [4/7] Configuring inference (NIM)
  ──────────────────────────────────────────────────
  [non-interactive] Using Dynamo provider

  [5/7] Setting up inference provider
  ──────────────────────────────────────────────────
  ✓ Created provider dynamo
  Route: inference.local
  Provider: dynamo
  Model: meta-llama/Llama-3.1-8B-Instruct
  ✓ Inference route set: dynamo / meta-llama/Llama-3.1-8B-Instruct

  [6/7] Setting up OpenClaw inside sandbox
  ──────────────────────────────────────────────────
  ✓ OpenClaw gateway launched inside sandbox

  [7/7] Policy presets
  ──────────────────────────────────────────────────
  [non-interactive] Skipping policy presets.

  ──────────────────────────────────────────────────
  Sandbox      my-assistant (Landlock + seccomp + netns)
  Model        meta-llama/Llama-3.1-8B-Instruct (Dynamo vLLM)
  NIM          not running
  ──────────────────────────────────────────────────

[5/5] Onboard complete. Container staying alive.
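
Step [3/5] in the log above waits for the Docker daemon before onboarding starts. A minimal sketch of such a wait loop follows; the helper name `retry_until` and the attempt and delay values are illustrative assumptions, not taken from the manifest.

```shell
# Hypothetical helper: retry a command until it succeeds or the attempt
# budget runs out.
# Usage: retry_until <attempts> <delay_seconds> <command> [args...]
retry_until() {
  local attempts="$1"
  local delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # Success: stop retrying immediately.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (not executed here): wait for dockerd inside the workspace container.
#   retry_until 60 2 docker info || { echo "Docker daemon not ready" >&2; exit 1; }
```

Bounding the number of attempts lets the container fail visibly instead of hanging if the DinD sidecar never becomes ready.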

Testing Dynamo Inference

From inside the sandbox, the Dynamo endpoint is accessible via host.openshell.internal:8000 (bridged by socat proxy):

# List available models
wget -qO- http://host.openshell.internal:8000/v1/models
# {"object":"list","data":[{"id":"meta-llama/Llama-3.1-8B-Instruct","object":"model","created":1774352462,"owned_by":"nvidia"}]}

# Test chat completions
wget -qO- --post-data='{"model":"meta-llama/Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"Say hello"}],"max_tokens":20}' \
  --header='Content-Type: application/json' \
  http://host.openshell.internal:8000/v1/chat/completions
# {"choices":[{"message":{"content":"Hello. How can I assist you today?","role":"assistant"}}],"model":"meta-llama/Llama-3.1-8B-Instruct"}

How the K8s Deployment Works

  1. socat proxy bridges K8s DNS to the nested k3s environment (OpenShell sandbox)
  2. DYNAMO_HOST env var configures the external Dynamo service endpoint
  3. NEMOCLAW_DYNAMO_ENDPOINT uses host.openshell.internal:8000/v1 to reach Dynamo from inside the sandbox
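
The bridge in step 1 can be sketched roughly as follows. This is a sketch under assumptions rather than the manifest's actual script: it assumes `DYNAMO_HOST` holds a `host:port` pair, and the helper name `socat_bridge_cmd` is hypothetical.

```shell
# Hypothetical helper: print the socat command that listens on a local port
# and forwards each connection to the Dynamo service.
# Assumption: the target is given as "host:port".
socat_bridge_cmd() {
  local listen_port="$1"
  local target="$2"
  # fork: handle each connection in a child process.
  # reuseaddr: allow a quick restart on the same port.
  printf 'socat TCP-LISTEN:%s,fork,reuseaddr TCP:%s\n' "$listen_port" "$target"
}

# Example (not executed here): start the bridge in the background.
#   socat TCP-LISTEN:8000,fork,reuseaddr "TCP:${DYNAMO_HOST}" &
```

The sandbox then talks to `host.openshell.internal:8000`, and the Kubernetes DNS name is resolved only in the workspace container, where cluster DNS works.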

Deployment Steps

# Deploy Dynamo (if not already running)
kubectl apply -f dynamo-manifests/

# Deploy NemoClaw
kubectl apply -f k8s/nemoclaw-k8s.yaml

# Check logs
kubectl logs nemoclaw -n nemoclaw -c workspace -f
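
Before tailing logs it can help to wait for the pod to become Ready. The sketch below assumes the pod and namespace are both named `nemoclaw`, as in the log command above; the `wait_for_pod` helper and the `KUBECTL` override are hypothetical conveniences.

```shell
# KUBECTL can be overridden (for example KUBECTL=echo) for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: block until the pod reports Ready, or fail on timeout.
# Usage: wait_for_pod <pod> <namespace> [timeout]
wait_for_pod() {
  local pod="$1"
  local ns="$2"
  local timeout="${3:-600s}"
  "$KUBECTL" wait --for=condition=Ready "pod/${pod}" -n "$ns" --timeout="$timeout"
}

# Example (not executed here):
#   wait_for_pod nemoclaw nemoclaw 600s && kubectl logs nemoclaw -n nemoclaw -c workspace -f
```

Pod readiness may not coincide with onboarding having finished, so checking the workspace log for the final `[5/5]` line is still worthwhile.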

✅ K8s deployment with Dynamo inference working correctly.

Note: Requires PR #365 for Dynamo provider support in onboard.js.

- Add workspace shell access command
- Add sandbox status/logs/list commands
- Add chat completion test example
- Rename section from "What Can You Do?" to "Using NemoClaw"

Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>
@kjw3 kjw3 self-assigned this Mar 24, 2026
@coderabbitai
Contributor

coderabbitai bot commented Mar 25, 2026

📝 Walkthrough

Walkthrough

Adds an experimental Kubernetes deployment and documentation for running NemoClaw with a privileged Docker-in-Docker pod, socat proxying to an external inference endpoint, and an automated installer workflow for GPU inference within a k3s sandboxed environment.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Kubernetes Manifest (k8s/nemoclaw-k8s.yaml) | New Pod manifest for namespace nemoclaw: privileged DinD sidecar (docker:24-dind) + workspace container (node:22), init container writes Docker daemon config, socat proxies port 8000 to $DYNAMO_HOST, workspace waits for dockerd then runs NemoClaw installer; env vars exposed for configuration. |
| Deployment Documentation (k8s/README.md) | New guide with Quick Start, required env vars (DYNAMO_HOST, NEMOCLAW_ENDPOINT_URL, COMPATIBLE_API_KEY, NEMOCLAW_MODEL, optional NEMOCLAW_SANDBOX_NAME), example env: block, workspace access and sandbox commands, inference testing steps, architecture diagram describing DinD + k3s sandbox + socat proxy, and troubleshooting tips. |

Sequence Diagram(s)

sequenceDiagram
    participant K8s as Kubernetes API
    participant Init as init container
    participant DinD as docker:dind sidecar
    participant Workspace as workspace container
    participant Socat as socat proxy
    participant Inference as Dynamo/vLLM endpoint

    K8s->>Init: create pod & run init
    Init->>DinD: write /etc/docker/daemon.json (host cgroupns)
    K8s->>DinD: start dockerd (DinD)
    DinD-->>Workspace: expose docker socket (/var/run/docker.sock)
    Workspace->>DinD: poll `docker info` until ready
    Workspace->>Socat: start socat proxy listening :8000
    Socat->>Inference: forward traffic to $DYNAMO_HOST
    Workspace->>Inference: run NemoClaw installer (connect to NEMOCLAW_ENDPOINT_URL)
    Workspace->>Workspace: keep running (sleep infinity) after onboarding

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I tunneled through pods by moonlit code,

DinD and socat in a sandboxed node,
I stitched the endpoint with a hop and a cheer,
NemoClaw now hums — GPUs near! 🥕✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding Kubernetes deployment infrastructure with manifests and documentation for testing NemoClaw on Kubernetes. |


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

♻️ Duplicate comments (1)
k8s/dev/setup.sh (1)

14-16: ⚠️ Potential issue | 🟡 Minor

Default endpoint uses personal namespace.

The default VLLM_ENDPOINT references robert.svc.cluster.local, which appears to be a user-specific namespace. This should use a more generic default or placeholder.

Proposed fix
 # vLLM endpoint (Dynamo)
-VLLM_ENDPOINT="${VLLM_ENDPOINT:-http://vllm-agg-frontend.robert.svc.cluster.local:8000/v1}"
+VLLM_ENDPOINT="${VLLM_ENDPOINT:-http://vllm-agg-frontend.dynamo.svc.cluster.local:8000/v1}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/dev/setup.sh` around lines 14 - 16, The default VLLM_ENDPOINT in setup.sh
uses a personal namespace ("robert.svc.cluster.local"); update the VLLM_ENDPOINT
default to a generic or placeholder host (e.g., localhost,
vllm.svc.cluster.local, or a clearly documented placeholder) so it doesn't
reference a user-specific namespace; modify the VLLM_ENDPOINT variable
definition (VLLM_ENDPOINT="${VLLM_ENDPOINT:-...}") to the chosen generic default
and add a brief inline comment if needed to indicate it should be overridden in
deployment.
🧹 Nitpick comments (4)
k8s/dev/setup.sh (1)

7-8: Unused variable OPENSHELL_DIR.

OPENSHELL_DIR is defined but never used in the script. Consider removing it or documenting its intended purpose.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/dev/setup.sh` around lines 7 - 8, The script defines OPENSHELL_DIR
(OPENSHELL_DIR="${OPENSHELL_DIR:-$NEMOCLAW_DIR/../OpenShell}") but never uses
it; either remove this unused variable declaration or implement its intended
usage where needed. Locate the declaration near the existing NEMOCLAW_DIR and
SCRIPT_DIR logic, then either delete the OPENSHELL_DIR line to avoid dead
variables or add the appropriate references to OPENSHELL_DIR where the OpenShell
path is required (e.g., export it, use it for path resolution, or document its
purpose).
k8s/install.sh (1)

84-85: Deprecated kubectl version --short flag.

The --short flag for kubectl version is deprecated. The fallback uses JSON parsing which is the correct approach, but the primary command will emit a deprecation warning.

Proposed fix
-info "kubectl found: $(kubectl version --client --short 2>/dev/null || kubectl version --client -o json | grep gitVersion | head -1)"
+info "kubectl found: $(kubectl version --client -o json 2>/dev/null | jq -r '.clientVersion.gitVersion' 2>/dev/null || kubectl version --client 2>&1 | head -1)"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/install.sh` around lines 84 - 85, Replace the deprecated `kubectl version
--client --short` usage in the info call with the JSON-based client version
parsing: run `kubectl version --client -o json`, extract the client gitVersion
(e.g., parsing `.clientVersion.gitVersion`), and then pass that extracted
version into the info message; update the invocation around the info function so
it no longer uses `--short` and always falls back to the JSON parsing logic
currently used.
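
A sketch of the JSON-based version lookup suggested above. The helper name `client_git_version` is hypothetical, and the sed fallback assumes the `gitVersion` key and its value sit on one line, which holds for the usual `kubectl version --client -o json` layout.

```shell
# Hypothetical helper: read `kubectl version --client -o json` output on
# stdin and print the client gitVersion.
# Uses jq when it is installed, otherwise falls back to sed.
client_git_version() {
  if command -v jq >/dev/null 2>&1; then
    jq -r '.clientVersion.gitVersion'
  else
    sed -n 's/.*"gitVersion": *"\([^"]*\)".*/\1/p' | head -n 1
  fi
}

# Example (not executed here):
#   info "kubectl found: $(kubectl version --client -o json 2>/dev/null | client_git_version)"
```

The fallback keeps the preflight check usable on hosts without jq, which the proposed fix would otherwise require.
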
k8s/dev/openshell-gateway.yaml (1)

6-7: Consider using a dedicated namespace instead of default.

Deploying to the default namespace may conflict with other workloads and makes cleanup harder. The other manifests use the nemoclaw namespace. For consistency and isolation, consider using the same namespace or making it configurable.

Proposed change
 metadata:
   name: openshell-gateway
-  namespace: default
+  namespace: nemoclaw
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/dev/openshell-gateway.yaml` around lines 6 - 7, Update the
Deployment/Service manifest for openshell-gateway to avoid using the default
namespace: replace the hard-coded namespace value "default" with the dedicated
namespace used by other manifests (e.g., "nemoclaw") or switch it to a
configurable placeholder so deployments can target the proper namespace; locate
the resource named "openshell-gateway" and change its namespace field
accordingly and ensure any RBAC/ServiceAccount references (if present) align
with the new namespace.
k8s/dev/nemoclaw-installer-test.yaml (1)

89-95: Using emptyDir for Docker storage may cause ephemeral storage exhaustion.

The main manifest (k8s/nemoclaw-k8s.yaml) explicitly uses hostPath with a note that emptyDir causes ephemeral storage eviction during sandbox image builds. This test manifest uses emptyDir which could make tests flaky on clusters with limited ephemeral storage.

Consider whether this is intentional for simpler test cleanup, or if it should align with the main manifest's approach.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/dev/nemoclaw-installer-test.yaml` around lines 89 - 95, The test manifest
uses emptyDir for the volumes "docker-storage", "docker-socket", and
"docker-config", which can exhaust ephemeral node storage during sandbox image
builds; update these volume definitions to match the main manifest by switching
from emptyDir to hostPath (or another persistent volume type) for
"docker-storage" (and consider hostPath for the others) so builds won't be
evicted, or if ephemeral storage is intentional add a comment and make the
choice configurable for CI; locate and modify the volume entries named
docker-storage, docker-socket, and docker-config in the test manifest to
implement this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@k8s/dev/nemoclaw-installer-test.yaml`:
- Around line 60-63: Replace the mismatched env var names
NEMOCLAW_DYNAMO_ENDPOINT and NEMOCLAW_DYNAMO_MODEL with the expected names
NEMOCLAW_ENDPOINT_URL and NEMOCLAW_MODEL so the onboard consumer reads them;
update the values only (keep the same endpoint and model strings) and ensure any
references in the deployment spec or container env section that currently use
NEMOCLAW_DYNAMO_* now point to NEMOCLAW_ENDPOINT_URL and NEMOCLAW_MODEL to match
the loader that reads those exact keys.

In `@k8s/dev/setup.sh`:
- Around line 181-185: Update the example curl in setup.sh to use HTTP rather
than HTTPS to match k8s/README.md: replace the request URL in the curl example
that currently targets "https://inference.local/v1/chat/completions" with
"http://inference.local/v1/chat/completions" (the rest of the invocation using
$VLLM_MODEL and the JSON payload stays the same) so the inline sandbox test
reflects the documented working protocol.
- Around line 119-131: The echo and variable expansion inside the bash -c string
are broken because $VLLM_HOST_PORT is inside single quotes; change the inner
echo to use double quotes (or otherwise ensure $VLLM_HOST_PORT is not
single-quoted) so the variable expands, and verify the nohup socat
TCP:$VLLM_HOST_PORT invocation is also evaluated in the same context; update the
echo line ("socat proxy started: localhost:8000 -> $VLLM_HOST_PORT") and any
other single-quoted occurrences that should expand the POD_NAME, NAMESPACE or
VLLM_HOST_PORT variables to use double quotes or proper escaping.
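
The quoting problem described above can be reproduced in isolation. This sketch is not the code from setup.sh: the sample value and the function names are illustrative, and `sh -c` stands in for the remote `bash -c` invocation.

```shell
# Sample value for illustration only; not exported, so child shells do not see it.
VLLM_HOST_PORT="vllm-agg-frontend.dynamo.svc.cluster.local:8000"

# Single quotes: the inner shell receives the literal text $VLLM_HOST_PORT,
# which is unset there, so it expands to an empty string.
broken() {
  sh -c 'echo "proxy -> $VLLM_HOST_PORT"'
}

# Double quotes: the outer shell expands the variable before the inner
# shell starts, so the value is embedded in the command string.
fixed() {
  sh -c "echo \"proxy -> $VLLM_HOST_PORT\""
}

# Positional argument: keeps the script single-quoted and passes the value
# as $1, which avoids re-parsing the value as shell code.
safer() {
  sh -c 'echo "proxy -> $1"' _ "$VLLM_HOST_PORT"
}
```

The positional-argument form is usually the most robust of the three, because the value is never interpreted as part of the inner script.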

In `@k8s/dev/test-installer.sh`:
- Around line 41-42: DYNAMO_MODEL is being defined but never used; either pass
it into the generated YAML or remove it. If you want to keep it, update the code
that prepares the manifest (the sed/substitution logic that currently looks at
NEMOCLAW_DYNAMO_ENDPOINT) to also substitute DYNAMO_MODEL into the template
(reference DYNAMO_MODEL and the sed/replace step), or if the model value should
come from the YAML default, remove the DYNAMO_MODEL variable entirely. Also
change the default DYNAMO_ENDPOINT from the user-specific "robert" namespace to
a neutral default or document that it must be overridden (reference
NEMOCLAW_DYNAMO_ENDPOINT).

In `@k8s/install.sh`:
- Around line 319-323: The script clones a personal-fork branch (git clone
... --branch rwipfelnv/dynamo-support https://github.com/rwipfelnv/NemoClaw.git).
Replace the fork/branch reference with the official upstream repository and an
appropriate branch (e.g., change the URL to
https://github.com/NVIDIA/NemoClaw.git and set --branch to the official branch
name such as main or the upstream dynamo-support branch) so that the command
"git clone --depth 1 --branch rwipfelnv/dynamo-support
https://github.com/rwipfelnv/NemoClaw.git nemoclaw-src" pulls from the official
repo before merging.
- Around line 229-236: The env var names in this manifest are inconsistent with
the onboarding/verification script: change the incorrectly named
NEMOCLAW_DYNAMO_ENDPOINT and NEMOCLAW_DYNAMO_MODEL to the canonical env var
names used by the onboarding code (match the expected names in
nemoclaw-k8s.yaml/verification script), keeping NEMOCLAW_PROVIDER="dynamo" and
NEMOCLAW_SANDBOX_NAME unchanged; update any references to these symbols so the
onboarding function reads the correct variables (verify by comparing the exact
env var identifiers in the verification manifest).
- Around line 298-306: The HOST_PORT extraction fails for endpoints starting
with https://; update the HOST_PORT computation (the code that parses ENDPOINT
into HOST_PORT) to strip both http:// and https:// prefixes and then remove the
trailing /v1, e.g. use a single sed that handles optional "s" (s|^https\?://||)
or use shell parameter expansion (strip protocol then remove /v1) so HOST_PORT
is correct for both HTTP and HTTPS endpoints; modify the lines that set ENDPOINT
and HOST_PORT accordingly.
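
The parameter-expansion variant mentioned above might look like the sketch below. The function name `endpoint_to_host_port` is hypothetical, and the sketch assumes the endpoint has the form `scheme://host:port/v1`.

```shell
# Hypothetical helper: strip an optional http:// or https:// scheme and a
# trailing /v1 path, leaving host:port.
endpoint_to_host_port() {
  local endpoint="$1"
  endpoint="${endpoint#http://}"   # no-op unless the prefix matches
  endpoint="${endpoint#https://}"
  endpoint="${endpoint%/}"         # tolerate a trailing slash
  endpoint="${endpoint%/v1}"
  printf '%s\n' "$endpoint"
}

# Example (not executed here):
#   HOST_PORT="$(endpoint_to_host_port "$ENDPOINT")"
```

Using shell expansion instead of sed avoids differences between sed implementations in how `\?` is treated.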

In `@k8s/nemoclaw-k8s.yaml`:
- Around line 148-153: The init container currently clones the personal
fork/branch "rwipfelnv/dynamo-support" from "rwipfelnv/NemoClaw.git"; update the
git clone command to point to the official NVIDIA repository and a stable branch
(e.g., change "git clone --depth 1 --branch rwipfelnv/dynamo-support
https://github.com/rwipfelnv/NemoClaw.git" to use
"https://github.com/NVIDIA/NemoClaw.git" and a production branch like "main" or
a specific release tag) so the commands in the block (git clone, npm install,
npm run build) use the upstream source instead of a personal fork.

In `@k8s/README.md`:
- Around line 98-100: The fenced code blocks in README.md for the shell prompt
and the architecture diagram are missing language identifiers; update the
opening backtick fences for the examples around the shell prompt (the block
containing "sandbox@my-assistant:~$") and the ASCII architecture diagram (the
block starting with the box top "┌────────────────…") to include a language tag
such as text or plaintext (e.g., change ``` to ```text) so static analysis
recognizes them; ensure you change both the block around the prompt and the
larger diagram block (lines indicated in the review) to use the same tag.
- Around line 239-253: Replace the emoji status markers in the markdown tables
with text alternatives to match the "No emoji in technical prose" guideline:
change ✅ to "Working" (or "Yes") and ❌ to "Blocked" (or "No") in the status
column entries for rows such as "Unattended onboard", "Sandbox creation",
"Connect to sandbox", "HTTP inference", "Socat proxy bridge", and the "What
Doesn't Work Yet" rows "openclaw tui", "openclaw agent", "HTTPS inference";
ensure the table alignment and pipe separators remain valid and update any
inline notes that reference emojis to use the same textual statuses.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 495b5934-f061-4f76-b4fa-0e5ac8381a18

📥 Commits

Reviewing files that changed from the base of the PR and between b2164e7 and 412e6aa.

📒 Files selected for processing (7)
  • k8s/README.md
  • k8s/dev/nemoclaw-installer-test.yaml
  • k8s/dev/openshell-gateway.yaml
  • k8s/dev/setup.sh
  • k8s/dev/test-installer.sh
  • k8s/install.sh
  • k8s/nemoclaw-k8s.yaml

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
k8s/README.md (1)

127-148: ⚠️ Potential issue | 🟡 Minor

Add a language identifier to the architecture fenced block.

The code fence starting at Line 127 is missing a language tag (text/plaintext), which trips markdown linting.

Proposed fix
-```
+```text
 ┌─────────────────────────────────────────────────────────────────┐
 ...
-```
+```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/README.md` around lines 127 - 148, The fenced ASCII-art block beginning
with the triple backticks (``` ) that contains the "Kubernetes Cluster" diagram
is missing a language tag; update its opening fence to include a language
identifier (e.g., change ``` to ```text) so the block is treated as plain text
by Markdown linters and renderers—look for the block that starts with the box
drawing "┌─────────────────────────────────────────────────────────────────┐"
and modify its opening fence accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@k8s/nemoclaw-k8s.yaml`:
- Around line 69-71: Replace the inline pipe-exec of the remote installer ("curl
-fsSL https://nvidia.com/nemoclaw.sh | bash") with a download-then-verify flow:
curl or wget the installer into a local file (e.g., nemoclaw.sh), obtain and
verify a pinned checksum or GPG signature for that exact release (or fetch a
versioned artifact URL instead of the root URL), only after successful
verification set secure permissions and execute the local file (bash
./nemoclaw.sh); ensure failure paths abort the job and log the verification
error.
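
A minimal sketch of the download-then-verify flow described above. `INSTALLER_URL` and `INSTALLER_SHA256` are placeholders: the source gives neither a versioned artifact URL nor a published digest, and the helper name `verify_sha256` is hypothetical.

```shell
# Hypothetical helper: return 0 only if the file's SHA-256 digest equals
# the expected value.
verify_sha256() {
  local file="$1"
  local expected="$2"
  local actual
  # sha256sum prints "<digest>  <file>"; keep only the digest.
  actual="$(sha256sum "$file" 2>/dev/null | awk '{print $1}')"
  [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Example flow (not executed here); both variables are placeholders.
#   curl -fsSL -o nemoclaw.sh "$INSTALLER_URL"
#   verify_sha256 nemoclaw.sh "$INSTALLER_SHA256" \
#     || { echo "installer checksum mismatch" >&2; exit 1; }
#   bash ./nemoclaw.sh
```

A pinned digest only helps if the URL points at an immutable, versioned artifact; a checksum for a script that changes in place would need updating on every release.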

In `@k8s/README.md`:
- Around line 92-97: The README's curl examples use https://inference.local but
the deployment in this PR routes through socat over plain HTTP; update the two
examples (the model list curl and the chat completions curl showing model
"meta-llama/Llama-3.1-8B-Instruct") and the later example at line 107 to use the
configured HTTP endpoint (http://host.openshell.internal:8000/v1) instead of
https://inference.local so the documented commands match the running proxy and
will succeed.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4c5bce22-7b14-4615-8e74-e9df88aa1560

📥 Commits

Reviewing files that changed from the base of the PR and between 412e6aa and 1e7e5e7.

📒 Files selected for processing (2)
  • k8s/README.md
  • k8s/nemoclaw-k8s.yaml

- Use official NemoClaw installer (`curl | bash`) instead of git clone/build
- Switch to `custom` provider from PR NVIDIA#648 (supersedes dynamo-specific provider)
- Remove k8s/dev/ directory (no longer needed for testing)
- Use emptyDir volumes for portability across clusters
- Add /etc/hosts workaround for endpoint validation during onboarding
- Update README with verification steps for local inference

Tested end-to-end with Dynamo vLLM backend.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@rwipfelnv rwipfelnv force-pushed the rwipfelnv/k8s-testing branch from 1e7e5e7 to 01f944a Compare March 25, 2026 16:00
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (3)
k8s/README.md (2)

127-127: ⚠️ Potential issue | 🟡 Minor

Add a language identifier to the architecture fenced block.

The fenced code block starting on Line 127 should include a language tag (e.g., text) to satisfy markdown lint rules.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/README.md` at line 127, The architecture fenced code block in
k8s/README.md is missing a language identifier; update the opening fence from
``` to a tagged fence such as ```text (or another appropriate language) so the
block satisfies markdown lint rules and renders with correct syntax
highlighting.

92-97: ⚠️ Potential issue | 🟠 Major

Fix endpoint/protocol mismatch in inference examples.

Lines 93, 95, and 107 use https://inference.local, but the documented deployment path uses the HTTP socat bridge (http://host.openshell.internal:8000/v1 / localhost proxy). These commands are likely to fail as written.

Also applies to: 107-107

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/README.md` around lines 92 - 97, The curl examples in the README use the
wrong protocol/host (https://inference.local) which doesn't match the documented
socat bridge; update each example (the two curl commands shown for listing
models and chat completions and any other occurrences such as the one at line
107) to use the documented HTTP bridge host and port
(http://host.openshell.internal:8000/v1 or the equivalent localhost proxy) so
the paths and protocol match the deployed inference endpoint.
k8s/nemoclaw-k8s.yaml (1)

69-71: ⚠️ Potential issue | 🟠 Major

Avoid curl | bash for installer execution; add integrity verification.

Line 71 still executes remote code directly. Downloading a versioned artifact and verifying checksum/signature before execution is safer and reproducible.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/nemoclaw-k8s.yaml` around lines 69 - 71, The line that pipes remote
script into bash ("curl -fsSL https://nvidia.com/nemoclaw.sh | bash") should be
replaced by a safe, verifiable install flow: download a versioned artifact to a
local file (e.g., download https://nvidia.com/nemoclaw-<version>.sh), fetch the
corresponding checksum or signature, verify the file integrity (sha256sum or gpg
verify) before execution, and only then run the local installer script; update
the script invocation around the echo "[4/4] Running NemoClaw installer..." and
replace the direct pipe with this download -> verify -> execute sequence using
the unique installer invocation string "nemoclaw.sh" to locate and change the
code.
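
The download, verify, execute sequence described above could look roughly like the following shell sketch. The helper names, and the premise that a SHA-256 checksum for the installer is published or pinned somewhere, are assumptions for illustration; the manifest under review only contains the `curl -fsSL https://nvidia.com/nemoclaw.sh | bash` invocation.

```shell
#!/bin/sh
# Sketch only: assumes a pinned SHA-256 for the installer is known out of band.

# verify_sha256 FILE EXPECTED_HEX
# Succeeds only when FILE's SHA-256 digest equals EXPECTED_HEX.
verify_sha256() {
  vs_actual="$(sha256sum "$1" | awk '{print $1}')"
  [ "$vs_actual" = "$2" ]
}

# install_verified URL EXPECTED_HEX
# Downloads URL to a temporary file, verifies it, and only then executes it.
install_verified() {
  iv_tmp="$(mktemp)"
  if ! curl -fsSL -o "$iv_tmp" "$1"; then
    echo "download failed: $1" >&2
    rm -f "$iv_tmp"
    return 1
  fi
  if ! verify_sha256 "$iv_tmp" "$2"; then
    echo "checksum mismatch for $1; refusing to run" >&2
    rm -f "$iv_tmp"
    return 1
  fi
  bash "$iv_tmp"
  iv_status=$?
  rm -f "$iv_tmp"
  return "$iv_status"
}

# Hypothetical usage (the checksum variable is a placeholder, not a real value):
#   install_verified "https://nvidia.com/nemoclaw.sh" "$NEMOCLAW_INSTALLER_SHA256"
```

Pinning a checksum against an unversioned URL means the install fails whenever the upstream script changes, which is the intended behavior for reproducible test infrastructure but does require updating the pin deliberately.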
🧹 Nitpick comments (1)
k8s/nemoclaw-k8s.yaml (1)

38-123: Set explicit container security contexts (even if DinD remains privileged).

workspace and init-docker-config rely on defaults. Please explicitly set least-privilege controls (allowPrivilegeEscalation, capabilities, seccompProfile) to improve security posture and satisfy baseline policy checks.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/nemoclaw-k8s.yaml` around lines 38 - 123, Add explicit securityContext
blocks for the workspace and init-docker-config containers: for the workspace
container (name: workspace) add a securityContext that documents DinD needs
(e.g., privileged: true if required), set allowPrivilegeEscalation explicitly,
drop all capabilities (capabilities: { drop: ["ALL"] }) and set seccompProfile
to RuntimeDefault; for the init container (name: init-docker-config) add a
least-privilege securityContext with allowPrivilegeEscalation: false,
capabilities drop ALL, seccompProfile: RuntimeDefault, and enforce
readOnlyRootFilesystem: true and runAsNonRoot: true where compatible. Ensure
these securityContext entries are added under the corresponding container specs
(workspace and init-docker-config) so policy checks pass while preserving
required DinD permissions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@k8s/nemoclaw-k8s.yaml`:
- Around line 117-122: Replace the ephemeral emptyDir volumes used for DinD
layers with a host-backed or persistent volume: change the volume definitions
for docker-storage (and consider docker-socket/docker-config if they must
persist) from emptyDir to either a hostPath (pointing to a host directory) or
reference a PersistentVolumeClaim, and update any related volumeMounts and pod
spec to use that host-backed volume; ensure the chosen PersistentVolume/PVC or
hostPath has appropriate permissions and retention so DinD image layers are not
evicted during builds.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6142f3a7-46ea-461e-b9f6-9c3ad360eeba

📥 Commits

Reviewing files that changed from the base of the PR and between 1e7e5e7 and 01f944a.

📒 Files selected for processing (2)
  • k8s/README.md
  • k8s/nemoclaw-k8s.yaml

kjw3 and others added 3 commits March 26, 2026 09:34
- Remove multi-document YAML (move namespace creation to README)
- Add language specifier to fenced code block (```text)
- Add blank lines before lists per markdownlint rules

Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

- Add explicit experimental warning at top of README
- Clarify this is for trying NemoClaw on k8s, not production
- Document privileged pod and DinD requirements upfront
- Add resource requirements to prerequisites

Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
k8s/nemoclaw-k8s.yaml (1)

15-15: Pin container images by digest for reproducibility.

Using floating tags (docker:24-dind, node:22, busybox) can silently change behavior between runs. Prefer immutable digest pins, especially for repeatable test infra.

📌 Example pinning pattern
-      image: docker:24-dind
+      image: docker:24-dind@sha256:<docker-dind-digest>
...
-      image: node:22
+      image: node:22@sha256:<node-digest>
...
-      image: busybox
+      image: busybox@sha256:<busybox-digest>

Also applies to: 36-36, 105-105

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/nemoclaw-k8s.yaml` at line 15, The image fields currently use floating
tags (e.g., the entries with "image: docker:24-dind", "image: node:22", and
"image: busybox") which should be replaced with immutable digest-pinned
references; update each image value to the corresponding registry digest form
(e.g., docker:24-dind@sha256:<digest>) for the occurrences referenced in the
diff (the image lines at 15, 36, and 105) so the manifest always pulls an exact,
reproducible image; obtain the correct sha256 digest from the image registry
(Docker Hub or your registry) and replace the tag-only values with the full
digest-pinned strings.
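
As a rough illustration of the pinning step, the sketch below rewrites a tag-only `image:` line into its digest-pinned form. `pin_image` is a hypothetical helper, not something in the PR, and the digest values still have to be looked up from the registry by whoever applies it.

```shell
#!/bin/sh
# Sketch only: pin_image is a hypothetical helper, not part of the PR.

# pin_image MANIFEST IMAGE DIGEST
# Prints MANIFEST with every "image: IMAGE" line rewritten to "image: IMAGE@DIGEST".
# IMAGE is used as a sed pattern, so a "." in it matches any character.
pin_image() {
  sed "s|image: $2\$|image: $2@$3|" "$1"
}

# One way to look up a digest for a locally pulled image
# (the output has the form name@sha256:...):
#   docker pull node:22
#   docker inspect --format '{{index .RepoDigests 0}}' node:22
# Hypothetical usage, with <node-digest> standing in for the real value:
#   pin_image k8s/nemoclaw-k8s.yaml node:22 'sha256:<node-digest>' > pinned.yaml
```

The helper writes to standard output rather than editing in place, so the result can be diffed against the original manifest before it replaces it.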
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@k8s/nemoclaw-k8s.yaml`:
- Around line 11-12: Add automountServiceAccountToken: false at the Pod spec
level to disable automatic mounting of the service account token; update the
top-level spec (the same block that contains containers) to include
automountServiceAccountToken: false so the pod running Docker-in-Docker and the
NemoClaw installer will not receive a mounted service-account token.
- Around line 50-53: The socat bridge is started in the background (the `socat
TCP-LISTEN:8000,... &` line) but not validated; modify this block to capture its
PID ($!), wait briefly and verify it is listening or still running (e.g., loop
with `kill -0 $SOCAT_PID` or test port 8000 with `nc`/`ss`), and if it has
exited or the port is not open within a short timeout log an error and exit
non‑zero so onboarding fails fast; do not rely on `set -e` for background
processes and keep the existing hosts entry (`echo "127.0.0.1
host.openshell.internal" >> /etc/hosts`) logic.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7befcf59-fe03-4fb4-8018-3570724a1e91

📥 Commits

Reviewing files that changed from the base of the PR and between 01f944a and 7c2b970.

📒 Files selected for processing (2)
  • k8s/README.md
  • k8s/nemoclaw-k8s.yaml
✅ Files skipped from review due to trivial changes (1)
  • k8s/README.md

Comment on lines +11 to +12
spec:
containers:

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n k8s/nemoclaw-k8s.yaml

Repository: NVIDIA/NemoClaw

Length of output: 4492


Disable automatic service-account token mount for this pod.

The pod does not call the Kubernetes API—it only runs Docker-in-Docker and the NemoClaw installer. Mounting a token by default increases blast radius if compromised. Add automountServiceAccountToken: false at the spec level.

🔒 Proposed hardening
 spec:
+  automountServiceAccountToken: false
   containers:
🧰 Tools
🪛 Trivy (0.69.3)

[error] 11-119: Default security context configured

pod nemoclaw in nemoclaw namespace is using the default security context, which allows root privileges

Rule: KSV-0118


(IaC/Kubernetes)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/nemoclaw-k8s.yaml` around lines 11 - 12, Add
automountServiceAccountToken: false at the Pod spec level to disable automatic
mounting of the service account token; update the top-level spec (the same block
that contains containers) to include automountServiceAccountToken: false so the
pod running Docker-in-Docker and the NemoClaw installer will not receive a
mounted service-account token.
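
Once the field is added, one way to confirm it took effect on the running pod is a small check like the sketch below. The helper name is hypothetical; the pod and namespace names (`nemoclaw` in `nemoclaw`) are taken from the Trivy finding above and may differ in other deployments.

```shell
#!/bin/sh
# Sketch only: token_automount_disabled is a hypothetical helper.

# token_automount_disabled POD NAMESPACE
# Succeeds only when the pod spec explicitly sets
# automountServiceAccountToken to false.
token_automount_disabled() {
  ta_value="$(kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.spec.automountServiceAccountToken}')"
  [ "$ta_value" = "false" ]
}

# Usage, assuming the pod and namespace are both named nemoclaw:
#   if token_automount_disabled nemoclaw nemoclaw; then
#     echo "service-account token is not mounted"
#   else
#     echo "token automount is still enabled (or unset)" >&2
#   fi
```

An empty result is treated as a failure because an unset field means the token is still mounted by default.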

Comment on lines +50 to +53
socat TCP-LISTEN:8000,fork,reuseaddr TCP:$DYNAMO_HOST &
# Add hosts entry so validation can reach socat via host.openshell.internal
echo "127.0.0.1 host.openshell.internal" >> /etc/hosts
sleep 1

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

find . -name "nemoclaw-k8s.yaml" -type f

Repository: NVIDIA/NemoClaw

Length of output: 82


🏁 Script executed:

cat -n ./k8s/nemoclaw-k8s.yaml | head -80

Repository: NVIDIA/NemoClaw

Length of output: 3131


🏁 Script executed:

cat -n ./k8s/nemoclaw-k8s.yaml | sed -n '40,70p'

Repository: NVIDIA/NemoClaw

Length of output: 1416


Fail fast if the socat bridge does not start.

At line 50, socat is backgrounded but its health is never validated. If it exits immediately, onboarding continues and fails later with less actionable errors. Note that set -e only applies to foreground commands, not background processes.

🛠️ Proposed reliability check
-          socat TCP-LISTEN:8000,fork,reuseaddr TCP:$DYNAMO_HOST &
+          : "${DYNAMO_HOST:?DYNAMO_HOST must be set as host:port}"
+          socat TCP-LISTEN:8000,fork,reuseaddr TCP:$DYNAMO_HOST &
+          SOCAT_PID=$!
           # Add hosts entry so validation can reach socat via host.openshell.internal
           echo "127.0.0.1 host.openshell.internal" >> /etc/hosts
           sleep 1
+          kill -0 "$SOCAT_PID" 2>/dev/null || { echo "socat failed to start"; exit 1; }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/nemoclaw-k8s.yaml` around lines 50 - 53, The socat bridge is started in
the background (the `socat TCP-LISTEN:8000,... &` line) but not validated;
modify this block to capture its PID ($!), wait briefly and verify it is
listening or still running (e.g., loop with `kill -0 $SOCAT_PID` or test port
8000 with `nc`/`ss`), and if it has exited or the port is not open within a
short timeout log an error and exit non‑zero so onboarding fails fast; do not
rely on `set -e` for background processes and keep the existing hosts entry
(`echo "127.0.0.1 host.openshell.internal" >> /etc/hosts`) logic.
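
A polling variant of the check described above might look like the following sketch. The helper names and the three-check window are assumptions; the `socat` command line and the hosts entry are copied from the manifest lines under review. It only verifies that the process stays alive, not that port 8000 is accepting connections.

```shell
#!/bin/sh
# Sketch only: helper names and the 3-check window are assumptions.

# still_running PID CHECKS
# Polls once per second, CHECKS times; fails as soon as PID has exited.
still_running() {
  sr_i=0
  while [ "$sr_i" -lt "$2" ]; do
    kill -0 "$1" 2>/dev/null || return 1
    sleep 1
    sr_i=$((sr_i + 1))
  done
  kill -0 "$1" 2>/dev/null
}

# start_bridge
# Starts the socat bridge in the background and reports failure
# if it dies during the first few seconds.
start_bridge() {
  : "${DYNAMO_HOST:?DYNAMO_HOST must be set as host:port}"
  socat TCP-LISTEN:8000,fork,reuseaddr "TCP:$DYNAMO_HOST" &
  sb_pid=$!
  # Keep the existing hosts entry so validation can reach socat.
  echo "127.0.0.1 host.openshell.internal" >> /etc/hosts
  if ! still_running "$sb_pid" 3; then
    echo "socat bridge failed to start" >&2
    return 1
  fi
}
```

Calling `start_bridge || exit 1` from the container script would make onboarding stop at the bridge step instead of failing later with a less specific error.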

@kjw3 kjw3 merged commit 3c7bd93 into NVIDIA:main Mar 30, 2026
8 checks passed
realkim93 pushed a commit to realkim93/NemoClaw that referenced this pull request Mar 30, 2026
* feat: add Kubernetes testing infrastructure

Add k8s-testing/ directory with scripts and manifests for testing NemoClaw
on Kubernetes with Dynamo vLLM inference.

Includes:
- test-installer.sh: Public installer test (requires unattended install support)
- setup.sh: Manual setup from source for development
- Pod manifests for Docker-in-Docker execution

Architecture: OpenShell runs k3s inside Docker, so we use DinD pods
to provide the Docker daemon on Kubernetes.

Signed-off-by: rwipfelnv

* fix: add socat proxy for K8s DNS isolation workaround

OpenShell's nested k3s cluster cannot resolve Kubernetes DNS names,
so inference requests fail with 502 Bad Gateway. This adds:

- socat TCP proxy setup in setup.sh to forward localhost:8000 to the
  K8s vLLM service endpoint
- Provider configuration using host.openshell.internal:8000 which
  resolves to the workspace container from inside k3s
- Documentation explaining the network architecture and workaround
- Updated env var names to match PR NVIDIA#318 (NEMOCLAW_NON_INTERACTIVE)
- cgroup v2 compatibility fix for Docker daemon
- Removed memory limits that caused OOM

Tested: Inference requests from sandboxes now route correctly through
the socat proxy to the Dynamo vLLM endpoint.

Depends on: NVIDIA#318 (non-interactive mode), NVIDIA#365 (Dynamo provider)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: NemoKlaw - NemoClaw on Kubernetes with Dynamo support

Complete K8s deployment solution for NemoClaw:
- nemoklaw.yaml: Pod manifest with DinD, init containers, hostPath storage
- install.sh: Interactive installer with preflight checks
- Rename k8s-testing -> k8s, move old files to dev/

Key learnings:
- hostPath storage (/mnt/k8s-disks) avoids ephemeral storage eviction
- Init containers for docker config, openshell CLI, NemoClaw build
- Workspace container installs apt packages at runtime (can't share via volumes)
- socat proxy bridges K8s DNS to nested k3s (host.openshell.internal)

Tested successfully with Dynamo vLLM backend on EKS.

Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>

* fix: rename NemoKlaw to NemoClaw and document known limitations

Address PR feedback:
- Rename NemoKlaw -> NemoClaw (avoid confusing naming)
- Rename nemoklaw.yaml -> nemoclaw-k8s.yaml
- Fix hardcoded endpoint to use generic example
- Remove log file from repo
- Document known limitations (HTTPS proxy issue)
- Update README with accurate status of what works/doesn't work

Signed-off-by: rwipfelnv
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: update DYNAMO_HOST to vllm-agg-frontend

The aggregated frontend service is the correct endpoint for
Dynamo vLLM inference.

Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>

* docs: add Using NemoClaw section with CLI commands

- Add workspace shell access command
- Add sandbox status/logs/list commands
- Add chat completion test example
- Rename section from "What Can You Do?" to "Using NemoClaw"

Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>

* refactor(k8s): simplify deployment to use official installer

- Use official NemoClaw installer (`curl | bash`) instead of git clone/build
- Switch to `custom` provider from PR NVIDIA#648 (supersedes dynamo-specific provider)
- Remove k8s/dev/ directory (no longer needed for testing)
- Use emptyDir volumes for portability across clusters
- Add /etc/hosts workaround for endpoint validation during onboarding
- Update README with verification steps for local inference

Tested end-to-end with Dynamo vLLM backend.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(k8s): resolve lint errors in yaml and markdown

- Remove multi-document YAML (move namespace creation to README)
- Add language specifier to fenced code block (```text)
- Add blank lines before lists per markdownlint rules

Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

* docs(k8s): add experimental warning and clarify requirements

- Add explicit experimental warning at top of README
- Clarify this is for trying NemoClaw on k8s, not production
- Document privileged pod and DinD requirements upfront
- Add resource requirements to prerequisites

Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

---------

Signed-off-by: rwipfelnv
Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Co-authored-by: KJ <kejones@nvidia.com>

Labels

- Docker: Support for Docker containerization
- enhancement: New feature or request
- K8s: Use this label to identify Kubernetes deployment issues with NemoClaw.


5 participants