14 changes: 7 additions & 7 deletions .agents/skills/debug-navigator-cluster/SKILL.md
@@ -1,17 +1,17 @@
---
name: debug-navigator-cluster
- description: Debug why a nemoclaw cluster failed to start or is unhealthy. Use when the user has a failed `ncl cluster admin deploy`, cluster health check failure, or wants to diagnose cluster infrastructure issues. Trigger keywords - debug cluster, cluster failing, cluster not starting, deploy failed, cluster troubleshoot, cluster health, cluster diagnose, why won't my cluster start, health check failed.
+ description: Debug why a nemoclaw cluster failed to start or is unhealthy. Use when the user has a failed `nemoclaw cluster admin deploy`, cluster health check failure, or wants to diagnose cluster infrastructure issues. Trigger keywords - debug cluster, cluster failing, cluster not starting, deploy failed, cluster troubleshoot, cluster health, cluster diagnose, why won't my cluster start, health check failed.
---

# Debug NemoClaw Cluster

- Diagnose why a nemoclaw cluster failed to start after `ncl cluster admin deploy`.
+ Diagnose why a nemoclaw cluster failed to start after `nemoclaw cluster admin deploy`.

## Overview

- `ncl cluster admin deploy` creates a Docker container running k3s with the NemoClaw server and Envoy Gateway deployed via Helm. The deployment stages, in order, are:
+ `nemoclaw cluster admin deploy` creates a Docker container running k3s with the NemoClaw server and Envoy Gateway deployed via Helm. The deployment stages, in order, are:

- 1. **Pre-deploy check**: `ncl cluster admin deploy` in interactive mode prompts to **reuse** (keep volume, clean stale nodes) or **recreate** (destroy everything, fresh start). `mise run cluster` always recreates before deploy.
+ 1. **Pre-deploy check**: `nemoclaw cluster admin deploy` in interactive mode prompts to **reuse** (keep volume, clean stale nodes) or **recreate** (destroy everything, fresh start). `mise run cluster` always recreates before deploy.
2. Ensure cluster image is available (local build or remote pull)
3. Create Docker network (`navigator-cluster`) and volume (`navigator-cluster-{name}`)
4. Create and start a privileged Docker container (`navigator-cluster-{name}`)
@@ -31,7 +31,7 @@ For local deploys, metadata endpoint selection now depends on Docker connectivit
- default local Docker socket (`unix:///var/run/docker.sock`): `https://127.0.0.1:{port}` (default port 8080)
- TCP Docker daemon (`DOCKER_HOST=tcp://<host>:<port>`): `https://<host>:{port}` for non-loopback hosts

- The host port is configurable via `--port` on `ncl cluster admin deploy` (default 8080) and is stored in `ClusterMetadata.gateway_port`.
+ The host port is configurable via `--port` on `nemoclaw cluster admin deploy` (default 8080) and is stored in `ClusterMetadata.gateway_port`.

The TCP host is also added as an extra gateway TLS SAN so mTLS hostname validation succeeds.
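
The endpoint rules above can be sketched as a small shell function. This is an illustration only, not code from the nemoclaw CLI: `metadata_endpoint` is a hypothetical helper name, and mapping loopback TCP hosts back to `127.0.0.1` is an assumption, since the text only specifies the non-loopback case. IPv6 literals are not handled.

```bash
# Hypothetical helper (not part of nemoclaw): derive the gateway endpoint
# from a Docker host URL and a gateway port, following the rules above.
#   $1: Docker host URL (defaults to the local socket)
#   $2: gateway port (defaults to 8080)
metadata_endpoint() {
  docker_host="${1:-unix:///var/run/docker.sock}"
  port="${2:-8080}"
  case "$docker_host" in
    tcp://*)
      host="${docker_host#tcp://}"   # drop the scheme
      host="${host%%:*}"             # drop the Docker daemon port
      case "$host" in
        # Assumption: loopback TCP hosts behave like the local socket.
        localhost|127.*) host="127.0.0.1" ;;
      esac
      ;;
    *)
      # Local Docker socket (or anything that is not a TCP daemon).
      host="127.0.0.1"
      ;;
  esac
  printf 'https://%s:%s\n' "$host" "$port"
}

# Example calls: local socket, then a remote TCP daemon with a custom port.
metadata_endpoint
metadata_endpoint "tcp://10.0.0.5:2375" 9090
```

When debugging a certificate hostname mismatch, comparing this derived host with the endpoint stored in the cluster metadata is a quick first check, because the TCP host must also appear as a SAN on the gateway certificate.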

@@ -40,7 +40,7 @@ The default cluster name is `nemoclaw`. The container is `navigator-cluster-{nam
## Prerequisites

- Docker must be running (locally or on the remote host)
- - The `ncl` CLI must be available
+ - The `nemoclaw` CLI must be available
- For remote clusters: SSH access to the remote host

## Workflow
@@ -331,7 +331,7 @@ docker -H ssh://<host> logs navigator-cluster-<name>
**Setting up kubectl access** (requires tunnel):

```bash
- ncl cluster admin tunnel --name <name> --remote <host>
+ nemoclaw cluster admin tunnel --name <name> --remote <host>
# Then in another terminal:
export KUBECONFIG=~/.config/nemoclaw/clusters/<name>/kubeconfig
kubectl get pods -A
14 changes: 7 additions & 7 deletions .agents/skills/tui-development/SKILL.md
@@ -9,9 +9,9 @@ Comprehensive reference for any agent working on the Gator TUI.

## 1. Overview

- Gator is a ratatui-based terminal UI for the NemoClaw platform. It provides a keyboard-driven interface for managing clusters, sandboxes, and logs — the same operations available via the `ncl` CLI, but with a live, interactive dashboard.
+ Gator is a ratatui-based terminal UI for the NemoClaw platform. It provides a keyboard-driven interface for managing clusters, sandboxes, and logs — the same operations available via the `nemoclaw` CLI, but with a live, interactive dashboard.

- - **Launched via:** `ncl gator` or `mise run gator`
+ - **Launched via:** `nemoclaw gator` or `mise run gator`
- **Crate:** `crates/navigator-tui/`
- **Key dependencies:**
- `ratatui` (workspace version) — uses `frame.size()` (not `frame.area()`)
@@ -225,14 +225,14 @@ The `confirm_delete` flag in `App` gates destructive key handling — while true

### CLI parity

- Gator actions should parallel `ncl` CLI commands so users have familiar mental models:
+ Gator actions should parallel `nemoclaw` CLI commands so users have familiar mental models:

| CLI Command | Gator Equivalent |
| --- | --- |
- | `ncl sandbox list` | Sandbox table on Dashboard |
- | `ncl sandbox delete <name>` | `[d]` on sandbox detail, then `[y]` to confirm |
- | `ncl sandbox logs <name>` | `[l]` on sandbox detail to open log viewer |
- | `ncl cluster health` | Status in title bar + cluster list |
+ | `nemoclaw sandbox list` | Sandbox table on Dashboard |
+ | `nemoclaw sandbox delete <name>` | `[d]` on sandbox detail, then `[y]` to confirm |
+ | `nemoclaw sandbox logs <name>` | `[l]` on sandbox detail to open log viewer |
+ | `nemoclaw cluster health` | Status in title bar + cluster list |

When adding new TUI features, check what the CLI offers and maintain consistency.

2 changes: 1 addition & 1 deletion .claude/agent-memory/arch-doc-writer/MEMORY.md
@@ -63,7 +63,7 @@
- Helm chart deploys a StatefulSet (NOT Deployment), PVC 1Gi at /var/navigator
- Cluster image does NOT bundle image tarballs -- components pulled at runtime from distribution registry
- PKI job generates CA + server cert + client cert for mTLS (RSA 2048, 10yr, Helm pre-install hook)
- - Build tasks in `build/*.toml`; scripts in `build/scripts/`
+ - Build tasks in `tasks/*.toml`; scripts in `tasks/scripts/`
- `cluster-deploy-fast.sh` supports both auto mode (git diff) and explicit targets (server/sandbox/pki-job/chart/all)
- `cluster-bootstrap.sh` ensures local Docker registry on port 5000, pushes all components, then deploys
- Default values.yaml: repository is CloudFront-backed CDN, tag: "latest", pullPolicy: Always
2 changes: 1 addition & 1 deletion .env.example
@@ -11,7 +11,7 @@
# basename (e.g. "nemoclaw-c").
#CLUSTER_NAME=nemoclaw-c

- # Default cluster name used by `ncl` commands in this repo when `--cluster`
+ # Default cluster name used by `nemoclaw` commands in this repo when `--cluster`
# is not provided. Usually matches CLUSTER_NAME.
#NEMOCLAW_CLUSTER=nemoclaw-c

2 changes: 1 addition & 1 deletion .github/workflows/ci-image.yml
@@ -6,7 +6,7 @@ on:
paths:
- 'deploy/docker/Dockerfile.ci'
- 'mise.toml'
- - 'build/**'
+ - 'tasks/**'
- '.github/workflows/ci-image.yml'
workflow_dispatch:

6 changes: 3 additions & 3 deletions .github/workflows/publish.yml
@@ -120,14 +120,14 @@ jobs:
id: version
run: |
set -euo pipefail
- WHEEL_VERSION=$(uv run python build/scripts/release.py get-version --python)
+ WHEEL_VERSION=$(uv run python tasks/scripts/release.py get-version --python)
echo "wheel_version=${WHEEL_VERSION}" >> "$GITHUB_OUTPUT"

- name: Build Python wheels
run: |
set -euo pipefail
WHEEL_VERSION="${{ steps.version.outputs.wheel_version }}"
- CARGO_VERSION=$(uv run python build/scripts/release.py get-version --cargo)
+ CARGO_VERSION=$(uv run python tasks/scripts/release.py get-version --cargo)
NEMOCLAW_CARGO_VERSION="$CARGO_VERSION" mise run python:build:multiarch
NEMOCLAW_CARGO_VERSION="$CARGO_VERSION" mise run python:build:macos
ls -la target/wheels/*.whl
@@ -216,4 +216,4 @@ jobs:
run: |
set -euo pipefail
WHEEL_VERSION="${{ needs.build-python-wheels.outputs.wheel_version }}"
- uv run python build/scripts/release.py python-publish --version "$WHEEL_VERSION"
+ uv run python tasks/scripts/release.py python-publish --version "$WHEEL_VERSION"
28 changes: 14 additions & 14 deletions .gitlab-ci.yml
@@ -43,7 +43,7 @@ cache:
- key:
files:
- Cargo.lock
- - build/rust.toml
+ - tasks/rust.toml
prefix: "target-$CI_RUNNER_EXECUTABLE_ARCH"
paths:
- target/
@@ -58,9 +58,9 @@ cache:
- Cargo.lock
- crates/**/*
- proto/**/*
- - build/rust.toml
- - build/test.toml
- - build/ci.toml
+ - tasks/rust.toml
+ - tasks/test.toml
+ - tasks/ci.toml
- mise.toml
- .gitlab-ci.yml
- when: never
@@ -73,9 +73,9 @@ cache:
- python/**/*
- scripts/**/*
- proto/**/*
- - build/python.toml
- - build/test.toml
- - build/ci.toml
+ - tasks/python.toml
+ - tasks/test.toml
+ - tasks/ci.toml
- mise.toml
- .gitlab-ci.yml
- when: never
@@ -87,10 +87,10 @@ cache:
- deploy/docker/**/*
- deploy/helm/**/*
- deploy/kube/**/*
- - build/cluster.toml
- - build/docker.toml
- - build/test.toml
- - build/scripts/**/*
+ - tasks/cluster.toml
+ - tasks/docker.toml
+ - tasks/test.toml
+ - tasks/scripts/**/*
- crates/**/*
- proto/**/*
- mise.toml
@@ -119,7 +119,7 @@ build_ci_image:
- changes:
- deploy/docker/Dockerfile.ci
- mise.toml
- - build/**/*
+ - tasks/**/*
- .gitlab-ci.yml
- when: never
script:
@@ -247,8 +247,8 @@ python_e2e_sandbox_test:
- socat UNIX-LISTEN:/var/run/docker.sock,fork,reuseaddr TCP:docker:2375 &
- sleep 1
- mise run --no-prepare docker:build:cluster
- - mise run --no-prepare cluster:build
- - mise run --no-prepare test:e2e:sandbox
+ - mise run --no-prepare cluster:build:full
+ - mise run --no-prepare test:e2e

# =============================================================================
# Publish Jobs