Add NemoClaw sandbox + LiteLLM proxy integration #2

Open
kosaku-sim wants to merge 18 commits into main from feature/nemoclaw-litellm-integration

kosaku-sim commented Mar 23, 2026

Overview

Integrates the NVIDIA NemoClaw (OpenShell) sandbox and a LiteLLM proxy into the Linux CloudFormation template. The OpenClaw agent runs inside a sandbox isolated at the kernel level and reaches Amazon Bedrock through OpenShell's managed inference proxy.

OpenClaw (22GB) is not installed on the host. Only the copy built into the sandbox image is used, which keeps disk usage to about 5GB.

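Comparison of EnableSandbox=false (previous) and EnableSandbox=true (this PR)

| Item | EnableSandbox=false (previous) | EnableSandbox=true (this PR) |
| --- | --- | --- |
| OpenClaw location | Installed directly on the host (npm, 22GB) | Built into the sandbox image (no host install needed, ~5GB) |
| Code execution isolation | OpenClaw's built-in Docker sandbox (application level) | Landlock + seccomp + Network Namespace (kernel level) |
| Network isolation | Docker network (the Internet is generally reachable) | Everything blocked by default; only https://inference.local is allowed |
| Filesystem isolation | Docker volume mount | Landlock LSM; only /sandbox/tmp is writable |
| API key protection | Present in environment variables inside the container | Not present in the sandbox (the OpenShell proxy injects it on the host side) |
| LLM access path | OpenClaw → Bedrock (direct) | OpenClaw → inference.local → LiteLLM → Bedrock |
| Disk usage | ~25GB (npm packages including Chromium) | ~5GB (sandbox image + LiteLLM venv) |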

The previous Docker sandbox is an OpenClaw application feature that runs the agent's code inside a Docker container. The NemoClaw sandbox is different: it confines OpenClaw itself to an environment isolated at the kernel level, so Docker-in-Docker is not needed.

Architecture

Browser → SSM Port Forward → host:18789 (SSH LocalForward)
  → Sandbox:18789 (OpenClaw Gateway)
  → https://inference.local (OpenShell managed inference proxy)
  → host.openshell.internal:4000 (LiteLLM on host)
  → Amazon Bedrock

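Security model (three layers of Linux kernel isolation)

  • Landlock LSM: filesystem access control (only /sandbox/tmp is writable)
  • seccomp-BPF: blocks dangerous system calls
  • Network Namespace: blocks all outbound traffic by default; only https://inference.local is allowed
  • Credential injection: the API key is held by the OpenShell proxy on the host side and does not exist inside the sandbox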

Changes

scripts/setup-nemoclaw-litellm.sh: complete rewrite

A 10-step automation script:

  1. Set inotify limits (required for stable k3s operation)
  2. Install the LiteLLM proxy and register it as a systemd service
  3. Install the OpenShell CLI (via the NemoClaw installer)
  4. Start the OpenShell gateway
  5. Fix the IP for host.openshell.internal (dynamically detect the Docker network gateway IP)
  6. Register LiteLLM as an OpenAI-compatible provider and configure the inference route
  7. Create the sandbox policy (network_policies allows host.openshell.internal:4000)
  8. Create the sandbox, patch the CRD hostAliases, and recreate the Pod
  9. Deliver the OpenClaw config over SSH (using https://inference.local/v1 as baseUrl) and start the gateway
  10. Register the SSH LocalForward as a systemd service (host:18789 → sandbox:18789)
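
Step 5 could be sketched roughly as below. This is a minimal illustration, not the code in the script: the function name `detect_gateway_ip` and the network name in the usage note are hypothetical, and only `docker network inspect --format` with a Go template is standard Docker CLI behavior.

```shell
#!/bin/sh
# Hypothetical helper: print the gateway IP of a Docker network.
# The caller supplies the network name; list real names with `docker network ls`.
detect_gateway_ip() {
    network_name="$1"
    # .IPAM.Config is a list, so print every Gateway field and keep the first one.
    gateway_ip=$(docker network inspect \
        --format '{{range .IPAM.Config}}{{.Gateway}} {{end}}' \
        "$network_name" | awk '{ print $1 }')
    if [ -z "$gateway_ip" ]; then
        echo "no gateway found for network: $network_name" >&2
        return 1
    fi
    printf '%s\n' "$gateway_ip"
}
```

A call such as `detect_gateway_ip some-cluster-network` (the name is an assumption) would yield the address to put into the hostAliases patch from step 8, instead of the hard-coded docker0 address.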

clawdbot-bedrock.yaml: CloudFormation template

  • EnableSandbox parameter (default: true) controls NemoClaw + LiteLLM
  • Sets inotify limits automatically in sandbox mode
  • Fallback handling: fixed to restart the openshell-forward service
  • Sandbox name standardized to openclaw
  • SOUL.md is delivered over SSH in sandbox mode
  • Dashboard URL format: changed from ?token= to #token=
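
The dashboard URL change can be illustrated with a small helper. This is a sketch under assumptions: the function name `dashboard_url` and the exact path layout are hypothetical. The idea is that a fragment (`#token=`) stays in the browser and is carried across a redirect, while a query string is dropped if the redirect target does not repeat it.

```shell
#!/bin/sh
# Hypothetical helper: build a dashboard URL with the token in the fragment.
dashboard_url() {
    base_url="$1"   # e.g. http://localhost:18789
    token="$2"
    # Strip one trailing slash (if present) so exactly one precedes the '#'.
    printf '%s/#token=%s\n' "${base_url%/}" "$token"
}
```

For example, `dashboard_url http://localhost:18789 "$GATEWAY_TOKEN"` would print the link to open after the port forward is up, where `GATEWAY_TOKEN` is a placeholder variable.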

Documentation

  • SECURITY.md: security documentation for the NemoClaw + LiteLLM architecture
  • TROUBLESHOOTING.md: added a NemoClaw/LiteLLM troubleshooting section
  • README.md: updated the architecture diagram and parameter table
  • DEPLOYMENT.md: added NemoClaw/LiteLLM verification steps

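Technical issues resolved

| Issue | Cause | Fix |
| --- | --- | --- |
| LLM request timeout | The sandbox cannot reach LiteLLM directly (the HTTP proxy returns 403) | Route through https://inference.local (managed inference proxy) |
| host.openshell.internal unreachable | OpenShell sets docker0 (172.17.0.1), but the k3s cluster is on a different Docker network | Detect the Docker network gateway IP dynamically and patch the CRD hostAliases |
| device token mismatch | Insufficient permissions on the .openclaw/identity directory inside the sandbox | Create the directory from inside over SSH (avoids the overlayfs cache problem) |
| gateway token missing | ?token= is lost on redirect | Switch to the #token= (fragment) form |
| k3s namespace not ready | k3s exhausts the inotify instance limit (default 128) | Set fs.inotify.max_user_instances=512 |
| OpenClaw config invalid | The 2026.3.11 schema differs (provider, auth.mode, etc. are invalid) | Use the correct schema (gateway/models.providers/agents.defaults) |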
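
The inotify fix in the table could be applied with a sysctl drop-in along these lines. This is a sketch, not the script's actual code: the helper name `write_inotify_conf` and the file name `99-inotify.conf` are assumptions, and only the key `fs.inotify.max_user_instances=512` comes from the PR.

```shell
#!/bin/sh
# Hypothetical helper: write a sysctl drop-in raising the inotify instance limit.
write_inotify_conf() {
    conf_dir="$1"                            # /etc/sysctl.d on a real host
    conf_file="$conf_dir/99-inotify.conf"    # file name is an assumption
    printf 'fs.inotify.max_user_instances=512\n' > "$conf_file"
    # Print the path so the caller can log or inspect it.
    printf '%s\n' "$conf_file"
}

# On a real host, as root:
#   write_inotify_conf /etc/sysctl.d && sysctl --system
```

Writing a drop-in rather than calling `sysctl -w` alone keeps the setting across reboots, which matters because k3s restarts with the instance.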

Test plan

  • Verify the LiteLLM → Bedrock connection via https://inference.local
  • Verify Japanese chat in the Control UI (Health OK, Version 2026.3.11)
  • Verify that the OpenClaw gateway starts inside the sandbox (port 18789)
  • Deploy a new CloudFormation stack with EnableSandbox=true (end to end)
  • Confirm the existing direct Bedrock connection flow still works with EnableSandbox=false
  • Verify Web UI access via SSM port forwarding

Closes #1

🤖 Generated with Claude Code

kosaku-sim and others added 18 commits March 23, 2026 20:33
… AI execution

Integrates NVIDIA NemoClaw (OpenShell) and LiteLLM into the Linux CloudFormation
template to provide defense-in-depth isolation: OpenClaw runs inside a
network-restricted sandbox that can only reach the LiteLLM proxy on localhost:4000,
which proxies all model requests to Amazon Bedrock via IAM role.

Closes #1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace heredoc configs with python/printf generation, remove comments
and blank lines from UserData to reduce from ~40KB to ~25KB base64.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move inline NemoClaw install, LiteLLM config, network policy, and
sandbox gateway service code from CloudFormation UserData to external
script (scripts/setup-nemoclaw-litellm.sh). UserData now downloads and
executes the script when EnableSandbox=true, reducing raw size from
~21KB to ~12KB (well within the 16KB EC2 limit).
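
A pre-deploy check for the limit mentioned above might look like the following. It is a minimal sketch: the helper name `check_userdata_size` is hypothetical, and it assumes the UserData has already been rendered to a plain file, since the limit applies to the raw form before base64 encoding.

```shell
#!/bin/sh
# EC2 limits user data to 16 KB in raw form, before base64 encoding.
USERDATA_LIMIT_BYTES=16384

# Hypothetical helper: succeed only if the rendered UserData file fits the limit.
check_userdata_size() {
    userdata_file="$1"
    # Some wc implementations pad the count with spaces; strip them.
    size_bytes=$(wc -c < "$userdata_file" | tr -d ' ')
    if [ "$size_bytes" -gt "$USERDATA_LIMIT_BYTES" ]; then
        echo "too large: $size_bytes bytes (limit $USERDATA_LIMIT_BYTES)" >&2
        return 1
    fi
    echo "ok: $size_bytes bytes"
}
```

Running such a check in CI would catch a regression toward the earlier ~21KB size before a stack deployment fails.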

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This script is downloaded and executed by UserData when
EnableSandbox=true. Contains LiteLLM proxy install/config,
NemoClaw sandbox setup, network policy, and systemd services.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
raw.githubusercontent.com requires %2F for branch names containing
slashes (feature/nemoclaw-litellm-integration).
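
The encoding described in this commit can be sketched as below. The helper names and the owner and repository values in the usage note are placeholders, not taken from the PR. Only the branch name and script path appear in the source.

```shell
#!/bin/sh
# Hypothetical helper: percent-encode slashes in a branch name.
encode_branch() {
    printf '%s\n' "$1" | sed 's|/|%2F|g'
}

# Hypothetical helper: build a raw.githubusercontent.com download URL.
raw_url() {
    owner="$1"; repo="$2"; branch="$3"; file_path="$4"
    printf 'https://raw.githubusercontent.com/%s/%s/%s/%s\n' \
        "$owner" "$repo" "$(encode_branch "$branch")" "$file_path"
}
```

For instance, `raw_url example-owner example-repo feature/nemoclaw-litellm-integration scripts/setup-nemoclaw-litellm.sh` yields a URL whose branch segment reads `feature%2Fnemoclaw-litellm-integration`.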

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NemoClaw installer (nc.sh) requires HOME to be set. CloudFormation
UserData runs as root without HOME exported, causing an 'unbound variable'
error (the message bash prints under set -u, not set -e).
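
One way to guard against this is sketched below. The helper name `ensure_home` is hypothetical, and `/root` as the fallback is an assumption based on UserData running as root. The "unbound variable" message is the one a shell prints when an unset variable is expanded under `set -u`.

```shell
#!/bin/sh
# Hypothetical helper: make sure HOME is set and exported before running an installer.
ensure_home() {
    # ${HOME:-/root} expands safely even when HOME is unset and `set -u` is active.
    HOME="${HOME:-/root}"
    export HOME
}
```

Calling `ensure_home` at the top of the setup script, before the installer runs, leaves an existing HOME untouched and only fills in the default when it is missing.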

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use `nemoclaw onboard --non-interactive` instead of manual sandbox creation
- Register LiteLLM as OpenShell provider via `openshell provider create`
- Set inference route via `openshell inference set` (not config file)
- Use `host.openshell.internal` for sandbox-to-host LiteLLM access
- Bind LiteLLM on 0.0.0.0 so sandbox can reach it
- Add persistent port forward via systemd wrapper service
- Pre-stage OpenClaw config with allowedOrigins for SSM access
- Update fallback restart to use openshell-forward service

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…all steps

- Move OpenClaw config writing BEFORE nc.sh install (onboard copies it)
- Remove explicit `nemoclaw onboard` (nc.sh --non-interactive does it)
- Add /root/.local/bin to PATH after NemoClaw install
- Add PATH to systemd service environment
- Fix sandbox name detection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove set -e (installer stops at [4/7] without NVIDIA_API_KEY, expected)
- Find NVM-installed node and add to PATH after install
- Wait for sandbox to become ready before configuring provider
- Add comments explaining NIM API key skip behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The NemoClaw installer retry (||) re-runs onboard which recreates the
gateway and destroys the existing sandbox. Run once only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rol UI

NemoClaw sandbox handles agent execution (messaging, CLI, tools).
Host OpenClaw gateway serves Control UI on port 18789 (auth=none).
This avoids the device identity bug in NemoClaw's OpenClaw 2026.3.11.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Change api from "openai" to "openai-completions" (valid enum value)
- Remove invalid "auth":"none" from provider config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OpenClaw npm package alone uses ~22GB (includes Chromium, Control UI,
plugins). Combined with NemoClaw Docker images and LiteLLM, 30GB is
insufficient and causes disk full issues.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Skip Node.js/npm install on host when EnableSandbox=true (saves 22GB)
- Update OpenClaw inside NemoClaw sandbox to latest (fixes device identity bug)
- Patch sandbox config to auth.mode=none via overlayfs
- Set up persistent openshell forward for port 18789
- Run messaging plugin enablement inside sandbox
- Revert EBS to 30GB (sufficient without host npm install)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ture

Replace the NemoClaw onboard-dependent setup with a direct openshell CLI
workflow that uses the managed inference proxy (https://inference.local).

Setup script changes:
- Use openshell gateway/provider/inference/sandbox commands directly
- Route LLM requests via https://inference.local (bypasses sandbox proxy)
- Fix host.openshell.internal IP (detect Docker network gateway dynamically)
- Patch Sandbox CRD hostAliases for correct host resolution
- Deliver config via SSH tee (replaces brittle overlayfs patching)
- SSH LocalForward systemd service (replaces openshell forward)
- Add inotify sysctl limits (prevents k3s "too many open files" crash)

CloudFormation changes:
- Add inotify limits before Docker install in sandbox mode
- Fix fallback to restart openshell-forward service
- Standardize sandbox name to "openclaw"
- Deliver SOUL.md via SSH in sandbox mode
- Fix dashboard URL format (?token= -> #token=)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NemoClaw + LiteLLM setup takes longer than 20 minutes due to:
- apt-get upgrade
- LiteLLM pip install
- NemoClaw installer + Docker image pulls
- OpenShell gateway + sandbox creation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three issues found during end-to-end CloudFormation deploy test:

1. openshell sandbox create --no-tty hangs on SSH session
   → Run in background, wait for Ready status, then kill

2. Port 18789 conflict: docker-proxy (openshell gateway) occupies it
   → SSH LocalForward binds on 18790 instead
   → SSM port forward targets 18790, maps to local 18789

3. openshell-forward service (User=ubuntu) can't find gateway metadata
   → Run as root with HOME=/root and PATH including nvm node
   → SSH config placed in /root/.ssh/ (not ubuntu's)

Also fix all SSH commands to run as root (consistent with setup context).
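
The port mapping from item 2 could be driven from a workstation roughly as follows. This is a sketch: the wrapper names are hypothetical and the instance ID is supplied by the caller. The `AWS-StartPortForwardingSession` document and its `portNumber` and `localPortNumber` parameters are standard AWS Systems Manager features.

```shell
#!/bin/sh
# Build the --parameters JSON for the AWS-StartPortForwardingSession document.
ssm_forward_params() {
    remote_port="$1"   # port on the instance (the SSH LocalForward listener)
    local_port="$2"    # port on the workstation
    printf '{"portNumber":["%s"],"localPortNumber":["%s"]}\n' \
        "$remote_port" "$local_port"
}

# Hypothetical wrapper: open the tunnel so the Web UI is reachable on localhost:18789.
start_dashboard_tunnel() {
    instance_id="$1"
    aws ssm start-session \
        --target "$instance_id" \
        --document-name AWS-StartPortForwardingSession \
        --parameters "$(ssm_forward_params 18790 18789)"
}
```

With the tunnel open, the browser still uses port 18789 locally, so existing dashboard links keep working even though the host side moved to 18790.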

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

NemoClaw + LiteLLM integration: building a secure Bedrock environment for organizations