Open
Labels
enhancement: feature · priority: high · security
Description
Problem Statement
NemoClaw relies on OpenShell policies (filesystem + network rules) to constrain agent behavior. There is no AI-level verification of agent actions before execution — the agent decides and acts within the policy boundaries.
This means:
- Any action within policy boundaries is automatically allowed
- No second opinion on whether an action is appropriate
- No distinction between "technically allowed" and "operationally safe"
Impact
An agent operating within policy boundaries can still perform harmful actions:
- Delete all user data in writable directories (allowed by filesystem policy)
- Send sensitive data to allowed endpoints (allowed by network policy)
- Execute a sequence of individually-safe actions that are collectively dangerous
Proposed Design
Implement a stateless LLM safety gate that evaluates every command before execution:
- Separate model: Use a small, dedicated safety model (e.g., 8B parameter) distinct from the main agent
- Zero conversation context: The safety model sees ONLY the proposed action, not the conversation that led to it. This prevents social engineering through context buildup
- Binary decision: ALLOW or DENY with reasoning
- Runs after pattern denylist: Deterministic checks first (zero-latency), LLM gate second (for novel patterns)
```yaml
safety:
  pattern_denylist: enabled   # Layer 1: deterministic, zero-latency
  llm_gate:                   # Layer 2: catches what patterns miss
    model: nvidia/llama-3.1-nemotron-safety-guard-8b-v3
    context: none             # stateless; no conversation history
    action: deny_and_kill     # on denial: abort the agent
```

The key insight: a single LLM cannot reliably judge its own actions (self-enforcement fails under adversarial conditions). A separate, stateless model with no shared context provides independent verification.
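To make the layering concrete, here is a minimal Python sketch of the two-layer evaluation flow described above. All names (`PATTERN_DENYLIST`, `llm_gate_allows`, `SafetyDecision`, `evaluate`) are hypothetical, not NemoClaw APIs, and the LLM call is stubbed; the point is the ordering (deterministic denylist first, stateless model second) and that the gate receives only the action string, never conversation history.

```python
# Hypothetical sketch of the proposed two-layer safety gate.
# Names and patterns are illustrative, not part of NemoClaw.
import re
from dataclasses import dataclass

# Layer 1: deterministic denylist (zero latency). Example patterns only.
PATTERN_DENYLIST = [
    re.compile(r"rm\s+-rf\s+/"),      # destructive filesystem command
    re.compile(r"curl\s+.*\|\s*sh"),  # piping a remote script into a shell
]

@dataclass
class SafetyDecision:
    allowed: bool
    layer: str   # which layer produced the decision
    reason: str

def llm_gate_allows(action: str) -> tuple[bool, str]:
    """Layer 2: stateless LLM judgment. The safety model would see ONLY
    the proposed action string -- no conversation history -- so social
    engineering via context buildup cannot influence it. Stubbed here;
    a real deployment would call the dedicated 8B safety model."""
    return True, "no hazard detected (stub)"

def evaluate(action: str) -> SafetyDecision:
    # Layer 1 runs first: deterministic, zero-latency.
    for pattern in PATTERN_DENYLIST:
        if pattern.search(action):
            return SafetyDecision(False, "pattern_denylist",
                                  f"matched {pattern.pattern!r}")
    # Layer 2 catches novel patterns the denylist misses.
    ok, reason = llm_gate_allows(action)
    return SafetyDecision(ok, "llm_gate", reason)
```

On denial, the `deny_and_kill` action in the config would abort the agent rather than merely skip the command; the sketch only returns the decision so the enforcement policy stays separate from the evaluation logic.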
References
- NeMo Agent Toolkit PR #1605 implements a similar concept (PreToolVerifierMiddleware) for NAT workflows (awaiting issue #1811)
- Research demonstrates that single-model self-enforcement fails under adversarial prompting
Alternatives Considered
No response
Category
enhancement: feature
Checklist
- I searched existing issues and this is not a duplicate
- This is a design proposal, not a "please build this" request