[SECURITY] No independent LLM verification before command execution #800

@h-network

Description

Problem Statement

NemoClaw relies on OpenShell policies (filesystem + network rules) to constrain agent behavior. There is no AI-level verification of agent actions before execution — the agent decides and acts within the policy boundaries.

This means:

  • Any action within policy boundaries is automatically allowed
  • No second opinion on whether an action is appropriate
  • No distinction between "technically allowed" and "operationally safe"

Impact

An agent operating within policy boundaries can still perform harmful actions:

  • Delete all user data in writable directories (allowed by filesystem policy)
  • Send sensitive data to allowed endpoints (allowed by network policy)
  • Execute a sequence of individually-safe actions that are collectively dangerous

Proposed Design

Implement a stateless LLM safety gate that evaluates every command before execution:

  1. Separate model: Use a small, dedicated safety model (e.g., 8B parameter) distinct from the main agent
  2. Zero conversation context: The safety model sees ONLY the proposed action, not the conversation that led to it. This prevents social engineering through context buildup
  3. Binary decision: ALLOW or DENY with reasoning
  4. Runs after pattern denylist: Deterministic checks first (zero-latency), LLM gate second (for novel patterns)

Example configuration:

```yaml
safety:
  pattern_denylist: enabled    # Layer 1: deterministic, zero-latency
  llm_gate:                    # Layer 2: catches what patterns miss
    model: nvidia/llama-3.1-nemotron-safety-guard-8b-v3
    context: none              # stateless — no conversation history
    action: deny_and_kill      # on denial: abort the agent
```

The key insight: a single LLM cannot reliably judge its own actions (self-enforcement fails under adversarial conditions). A separate, stateless model with no shared context provides independent verification.
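The layered design above can be sketched in Python. All names here (`PATTERN_DENYLIST`, `llm_safety_verdict`, `gate`) are hypothetical, not NemoClaw APIs, and the LLM layer is stubbed with a placeholder heuristic so the sketch runs standalone — the real Layer 2 would call the dedicated safety model with only the proposed command as input:

```python
import re
from dataclasses import dataclass

# Layer 1: deterministic denylist, checked first at zero latency.
PATTERN_DENYLIST = [
    re.compile(r"\brm\s+-rf\s+/"),
    re.compile(r"\bcurl\b.*\|\s*(ba)?sh"),
]

@dataclass
class Verdict:
    allow: bool
    reason: str

def llm_safety_verdict(command: str) -> Verdict:
    """Stand-in for the stateless safety model.

    The real gate would send ONLY the proposed command (no conversation
    history) to a separate 8B safety model and parse an ALLOW/DENY reply.
    """
    # Placeholder heuristic so this sketch is runnable without a model.
    if "secret" in command:
        return Verdict(False, "command references sensitive data")
    return Verdict(True, "no issue found")

def gate(command: str) -> Verdict:
    # Layer 1: deterministic pattern checks.
    for pat in PATTERN_DENYLIST:
        if pat.search(command):
            return Verdict(False, f"matched denylist pattern {pat.pattern!r}")
    # Layer 2: stateless LLM gate for novel patterns.
    return llm_safety_verdict(command)

print(gate("rm -rf /").allow)        # False (caught by layer 1)
print(gate("cat secret.env").allow)  # False (caught by layer 2 stub)
print(gate("ls -la").allow)          # True
```

On a DENY from either layer, the `deny_and_kill` action would abort the agent rather than retry, so a blocked command cannot be reworded around the gate.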

References

  • NeMo Agent Toolkit PR #1605 implements a similar concept (PreToolVerifierMiddleware) for NAT workflows (see also the pending issue #1811)
  • Research demonstrates that single-model self-enforcement fails under adversarial prompting

Alternatives Considered

No response

Category

enhancement: feature

Checklist

  • I searched existing issues and this is not a duplicate
  • This is a design proposal, not a "please build this" request

Metadata

Assignees

Labels

  • enhancement: feature. Use this label to identify requests for new capabilities in NemoClaw.
  • priority: high. Important issue that should be resolved in the next release.
  • security. Something isn't secure.
