[SECURITY] No independent LLM verification before command execution #800

@h-network

Description

Problem Statement

NemoClaw relies on OpenShell policies (filesystem + network rules) to constrain agent behavior. There is no AI-level verification of agent actions before execution — the agent decides and acts within the policy boundaries.

This means:

  • Any action within policy boundaries is automatically allowed
  • No second opinion on whether an action is appropriate
  • No distinction between "technically allowed" and "operationally safe"

Impact

An agent operating within policy boundaries can still perform harmful actions:

  • Delete all user data in writable directories (allowed by filesystem policy)
  • Send sensitive data to allowed endpoints (allowed by network policy)
  • Execute a sequence of individually-safe actions that are collectively dangerous

Proposed Design

Implement a stateless LLM safety gate that evaluates every command before execution:

  1. Separate model: Use a small, dedicated safety model (e.g., 8B parameter) distinct from the main agent
  2. Zero conversation context: The safety model sees ONLY the proposed action, not the conversation that led to it. This prevents social engineering through context buildup
  3. Binary decision: ALLOW or DENY with reasoning
  4. Runs after pattern denylist: Deterministic checks first (zero-latency), LLM gate second (for novel patterns)

Example configuration:

```yaml
safety:
  pattern_denylist: enabled    # Layer 1: deterministic, zero-latency
  llm_gate:                    # Layer 2: catches what patterns miss
    model: nvidia/llama-3.1-nemotron-safety-guard-8b-v3
    context: none              # stateless — no conversation history
    action: deny_and_kill      # on denial: abort the agent
```

The key insight: a single LLM cannot reliably judge its own actions (self-enforcement fails under adversarial conditions). A separate, stateless model with no shared context provides independent verification.
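The layered design above can be sketched in Python. All names here (`PATTERN_DENYLIST`, `llm_safety_verdict`, `gate`) are hypothetical, not NemoClaw APIs, and the LLM layer is stubbed with a placeholder heuristic so the sketch runs standalone — the real Layer 2 would call the dedicated safety model with only the proposed command as input:

```python
import re
from dataclasses import dataclass

# Layer 1: deterministic denylist, checked first at zero latency.
PATTERN_DENYLIST = [
    re.compile(r"\brm\s+-rf\s+/"),
    re.compile(r"\bcurl\b.*\|\s*(ba)?sh"),
]

@dataclass
class Verdict:
    allow: bool
    reason: str

def llm_safety_verdict(command: str) -> Verdict:
    """Stand-in for the stateless safety model.

    The real gate would send ONLY the proposed command (no conversation
    history) to a separate 8B safety model and parse an ALLOW/DENY reply.
    """
    # Placeholder heuristic so this sketch is runnable without a model.
    if "secret" in command:
        return Verdict(False, "command references sensitive data")
    return Verdict(True, "no issue found")

def gate(command: str) -> Verdict:
    # Layer 1: deterministic pattern checks.
    for pat in PATTERN_DENYLIST:
        if pat.search(command):
            return Verdict(False, f"matched denylist pattern {pat.pattern!r}")
    # Layer 2: stateless LLM gate for novel patterns.
    return llm_safety_verdict(command)

print(gate("rm -rf /").allow)        # False (caught by layer 1)
print(gate("cat secret.env").allow)  # False (caught by layer 2 stub)
print(gate("ls -la").allow)          # True
```

On a DENY from either layer, the `deny_and_kill` action would abort the agent rather than retry, so a blocked command cannot be reworded around the gate.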

References

  • NeMo Agent Toolkit PR #1605 implements a similar concept (PreToolVerifierMiddleware) for NAT workflows (see also the pending issue #1811)
  • Research demonstrates that single-model self-enforcement fails under adversarial prompting

Alternatives Considered

No response

Category

enhancement: feature

Checklist

  • I searched existing issues and this is not a duplicate
  • This is a design proposal, not a "please build this" request

Metadata

Assignees

Labels

  • enhancement: feature. Use this label to identify requests for new capabilities in NemoClaw.
  • priority: high. Important issue that should be resolved in the next release.
  • security. Something isn't secure.
