Skip to content

vitas/evidra

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

554 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Evidra

CI Release Pipeline License

Evidra — Flight recorder and reliability scoring for infrastructure AI agents

Your AI agent fixes Kubernetes. Can you prove it?

Evidra records intent, outcome, and refusal in a signed, append-only evidence chain. It shows risk before execution and reveals patterns like retry loops, drift, and escalation across agents, pipelines, and controllers.

Evidra informs, not enforces. It is the flight recorder and intelligent scoring engine.

Three Evidence Modes

Records what happened Shows risk before action Agent can decline Works with any model
Proxy Observed Yes No No Yes
Smart Prescribe Yes Yes Yes Yes
Full Prescribe Yes Yes Yes Strong models only

Proxy records silently — the agent never knows. Smart and full prescribe are explicit: the agent calls prescribe, receives risk assessment, and decides whether to proceed or decline. Smart prescribe uses 4 fields (~30 tokens); full prescribe sends the complete YAML artifact (~300 tokens) and enables drift detection.

Why Protocol Compliance Matters

In benchmarks across 5 models and 33 scenarios, agents that follow the prescribe/report protocol don't just record evidence — they make better decisions.

The protocol has a cost: every failed attempt requires a prescribe/report pair (~2 extra turns). Agents that brute-force retries burn through their turn budget. Agents that diagnose first and apply once succeed with the same turn budget.

In one scenario, GPT-5.2 retried a broken manifest 3 times in smart mode (6 turns on protocol for failed attempts) and ran out of turns. Claude Sonnet 4 read the manifest, caught the namespace mismatch, fixed it, and applied once — same protocol, zero wasted turns.

The protocol doesn't slow good agents down. It reveals which agents think before acting. That's exactly the signal you want in production infrastructure.

The Prescribe/Report Protocol

Every infrastructure mutation follows the same lifecycle:

prescribe  →  record intent, risk assessment, canonical form
execute    →  run the command (or decline to act)
report     →  record verdict, exit code, or refusal reason

prescribe_full and prescribe_smart capture intent before the command runs. prescribe_full records the artifact, its canonical form, digests, the per-source risk_inputs panel, and the rolled-up effective_risk. prescribe_smart records lightweight target context when artifact bytes are not available. report captures what actually happened — success, failure, or an explicit decision not to act, with structured context for each.

The evidence chain links prescriptions to reports through signed entries with hash chaining. Every entry is timestamped, actor-attributed, and cryptographically verifiable. Evidence cannot be modified after the fact.

When an agent decides not to execute — because risk is too high, because the operation looks wrong — that decision is a first-class evidence entry with trigger and reason. Not a silent gap in the log.

What You Get

Evidra is one platform with three operating surfaces:

Surface What it does
evidra CLI Wraps live commands, imports completed operations, computes scorecards
evidra-mcp Exposes the prescribe/report protocol to MCP-connected agents and runtimes
Self-hosted API Centralizes evidence across agents, pipelines, and controllers, and provides team-wide analytics

From the evidence chain, Evidra computes:

  • Risk classification at operation time — risk_inputs, effective_risk, canonical action digest
  • Behavioral signals — protocol violations, retry loops, blast radius detection
  • Reliability scorecards — score, band, and confidence for comparing agents, sessions, and time windows

Evidra does not replace OTel, Datadog, or Logfire. They record execution telemetry. Evidra records what they cannot: intent before execution, structured decisions, and behavioral patterns across the agent lifecycle.

CLI and MCP are the authoritative analytics surfaces today.

Fastest Path

Install

# Homebrew
brew install samebits/tap/evidra

# Binary release (Linux/macOS)
curl -fsSL https://github.com/samebits/evidra/releases/latest/download/evidra_$(uname -s | tr '[:upper:]' '[:lower:]')_$(uname -m | sed 's/x86_64/amd64/;s/aarch64/arm64/').tar.gz \
  | tar -xz -C /usr/local/bin evidra

# Build from source
make build

Record One Operation

evidra keygen
export EVIDRA_SIGNING_KEY=<base64>

evidra record -f deploy.yaml -- kubectl apply -f deploy.yaml

For local smoke runs without a signing key:

export EVIDRA_SIGNING_MODE=optional

The output includes: risk_inputs, effective_risk, score, score_band, signal_summary, basis, and confidence.

See The Scorecard

evidra scorecard --period 30d
evidra explain --period 30d

Security boundary: evidra record executes the wrapped local command directly. Evidra does not sandbox the command. Treat it with the same trust model as direct shell execution — Evidra records evidence around the command, not contain it.

For AI Agents (MCP)

Evidra speaks MCP. The MCP server exposes the prescribe/report protocol to any MCP-connected agent or runtime.

evidra-mcp --evidence-dir ~/.evidra/evidence

The MCP server gives agents the tools. The skill teaches them when and how to use them — agents with the skill achieve 100% protocol compliance for infrastructure mutations.

evidra skill install

How the protocol looks from the agent's perspective:

Agent: "I need to kubectl apply this deployment"
  → prescribe_smart(tool=kubectl, operation=apply, resource=deployment/web, namespace=default)
  ← prescription_id, effective_risk=medium, risk_inputs=[{source=evidra/matrix, ...}]

Agent: decides to proceed (or decline based on risk)
  → executes kubectl apply
  → report(prescription_id=..., verdict=success, exit_code=0)
  ← score=95, score_band=excellent, signal_summary={...}

If the agent decides not to act:

Agent: "Risk too high, declining"
  → report(prescription_id=..., verdict=declined, decision_context={
      trigger: "risk_threshold_exceeded",
      reason: "privileged container in production"
    })

Declined verdicts are first-class evidence — not silent gaps in the log.

Proxy Observed — one config line, zero agent changes:

{
  "mcpServers": {
    "infra": {
      "command": "evidra-mcp",
      "args": ["--proxy", "--", "npx", "-y", "@anthropic/mcp-server-kubernetes"]
    }
  }
}

References: MCP setup guide · Skill setup guide · Execution schemas

For CI/CD Pipelines

The prescribe/report protocol also works without MCP. Two CLI modes feed the same lifecycle and scoring engine:

evidra record wraps a live command and records the full prescribe/execute/report lifecycle in one step. evidra import ingests a completed operation from structured input for pipelines that manage execution separately.

# Wrap a live command
evidra record -f deploy.yaml -- kubectl apply -f deploy.yaml

# Import a completed operation
evidra import --input record.json

Additional workflows: prescribe, report, scorecard, explain, compare, validate, import-findings.

References: CLI reference · Record/Import contract

For Platform Teams (Self-Hosted)

Run the Evidra backend to centralize evidence collection across agents, pipelines, and GitOps controllers, and get team-wide analytics. Argo CD is controller-first in v1; webhook ingestion remains supported, but it is not the only GitOps path.

export EVIDRA_API_KEY=my-secret-key
docker compose up --build -d
curl http://localhost:8080/healthz

The CLI forwards evidence to the backend:

evidra record --url http://localhost:8080 --api-key my-secret-key \
  -f deploy.yaml -- kubectl apply -f deploy.yaml

With centralized evidence, platform teams can compare reliability across agents, pipelines, and controllers, detect fleet-wide patterns, and answer questions like: which agents have incomplete prescribe/report pairs this week? Which controller workflows are retrying the same reconciliation? Which actor has the highest retry loop rate?

References: Self-hosted setup · Argo CD GitOps integration · API reference · Setup Evidra Action · Terraform CI quickstart

Supported Tools

Built-in adapters canonicalize artifacts across infrastructure tools into a normalized CanonicalAction model, enabling cross-tool comparison in a single evidence chain:

  • Kubernetes-family YAML via kubectl, helm, kustomize, and oc
  • Terraform plan JSON via terraform show -json
  • Docker/container inspect JSON
  • Generic fallback ingestion for unsupported tools

Full support details: Supported tools

Behavioral Signals

The evidence chain's prescribe/report structure makes agent behavior patterns visible without external instrumentation. Three signals fire immediately in real operations:

protocol_violation — a prescribe without a matching report (agent crashed, timed out, or skipped the protocol), a report without a prior prescribe (unauthorized action), duplicate reports, or cross-actor reports. This is the most operationally immediate signal — it fires whenever the protocol is broken.

retry_loop — the same intent retried multiple times within a window, typically after failures. Indicates an agent stuck in a retry cycle. Fires when the same intent digest appears 3+ times in 30 minutes with prior failures.

blast_radius — a destroy operation affecting more than 5 resources. Indicates a potentially high-impact deletion that warrants review.

Additional signals (artifact_drift, new_scope, repair_loop, thrashing, risk_escalation) contribute to scoring and mature as evidence accumulates. All eight are documented in the Signal specification.

Scoring details: Scoring model · Default profile rationale

Docs Map

Architecture and protocol:

Integration and operations:

Developer references:

Development

make build
make test
make e2e
make test-contracts
make test-mcp-inspector
make lint
make test-signals

License

Licensed under the Apache License 2.0.

About

Flight recorder for AI infrastructure agents. The prescribe/report protocol captures intent before execution and outcome after — in a signed, tamper-evident evidence chain. Detects behavioral patterns. Computes reliability scorecards. Speaks MCP.

Topics

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors