Evidra — Flight recorder and reliability scoring for infrastructure AI agents
Your AI agent fixes Kubernetes. Can you prove it?
Evidra records intent, outcome, and refusal in a signed, append-only evidence chain. It shows risk before execution and reveals patterns like retry loops, drift, and escalation across agents, pipelines, and controllers.
Evidra informs; it does not enforce. It is the flight recorder and intelligent scoring engine.
| | Records what happened | Shows risk before action | Agent can decline | Works with any model |
|---|---|---|---|---|
| Proxy Observed | Yes | No | No | Yes |
| Smart Prescribe | Yes | Yes | Yes | Yes |
| Full Prescribe | Yes | Yes | Yes | Strong models only |
Proxy records silently — the agent never knows. Smart and full prescribe are explicit: the agent calls prescribe, receives risk assessment, and decides whether to proceed or decline. Smart prescribe uses 4 fields (~30 tokens); full prescribe sends the complete YAML artifact (~300 tokens) and enables drift detection.
In benchmarks across 5 models and 33 scenarios, agents that follow the prescribe/report protocol don't just record evidence — they make better decisions.
The protocol has a cost: every failed attempt requires a prescribe/report pair (~2 extra turns). Agents that brute-force retries burn through their turn budget. Agents that diagnose first and apply once succeed with the same turn budget.
In one scenario, GPT-5.2 retried a broken manifest 3 times in smart mode (6 turns on protocol for failed attempts) and ran out of turns. Claude Sonnet 4 read the manifest, caught the namespace mismatch, fixed it, and applied once — same protocol, zero wasted turns.
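The turn cost described above can be worked through as simple arithmetic. This is only an illustrative sketch: the constant is taken from the "~2 extra turns" figure, while the function names and the turn budget are hypothetical and not part of Evidra.

```python
# Illustrative arithmetic only. The constant comes from the "~2 extra turns"
# figure above; the function names and any budget value are hypothetical.
PROTOCOL_TURNS_PER_FAILED_ATTEMPT = 2  # one prescribe call + one report call


def protocol_overhead(failed_attempts: int) -> int:
    """Turns spent on prescribe/report pairs for attempts that failed."""
    return failed_attempts * PROTOCOL_TURNS_PER_FAILED_ATTEMPT


def turns_left(budget: int, failed_attempts: int) -> int:
    """Turns remaining after protocol overhead, floored at zero."""
    return max(budget - protocol_overhead(failed_attempts), 0)


print(protocol_overhead(3))  # 6, matching the three-retry case above
print(protocol_overhead(0))  # 0 when the first apply succeeds
```

With a hypothetical budget of 10 turns, three failed retries leave 4 turns for everything else, which is why diagnosing before applying matters more as the budget shrinks.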
The protocol doesn't slow good agents down. It reveals which agents think before acting. That's exactly the signal you want in production infrastructure.
Every infrastructure mutation follows the same lifecycle:
prescribe → record intent, risk assessment, canonical form
execute → run the command (or decline to act)
report → record verdict, exit code, or refusal reason
prescribe_full and prescribe_smart capture intent before the command runs. prescribe_full records the artifact, its canonical form, digests, the per-source risk_inputs panel, and the rolled-up effective_risk. prescribe_smart records lightweight target context when artifact bytes are not available. report captures what actually happened — success, failure, or an explicit decision not to act, with structured context for each.
The evidence chain links prescriptions to reports through signed entries with hash chaining. Every entry is timestamped, actor-attributed, and cryptographically verifiable. Evidence cannot be modified after the fact.
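To make the hash-chaining idea concrete, here is a minimal sketch of an append-only log in which every entry commits to the hash of the previous one. It is not Evidra's implementation: the entry layout, the function names, and the use of HMAC-SHA256 as a stand-in for the real signature scheme are all assumptions made for illustration.

```python
import hashlib
import hmac
import json
import time

GENESIS_HASH = "0" * 64  # assumed placeholder for the first entry's prev_hash
BODY_FIELDS = ("actor", "payload", "prev_hash", "timestamp")  # hypothetical layout


def _canonical(obj) -> bytes:
    # Stable serialization so the same entry always hashes to the same value.
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()


def append_entry(chain, key: bytes, actor: str, payload: dict, timestamp=None):
    """Append an entry that commits to the previous entry's hash."""
    body = {
        "actor": actor,
        "payload": payload,
        "prev_hash": chain[-1]["hash"] if chain else GENESIS_HASH,
        "timestamp": time.time() if timestamp is None else timestamp,
    }
    entry_hash = hashlib.sha256(_canonical(body)).hexdigest()
    # HMAC stands in for a real signature here (assumption).
    signature = hmac.new(key, entry_hash.encode(), hashlib.sha256).hexdigest()
    entry = {**body, "hash": entry_hash, "signature": signature}
    chain.append(entry)
    return entry


def verify_chain(chain, key: bytes) -> bool:
    """Recompute every hash and signature; any edit or reordering fails."""
    prev_hash = GENESIS_HASH
    for entry in chain:
        body = {field: entry[field] for field in BODY_FIELDS}
        expected_hash = hashlib.sha256(_canonical(body)).hexdigest()
        expected_sig = hmac.new(key, expected_hash.encode(), hashlib.sha256).hexdigest()
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected_hash:
            return False
        if not hmac.compare_digest(entry["signature"], expected_sig):
            return False
        prev_hash = entry["hash"]
    return True
```

Because each entry's hash covers `prev_hash`, changing any earlier entry invalidates that entry and breaks the link from its successor, which is the property the paragraph above relies on.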
When an agent decides not to execute — because risk is too high, because the operation looks wrong — that decision is a first-class evidence entry with trigger and reason. Not a silent gap in the log.
Evidra is one platform with three operating surfaces:
| Surface | What it does |
|---|---|
| evidra CLI | Wraps live commands, imports completed operations, computes scorecards |
| evidra-mcp | Exposes the prescribe/report protocol to MCP-connected agents and runtimes |
| Self-hosted API | Centralizes evidence across agents, pipelines, and controllers, and provides team-wide analytics |
From the evidence chain, Evidra computes:
- Risk classification at operation time: risk_inputs, effective_risk, canonical action digest
- Behavioral signals: protocol violations, retry loops, blast radius detection
- Reliability scorecards: score, band, and confidence for comparing agents, sessions, and time windows
Evidra does not replace OTel, Datadog, or Logfire. They record execution telemetry. Evidra records what they cannot: intent before execution, structured decisions, and behavioral patterns across the agent lifecycle.
CLI and MCP are the authoritative analytics surfaces today.
# Homebrew
brew install samebits/tap/evidra
# Binary release (Linux/macOS)
curl -fsSL https://github.com/samebits/evidra/releases/latest/download/evidra_$(uname -s | tr '[:upper:]' '[:lower:]')_$(uname -m | sed 's/x86_64/amd64/;s/aarch64/arm64/').tar.gz \
| tar -xz -C /usr/local/bin evidra
# Build from source
make build

evidra keygen
export EVIDRA_SIGNING_KEY=<base64>
evidra record -f deploy.yaml -- kubectl apply -f deploy.yaml

For local smoke runs without a signing key:

export EVIDRA_SIGNING_MODE=optional

The output includes: risk_inputs, effective_risk, score, score_band, signal_summary, basis, and confidence.

evidra scorecard --period 30d
evidra explain --period 30d

Security boundary: evidra record executes the wrapped local command directly. Evidra does not sandbox the command. Treat it with the same trust model as direct shell execution: Evidra records evidence around the command; it does not contain it.
Evidra speaks MCP. The MCP server exposes the prescribe/report protocol to any MCP-connected agent or runtime.
evidra-mcp --evidence-dir ~/.evidra/evidence

The MCP server gives agents the tools. The skill teaches them when and how to use them. Agents with the skill achieve 100% protocol compliance for infrastructure mutations.

evidra skill install

How the protocol looks from the agent's perspective:
Agent: "I need to kubectl apply this deployment"
→ prescribe_smart(tool=kubectl, operation=apply, resource=deployment/web, namespace=default)
← prescription_id, effective_risk=medium, risk_inputs=[{source=evidra/matrix, ...}]
Agent: decides to proceed (or decline based on risk)
→ executes kubectl apply
→ report(prescription_id=..., verdict=success, exit_code=0)
← score=95, score_band=excellent, signal_summary={...}
If the agent decides not to act:
Agent: "Risk too high, declining"
→ report(prescription_id=..., verdict=declined, decision_context={
trigger: "risk_threshold_exceeded",
reason: "privileged container in production"
})
Declined verdicts are first-class evidence — not silent gaps in the log.
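The proceed-or-decline step in the two exchanges above can be sketched from the agent's side. The field names (`prescription_id`, `verdict`, `exit_code`, `decision_context`, `trigger`, `reason`) follow the examples above; everything else is an assumption for illustration, including the risk ordering, the `"failure"` verdict string, the `max_risk` threshold, and the function itself.

```python
from typing import Any, Callable, Dict

# Assumed ordering of risk levels; only "medium" appears in the example above.
RISK_ORDER = ["low", "medium", "high", "critical"]


def decide_and_report(
    prescription_id: str,
    effective_risk: str,
    max_risk: str,
    execute: Callable[[], int],
) -> Dict[str, Any]:
    """Build the report payload: decline above the threshold, else run and report."""
    if RISK_ORDER.index(effective_risk) > RISK_ORDER.index(max_risk):
        # The command is never run; the refusal itself becomes the evidence.
        return {
            "prescription_id": prescription_id,
            "verdict": "declined",
            "decision_context": {
                "trigger": "risk_threshold_exceeded",
                "reason": f"effective_risk={effective_risk} exceeds {max_risk}",
            },
        }
    exit_code = execute()
    return {
        "prescription_id": prescription_id,
        # "failure" as a verdict value is an assumption, not a confirmed string.
        "verdict": "success" if exit_code == 0 else "failure",
        "exit_code": exit_code,
    }
```

Either branch returns a payload for `report`, so a prescription is always closed out and never left as an unmatched entry.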
Proxy Observed — one config line, zero agent changes:
{
"mcpServers": {
"infra": {
"command": "evidra-mcp",
"args": ["--proxy", "--", "npx", "-y", "@anthropic/mcp-server-kubernetes"]
}
}
}

References: MCP setup guide · Skill setup guide · Execution schemas
The prescribe/report protocol also works without MCP. Two CLI modes feed the same lifecycle and scoring engine:
evidra record wraps a live command and records the full prescribe/execute/report lifecycle in one step. evidra import ingests a completed operation from structured input for pipelines that manage execution separately.
# Wrap a live command
evidra record -f deploy.yaml -- kubectl apply -f deploy.yaml
# Import a completed operation
evidra import --input record.json

Additional workflows: prescribe, report, scorecard, explain, compare, validate, import-findings.
References: CLI reference · Record/Import contract
Run the Evidra backend to centralize evidence collection across agents, pipelines, and GitOps controllers, and get team-wide analytics. Argo CD is controller-first in v1; webhook ingestion remains supported, but it is not the only GitOps path.
export EVIDRA_API_KEY=my-secret-key
docker compose up --build -d
curl http://localhost:8080/healthz

The CLI forwards evidence to the backend:
evidra record --url http://localhost:8080 --api-key my-secret-key \
  -f deploy.yaml -- kubectl apply -f deploy.yaml

With centralized evidence, platform teams can compare reliability across agents, pipelines, and controllers, detect fleet-wide patterns, and answer questions like: which agents have incomplete prescribe/report pairs this week? Which controller workflows are retrying the same reconciliation? Which actor has the highest retry loop rate?
References: Self-hosted setup · Argo CD GitOps integration · API reference · Setup Evidra Action · Terraform CI quickstart
Built-in adapters canonicalize artifacts across infrastructure tools into a normalized CanonicalAction model, enabling cross-tool comparison in a single evidence chain:
- Kubernetes-family YAML via kubectl, helm, kustomize, and oc
- Terraform plan JSON via terraform show -json
- Docker/container inspect JSON
- Generic fallback ingestion for unsupported tools
Full support details: Supported tools
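As a rough illustration of why a normalized model enables cross-tool comparison, the sketch below reduces an action to a fixed set of fields and hashes a stable serialization of them. The four fields mirror the prescribe_smart example earlier in this document; the real CanonicalAction model is likely richer, and the normalization rules, the `default` namespace fallback, and the SHA-256 digest are assumptions.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CanonicalAction:
    # Hypothetical minimal shape; not the actual Evidra data model.
    tool: str
    operation: str
    resource: str
    namespace: str


def normalize(raw: dict) -> CanonicalAction:
    """Map loosely formatted input onto one canonical representation."""
    return CanonicalAction(
        tool=raw["tool"].strip().lower(),
        operation=raw["operation"].strip().lower(),
        resource=raw["resource"].strip().lower(),
        namespace=raw.get("namespace", "default").strip().lower(),  # assumed fallback
    )


def intent_digest(action: CanonicalAction) -> str:
    """Digest over a key-sorted, whitespace-free JSON form of the action."""
    canonical = json.dumps(asdict(action), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()
```

Two inputs that differ only in key order, casing, or surrounding whitespace produce the same digest, so repeated attempts at the same intent can be recognized regardless of how each one was submitted.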
The evidence chain's prescribe/report structure makes agent behavior patterns visible without external instrumentation. Three signals fire immediately in real operations:
protocol_violation — a prescribe without a matching report (agent crashed, timed out, or skipped the protocol), a report without a prior prescribe (unauthorized action), duplicate reports, or cross-actor reports. This is the most operationally immediate signal — it fires whenever the protocol is broken.
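The four protocol_violation cases can be sketched as a single pass over the log. This is a simplified illustration, not Evidra's detector: the entry shape (`kind`, `prescription_id`, `actor`) and the violation labels are hypothetical, and a prescribe counts as unmatched if no report appears anywhere in the input, with no timeout handling.

```python
from collections import defaultdict


def find_protocol_violations(entries):
    """Return (prescription_id, violation) pairs for a list of log entries.

    Each entry is a dict with hypothetical keys: kind ("prescribe" or
    "report"), prescription_id, and actor.
    """
    prescribed_by = {}                 # prescription_id -> actor that prescribed
    report_count = defaultdict(int)    # prescription_id -> reports seen so far
    violations = []

    for entry in entries:
        pid = entry["prescription_id"]
        if entry["kind"] == "prescribe":
            prescribed_by[pid] = entry["actor"]
            continue
        # Everything below handles a report entry.
        if pid not in prescribed_by:
            violations.append((pid, "report_without_prescribe"))
        elif entry["actor"] != prescribed_by[pid]:
            violations.append((pid, "cross_actor_report"))
        if report_count[pid] > 0:
            violations.append((pid, "duplicate_report"))
        report_count[pid] += 1

    # Any prescription that never received a report is unmatched.
    for pid in prescribed_by:
        if report_count[pid] == 0:
            violations.append((pid, "prescribe_without_report"))
    return violations
```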
retry_loop — the same intent retried multiple times within a window, typically after failures. Indicates an agent stuck in a retry cycle. Fires when the same intent digest appears 3+ times in 30 minutes with prior failures.
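A sliding-window check along these lines is sketched below. The 3-attempt and 30-minute values come from the description above; the input shape and the reading of "with prior failures" as "at least one earlier attempt inside the window failed" are assumptions, and the function is illustrative rather than Evidra's actual detector.

```python
from datetime import datetime, timedelta


def detect_retry_loops(operations, min_attempts=3, window=timedelta(minutes=30)):
    """Return the set of intent digests that look like retry loops.

    operations: iterable of (timestamp, intent_digest, succeeded) tuples
    (a hypothetical shape). A digest is flagged once it has min_attempts
    occurrences inside the window and an earlier one of them failed.
    """
    recent_by_digest = {}
    flagged = set()
    for timestamp, digest, succeeded in sorted(operations, key=lambda op: op[0]):
        # Keep only earlier attempts that are still inside the window.
        recent = [
            (t, ok)
            for t, ok in recent_by_digest.get(digest, [])
            if timestamp - t <= window
        ]
        prior_failure = any(not ok for _, ok in recent)
        recent.append((timestamp, succeeded))
        recent_by_digest[digest] = recent
        if len(recent) >= min_attempts and prior_failure:
            flagged.add(digest)
    return flagged
```

Three attempts spread over 40 minutes would not fire under this sketch, because the first attempt has left the window by the time the third arrives.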
blast_radius — a destroy operation affecting more than 5 resources. Indicates a potentially high-impact deletion that warrants review.
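The threshold rule is small enough to state directly. In this sketch the `"destroy"` operation string and the function signature are assumptions; only the "more than 5 resources" limit is taken from the description above.

```python
BLAST_RADIUS_THRESHOLD = 5  # from the description above: "more than 5 resources"


def is_blast_radius(operation: str, affected_resources) -> bool:
    """True for a destroy operation touching more than the threshold."""
    # "destroy" as the operation label is an assumed spelling.
    return operation == "destroy" and len(affected_resources) > BLAST_RADIUS_THRESHOLD
```

Note the strict comparison: a destroy affecting exactly 5 resources does not fire, while 6 does.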
Additional signals (artifact_drift, new_scope, repair_loop, thrashing, risk_escalation) contribute to scoring and mature as evidence accumulates. All eight are documented in the Signal specification.
Scoring details: Scoring model · Default profile rationale
Architecture and protocol:
- V1 Architecture
- Prescribe/Report Protocol
- Core Data Model
- Canonicalization Contract
- Signal Specification
- Scoring Rationale
Integration and operations:
- CLI Reference
- MCP Setup Guide
- Skill Setup Guide
- API Reference
- Supported Tools
- Observability Quickstart
- Scanner SARIF Quickstart
- Self-Hosted Setup Guide
Developer references:
make build
make test
make e2e
make test-contracts
make test-mcp-inspector
make lint
make test-signals

Licensed under the Apache License 2.0.