The always-on judge for AI agents. Workflow memory + self-improving enforcement.
One-command install. Invisible hooks fire on every prompt, tool call, and session stop. Every agent session is tracked, judged for completeness, and improved from corrections. Works with Claude Code, Cursor, OpenAI Agents SDK, LangChain, CrewAI, and any MCP-compatible agent.
```bash
# Install (30 seconds, zero config)
curl -sL attrition.sh/install | bash
# That's it. Judge hooks activate automatically.
# Every session is now tracked and judged.

# View captured workflows
bp workflows

# Distill a frontier workflow for cheaper replay
bp distill <id> --target claude-sonnet-4-6

# Check judge corrections
bp judge --show-corrections
```

attrition.sh installs 4 hooks into your agent runtime. They fire automatically — no manual invocation.
| Hook | When | What |
|---|---|---|
| `on-session-start` | Agent starts | Resume incomplete workflows from prior sessions |
| `on-prompt` | User types a prompt | Detect workflow patterns, inject required steps into context |
| `on-tool-use` | Each tool call | Track evidence, nudge when required steps are missing |
| `on-stop` | Agent tries to stop | Full completion judge: block if mandatory steps are missing |
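A minimal sketch of what the first three hooks might do against a simple workflow memory; every name here is illustrative rather than the installed hook API (the `on-stop` verdict is sketched after the verdict table below):

```python
# Illustrative only: WorkflowMemory and these handlers are hypothetical
# stand-ins for what the installed hooks do, not the shipped API.
from dataclasses import dataclass, field

@dataclass
class WorkflowMemory:
    required_steps: list[str]                      # steps the workflow mandates
    evidence: dict[str, str] = field(default_factory=dict)

    def missing(self) -> list[str]:
        return [s for s in self.required_steps if s not in self.evidence]

def on_session_start(memory: WorkflowMemory) -> str | None:
    # Resume an incomplete workflow from a prior session.
    gaps = memory.missing()
    return f"Resuming workflow; outstanding steps: {gaps}" if gaps else None

def on_prompt(prompt: str, memory: WorkflowMemory) -> str:
    # Inject the workflow's required steps into the agent's context.
    return prompt + "\n\nRequired steps: " + ", ".join(memory.required_steps)

def on_tool_use(tool_name: str, output: str, memory: WorkflowMemory) -> None:
    # Record evidence that a required step actually ran.
    if tool_name in memory.required_steps:
        memory.evidence[tool_name] = output
```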
| Verdict | Action |
|---|---|
| `correct` | Allow stop. All required steps have evidence. |
| `partial` | Allow stop. Minor steps missing, logged for learning. |
| `escalate` | Strong nudge. >50% of steps missing; the agent should continue. |
| `failed` | Block stop. <50% of mandatory steps done; lists the missing steps. |
Corrections feed back into workflow definitions. When the judge notices repeated patterns — "you forgot the search step" or "you skipped QA" — it tightens enforcement automatically. Inspired by Meta's HyperAgents DGM-H architecture: the judge improves its own improvement process.
```text
Session → Judge scores completeness → Correction detected
                                              ↓
Workflow memory ←←←←←←←←←←←←←←← Tighten enforcement
        ↓
Next session → Better coverage → Higher scores
```
attrition.sh works with every major agent runtime:
| Runtime | Integration |
|---|---|
| Claude Code | Native hooks (PostToolUse, Stop, SessionStart, UserPromptSubmit) |
| Cursor | MCP server + rule injection |
| Windsurf | MCP server + rule injection |
| OpenAI Agents SDK | TracingProcessor for span-level tracking |
| Anthropic SDK | Monkey-patches Messages.create |
| LangChain | Callback handler |
| CrewAI | @before_tool_call / @after_tool_call decorators |
| PydanticAI | OTEL/logfire integration |
| Any MCP client | JSON-RPC bp.judge.* tools |
```python
# Python SDK — one line for any provider
from attrition import track

track()  # Auto-detects and patches your agent runtime
```
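Under the hood, the Anthropic SDK row in the table above ("Monkey-patches Messages.create") plausibly amounts to wrapping the SDK method. A rough sketch under that assumption; `record_event` and the exact import path are illustrative and may differ across SDK versions:

```python
import anthropic
from anthropic.resources.messages import Messages  # path may vary by SDK version

def record_event(**event):
    # Stand-in for attrition's event sink; the real hook would append a
    # canonical event to SQLite workflow memory via the local bp server.
    print("event:", event)

_original_create = Messages.create

def _tracked_create(self, *args, **kwargs):
    # Forward the call unchanged, then emit an event for the judge.
    response = _original_create(self, *args, **kwargs)
    record_event(kind="llm_call", model=kwargs.get("model"))
    return response

Messages.create = _tracked_create  # every client.messages.create() is now tracked
```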
12-crate Rust workspace:

```text
attrition/
  rust/crates/
    core/        Core types, config, error handling
    workflow/    Canonical event capture + SQLite storage
    distiller/   4-strategy workflow compression (40-65% reduction)
    judge/       Always-on judge engine (verdict + nudge + attention)
    llm-client/  Anthropic Messages API client
    api/         Axum HTTP API server
    mcp-server/  MCP protocol (12 bp.* tools)
    qa-engine/   Browser automation, crawling, UX audit
    agents/      Multi-agent orchestration
    cli/         CLI binary (bp), 11 subcommands
    telemetry/   Structured logging via tracing
    sdk/         Rust SDK client
  frontend/      React 19 + Vite + TypeScript
```
```text
Agent session (any provider)
        |
always-on hooks --> Canonical events --> SQLite workflow memory
        |
bp distill --> Eliminate redundant steps (40-65%)
            |  Extract copy-paste blocks
            |  Compress reasoning
            |  Insert checkpoints
        |
judge (automatic) --> Compare expected vs actual
                   |  Nudge on divergence
                   |  Block on failure
                   |  Learn from corrections
```
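A toy sketch of three of the four distillation strategies from the diagram above, operating on a flat event list (copy-paste block extraction omitted for brevity); the event shape and thresholds are assumptions, and the real `distiller` crate is Rust:

```python
def distill(events: list[dict]) -> list[dict]:
    # 1. Eliminate redundant steps: drop exact repeats of (tool, input).
    seen, kept = set(), []
    for ev in events:
        key = (ev["tool"], ev["input"])
        if key not in seen:
            seen.add(key)
            kept.append(dict(ev))
    # 2. Compress reasoning: truncate long free-text reasoning events.
    for ev in kept:
        if ev["tool"] == "reasoning" and len(ev["input"]) > 200:
            ev["input"] = ev["input"][:200] + "..."
    # 3. Insert checkpoints so a cheaper model can resync mid-replay.
    out = []
    for i, ev in enumerate(kept, 1):
        out.append(ev)
        if i % 5 == 0:
            out.append({"tool": "checkpoint", "input": f"step {i} complete"})
    return out
```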
| Tool | Description |
|---|---|
| `bp.judge.start` | Start a judge session for workflow replay |
| `bp.judge.event` | Report an actual event, get a nudge if divergent |
| `bp.judge.verdict` | Finalize the session, produce a verdict |
| `bp.capture` | Parse a session, save it as a replayable workflow |
| `bp.workflows` | List all captured workflows |
| `bp.distill` | Distill a workflow for cheaper model replay |
| `bp.check` | Full QA check |
| `bp.sitemap` | Crawl + sitemap |
| `bp.ux_audit` | 21-rule UX audit |
| `bp.diff_crawl` | Before/after comparison |
| `bp.workflow` | Start workflow recording |
| `bp.pipeline` | Full QA pipeline |
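Because these are plain MCP tools, any JSON-RPC client can drive them. A hedged sketch against a local `bp serve`; the `/mcp` path and the `workflow_id` argument name are assumptions (only the `tools/call` envelope is standard MCP):

```python
import json
import urllib.request

payload = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",                      # standard MCP method
    "params": {
        "name": "bp.judge.start",
        "arguments": {"workflow_id": "wf_123"},  # hypothetical argument name
    },
}
req = urllib.request.Request(
    "http://localhost:8100/mcp",                 # assumed path on `bp serve --port 8100`
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))
```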
| Command | Description |
|---|---|
| `bp serve` | Start the API + MCP server (judge hooks via HTTP) |
| `bp capture <path>` | Capture an agent session as a workflow |
| `bp workflows` | List all captured workflows |
| `bp distill <id>` | Distill a workflow for cheaper model replay |
| `bp judge <id>` | Start a judge session for replay verification |
| `bp check <url>` | Run a QA check |
| `bp sitemap <url>` | Crawl and generate a sitemap |
| `bp audit <url>` | 21-rule UX audit |
| `bp diff <url>` | Before/after comparison crawl |
| `bp pipeline <url>` | Full QA pipeline |
| `bp health` | Server health status |
| `bp info` | Version and system info |
```bash
cargo build --workspace                  # Build all 12 crates
cargo test --workspace                   # 87 tests
cargo build --release -p attrition-cli   # Release binary
bp serve --port 8100                     # API + MCP
cd frontend && npm run dev               # Frontend on 5173
```

- Website: attrition.sh
- GitHub: github.com/HomenShum/attrition
- Inspiration: Meta HyperAgents (self-improving agent meta-loop)
License: MIT
We applied the same behavioral design principles that made Linear, Perplexity, ChatGPT, Notion, and Vercel feel premium — and found that both NodeBench and attrition.sh violate all five of them. This section is a permanent record of that audit and the execution plan.
What premium products do: ChatGPT has one text box. Perplexity has one search bar. Linear lets you create an issue in 3 seconds from Cmd+K. The first pixel IS the first action.
What we do wrong: Both products lead with explanation pages, competitive tables, feature cards, and navigation systems. The user must understand what we are before they can use us.
Fix: The first thing on screen must be the thing you do. For attrition: a scan input. For NodeBench: the Ask search bar. Everything else is below the fold.
What premium products do: Linear renders in sub-50ms. ChatGPT streams responses so 3 seconds feels like watching someone think. Perplexity shows sources progressively.
What we do wrong: Attrition's chat panel has hardcoded fake delays. Cloud Run cold starts take 1-5s with no feedback. NodeBench's pipeline has no progressive streaming of answer sections. No skeleton loading on surface transitions.
Fix: Hard latency budgets — first visible response < 800ms, first source < 2s, first complete section < 5s. Progressive rendering, not batch reveals.
What premium products do: Every ChatGPT conversation is a screenshot people share. Every Perplexity answer has a shareable URL with citations. TikTok watermarks videos for cross-platform sharing.
What we do wrong: Neither product generates shareable URLs for results. No screenshot-worthy artifact. No "send this to a colleague" moment.
Fix: Generate shareable result URLs (/scan/:id, /report/:id) that render without auth. Design result cards as screenshot-worthy single visuals.
What premium products do: Linear has Cmd+K everywhere. ChatGPT's absence of UI IS the UI. Products meet users in their existing workflow, not in a new navigation system.
What we do wrong: Attrition has 11 pages with 4+ nav tabs. NodeBench has 5 surfaces with sidebar + top nav + bottom nav. Users must learn a navigation system before getting value.
Fix: Make chat/search the primary surface. Everything reachable from one input. URL-based queries (?q= or ?scan=) that skip all navigation.
What premium products do: TikTok's algorithm gets better with every swipe. ChatGPT's memory makes later interactions more relevant. Notion AI fits into existing blocks.
What we do wrong: No visible learning in either product. The infrastructure exists (correction learner, Me context, workflow memory) but nothing in the UI says "I'm getting better for you."
Fix: Show "based on your previous N sessions" suggestions. Show correction learning visibly. Make returning users see personalized context that proves the product knows them.
attrition.sh: 11 pages (Landing, Proof, Improvements, Get Started, Live, Workflows, Judge, Anatomy, Benchmark, Compare, Chat) for a product that does ONE thing — catch when agents skip steps. Should be 3 surfaces: scanner + chat + docs.
NodeBench: 5 surfaces (Ask, Workspace, Packets, History, Connect) plus Oracle, flywheel, trajectory, benchmark, and dogfood surfaces. The MCP server has 350+ tools across 57 domains. Should follow the Addy Osmani agent-skills pattern: each skill = ONE thing, ONE workflow.
Both products have MCP tool registries that grew by accretion, not by design.
NodeBench MCP: 350+ tools, 57 domains, progressive discovery layers, analytics client, embedding index, dashboard launcher, profiling hooks — all in the boot path. Performance is self-benchmarked, not user-value-benchmarked.
attrition MCP: 12 tools where 6 would do. bp.sitemap, bp.ux_audit, bp.diff_crawl, bp.workflow, bp.pipeline, bp.workflows are sub-features of bp.check and bp.capture.
What good looks like (Addy Osmani's agent-skills):
- Each skill is ONE thing with ONE workflow
- README shows: what it does, how to use it, what you get
- No discovery layer — install what you want
- No 350-tool registry — 5 skills that each do 1 thing well
| # | Principle | Fix | Metric to enforce | Ship order |
|---|---|---|---|---|
| 1 | Value before identity | First pixel = input field, not explanation | Time from load to first action < 5s | Week 1 |
| 2 | Speed as feature | Progressive rendering, remove fake delays, hard latency budgets | First visible result < 800ms | Week 1 |
| 3 | Output = distribution | Shareable result URLs, screenshot-worthy cards | Every result has a shareable URL | Week 2 |
| 4 | Meet users where they are | Chat/search as primary surface, collapse nav | User can do everything from one input | Week 2 |
| 5 | Product improves itself | Visible learning, personalized suggestions | Returning user sees context from prior sessions | Week 3 |
| 6 | MCP discipline | Reduce to core tools, one workflow per skill | attrition: 6 tools. NodeBench: skill-based, not registry-based | Week 3 |
- One dominant job per screen — Notion frames the problem as software sprawl. The fix is subtracting tools, not adding surfaces.
- Trust comes from visible reasoning, not decorative UI — Linear and Perplexity build trust through transparent reasoning and cited sources, not bordered cards.
- Speed is product behavior, not backend optimization — If it takes >200ms, make it faster. Premium feel comes from response cadence and zero hesitation.
- Quality is a system, not a cleanup sprint — Linear has Quality Wednesdays (1,000+ small fixes) and zero-bugs policy (fix now or explicitly decline).
- The product gets more useful as it knows more context — ChatGPT memory, Notion AI in existing blocks, Perplexity exportable artifacts.
Without a permanent quality lane, the UI will drift back into inconsistency.
- Weekly: papercut pass — motion, spacing, hover, focus, empty-state review
- Per-push: no bug backlog dumping — bugs are fixed now or explicitly declined
- Instrumented: time-to-value metrics, not just render counts
Time-to-value timestamps: `ask_submitted_at`, `first_partial_answer_at`, `first_source_at`, `first_saved_report_at`, `first_return_visit_at`.
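A minimal sketch of capturing those five timestamps as a first-occurrence funnel; the class and storage are illustrative, not either product's actual telemetry:

```python
import time
from dataclasses import dataclass, field

@dataclass
class TimeToValue:
    marks: dict[str, float] = field(default_factory=dict)

    def mark(self, name: str) -> None:
        # Record only the first occurrence of each funnel event.
        self.marks.setdefault(name, time.time())

funnel = TimeToValue()
funnel.mark("ask_submitted_at")
funnel.mark("first_partial_answer_at")
# later: first_source_at, first_saved_report_at, first_return_visit_at
```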
Both products should feel like Perplexity for their domain: one input, one answer, shareable results, visibly getting smarter.
Not: multi-surface dashboards with competitive comparison tables and 350-tool registries.
- Linear on speed + transparent reasoning
- Perplexity answer engine model
- Notion on software sprawl
- Vercel virtual product tour
- ChatGPT memory + connected apps
- Addy Osmani agent-skills
- Meta HyperAgents
- Linear Quality Wednesdays
- Linear Zero-bugs policy
- Full audit: docs/BEHAVIORAL_DESIGN_AUDIT.md
- NodeBench AI = flagship user surface
- nodebench-mcp = embedded workflow lane
- Attrition.sh = measured replay + optimization lane
Attrition is NOT a third flagship. It is the measurable optimization lane for the same NodeBench workflow. One job: capture, measure, compress, replay, prove savings.
Full spec: docs/THREE_PRODUCT_STACK_SPEC.md