One-command shakedown for any codebase. Architecture, security, performance, resilience, docs - what's broken, what to fix first, and whether you or your agents can handle it.
Works on anything you can point it at: agent systems, pipelines, web apps, CLIs, microservices, infra.
It's a Claude Code Skill - six phases, one report, fixes ranked by impact.
I made this because I kept writing the same audit prompt from scratch every time I wanted to properly review a project. Started with a 12-agent pipeline on Paperclip, turned it into something that works on any project.
I was running a 12-agent intelligence pipeline and every time I wanted to review the whole thing, I'd write a long audit prompt from scratch. Every time, I'd forget something. Sometimes security. Sometimes backup validation. Sometimes I just wouldn't look at how the agents coordinate with each other.
It's never the obvious stuff. It's always the thing you assumed was fine until it wasn't.
So I made it into a skill. Same checklist, same order, every time. Now I run one command and know I'm not skipping anything. It also finds things I wasn't looking for, which surprised me. After using it on a few projects I figured other people might find it useful too.
One command. Full analysis. Actionable output.
/shakedown
The audit discovers your project structure dynamically, reads everything, queries databases, tests backup integrity, and produces a structured report covering:
- Architecture and code quality - design patterns, MECE analysis, contradictions, algorithm efficiency, dependency graph, test coverage
- Error handling and resilience - crash scenarios, timeout coverage, silent failures, data integrity, edge cases, retry patterns, graceful degradation
- Performance and bottleneck analysis - timing, parallelism, scaling limits, resource waste, cost analysis
- Code and storage efficiency - empty files, duplicates, dead dependencies, build artifacts, storage bloat
- Security and data exposure - secrets, injection vulnerabilities, PII, supply chain, workflow security, licensing compliance
- Logging and observability - structured logs, traceability, alerting, monitoring
- Documentation quality - accuracy vs codebase, completeness, onboarding readiness
- Value assessment - problem clarity, target audience, maturity vs claims, differentiation, adoption readiness
- Agent skill compliance - agentskills.io spec validation (conditional, for skill projects)
- Production readiness - 10-gate PASS/PARTIAL/FAIL checklist (you call the ship/no-ship yourself)
- Ranked recommendations - top 10 actions with impact, effort, and who implements
Clone the full skill (recommended — includes reference checklists for deeper analysis):
git clone https://github.com/belousov-petr/shakedown.git
mkdir -p ~/.claude/skills/shakedown
cp -r shakedown/SKILL.md shakedown/references ~/.claude/skills/shakedown/

Or grab just the core skill file (works but loses 11 reference checklists — shallower audit):
mkdir -p ~/.claude/skills/shakedown
curl -o ~/.claude/skills/shakedown/SKILL.md \
https://raw.githubusercontent.com/belousov-petr/shakedown/master/SKILL.md

Restart Claude Code. It shows up as /shakedown.
Use the .agents/skills/ path for compatibility with Cursor, Gemini CLI, Copilot, and 30+ other clients:
git clone https://github.com/belousov-petr/shakedown.git
mkdir -p ~/.agents/skills/shakedown
cp -r shakedown/SKILL.md shakedown/references ~/.agents/skills/shakedown/

Or install at project level (travels with the repo):
mkdir -p .agents/skills/shakedown
git clone https://github.com/belousov-petr/shakedown.git /tmp/shk
cp -r /tmp/shk/SKILL.md /tmp/shk/references .agents/skills/shakedown/
rm -rf /tmp/shk

Looks at the project before assuming anything. Maps the structure, checks git history, reads the README, detects if it's an agent skill (triggers extra checks in Phase 4), then asks you to confirm scope before burning tokens.
Sends 4 agents to read everything at once. One covers config and architecture, one covers execution logic, one reads outputs and docs, one counts files and checks data stores.
If there's a database, it connects and queries it. Checks table sizes, failure rates, data freshness. If there's no database, it skips this.
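The database checks amount to questions like "how big is this table?" and "when did it last receive data?". Here's a hedged sketch of that idea — table and column names are hypothetical, and it runs against an in-memory SQLite database so the example is self-contained, while the actual audit queries whatever database the project exposes (e.g. a live Postgres):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Illustrative Phase-3-style checks: table size and data freshness.
# The "events" table and "created_at" column are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT)")
now = datetime.now(timezone.utc)
rows = [(i, (now - timedelta(hours=i * 12)).isoformat()) for i in range(10)]
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

def table_report(conn, table, ts_column, stale_after=timedelta(days=2)):
    """Return row count and whether the newest row is older than the threshold."""
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    newest = conn.execute(f"SELECT MAX({ts_column}) FROM {table}").fetchone()[0]
    age = datetime.now(timezone.utc) - datetime.fromisoformat(newest)
    return {"table": table, "rows": count, "stale": age > stale_after}

print(table_report(conn, "events", "created_at"))
# → {'table': 'events', 'rows': 10, 'stale': False}
```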
This is where the actual opinions come in. Architecture review, error handling audit, performance analysis, storage efficiency, resource waste. If the project is an agent skill, it also gets evaluated against the agentskills.io specification. Everything quantified where possible.
Security scan, PII check, documentation accuracy, whether the project actually does what it claims to do. Includes a value assessment — does this project solve a real problem, for a clear audience, with measurable value? Ends with the ranked recommendations and the uncomfortable question.
Checks whether backups actually restore (not just whether they exist). Traces what happens when components fail.
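"Restore, don't just check existence" can be sketched in a few lines. This is an illustrative example, not the skill's actual test harness — it uses SQLite's backup API so the whole thing runs standalone; the real audit adapts the same idea to whatever backup mechanism the project uses:

```python
import sqlite3

# Sketch of "backups must restore, not just exist": back up a live SQLite DB,
# restore it into a fresh connection, and compare row counts.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
src.executemany("INSERT INTO jobs (status) VALUES (?)", [("done",)] * 5)
src.commit()

backup = sqlite3.connect(":memory:")
src.backup(backup)  # take the backup

restored = backup.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
original = src.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
assert restored == original, "backup does not restore cleanly"
print(f"backup verified: {restored} rows restored")
```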
Go to your project directory and run:
/shakedown
It also activates from natural language. These phrases trigger the full audit:
audit this project
find the weak spots
how solid is this project
how mature is this project
what would break first
where does this need tightening
assess technical debt
do a project health check
Claude Code matches your request against the skill's description — any phrasing around auditing, health checking, or stress-testing a whole project should trigger it. It won't activate for simple code reviews or single-file analysis.
- What the project is (derived from reading, not assumed)
- What's genuinely good, with evidence
- What will break soon, with evidence
- Architecture problems, MECE gaps, contradictions, algorithm efficiency
- Error handling: crash paths, silent failures, edge cases, retry patterns
- Performance: bottlenecks, waste, cost
- Code and storage efficiency: empty files, duplicates, dead deps, bloat
- Agent skill compliance (if applicable): spec, description, instructions, evals
- Security: secrets, injection risks, PII, licensing
- Logging and monitoring gaps
- Documentation: accuracy, completeness, onboarding
- Goal fulfillment: stated vs actual behavior
- Blind spots nobody is watching
- Ratings (8 dimensions, scored 1-10)
- Overall score with justification
- Production readiness (10 gates; you call the ship/no-ship yourself)
- Top 10 ranked fixes with effort estimates
- Value assessment: problem clarity, audience, maturity, differentiation
- The uncomfortable question
The reference files in references/ are self-contained checklists. You can use them individually for targeted reviews without running the full 6-phase audit. Just point your agent at the specific file:
| If you want to check... | Use this file |
|---|---|
| Architecture, MECE gaps, algorithms, test coverage | references/architecture-quality.md |
| Error handling, crash paths, resilience | references/error-resilience.md |
| Performance, bottlenecks, scaling, cost | references/performance-analysis.md |
| File waste, duplicates, dead dependencies | references/storage-efficiency.md |
| Agent skill spec compliance | references/skill-standards.md |
| Security, secrets, injection, PII, licensing | references/security-checklist.md |
| Logging, docs quality, blind spots | references/operational-health.md |
| Problem clarity, audience, maturity, value | references/value-assessment.md |
| Database health, schema, freshness | references/db-diagnostics.md |
| Backup validation, disaster recovery | references/resilience-testing.md |
Example: "Review this project's security using the checklist in references/security-checklist.md"
Anthropic shipped /ultrareview in April 2026 alongside Opus 4.7. It's a PR-time bug hunter that runs in Anthropic's cloud, spawns multiple reviewer agents, and reproduces every finding before reporting. Worth using. But it's a different job from this skill.
| /ultrareview | /shakedown | |
|---|---|---|
| Scope | a branch / PR diff | the whole project |
| Trigger | before merge | any time, not tied to a PR |
| Runtime | Anthropic's cloud sandbox | your local Claude Code |
| Output | reproduced bug list (logic, edge cases, security, perf) | 14-section report — architecture, security (OWASP depth), performance, resilience, docs, value assessment, a readiness checklist, the uncomfortable question |
| Cost | 3 free on Pro/Max, then $5-20 per run | free |
| Works on | git repos with diffs | any project — non-git, side projects, skills, pipelines |
Use /ultrareview before merging a PR. Use /shakedown when you want to know what's weak across the whole project.
The report is built so you can act on it immediately. Here's what you can say after the audit finishes:
| Say this | What happens |
|---|---|
| Fix them all | Starts implementing all recommendations in priority order |
| Fix the critical ones | Only tackles items rated Critical or FAIL |
| Explain recommendation #3 in detail | Deep-dive into a specific finding with implementation steps |
| Re-run Phase 5 only | Re-checks just one phase after you've made changes |
| Create GitHub issues for each recommendation | Turns findings into trackable issues |
| Prioritize for a solo developer | Filters by what a human needs to do vs what agents can handle |
| Compare with the last audit | If you've run it before, diffs the reports to show progress |
A few rules I keep coming back to when reviewing my own projects:
- Look at the project before making assumptions about it. Map first, read second.
- Check with the user before going deep. "This is what I found, this is what I'll audit - sound right?" saves everyone's time.
- Don't critique what you haven't read.
- Put numbers on things. "23% duplicate rate" is useful. "Some duplicates" is not.
- Compare what the docs say against what the code does. The gap between those two is where most problems hide.
- Every recommendation answers four things: what, why, how much work, who does it. Anything less is just complaining.
- The audit should be useful now, not next sprint. If you can say "fix them all" and start working immediately, it did its job.
- Test the safety nets. Backups exist? Restore one. Retry logic? Trace what happens when it fires. Don't report that something exists - report whether it works.
- Find what's wrong, not just what's right. The point is to make the project better, not to feel good about it.
- Surface the constraints nobody talks about: rate limits, daily budgets, peak hour pricing. Those shape what's actually possible more than architecture does.
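The four-field rule for recommendations can be made mechanical. A hypothetical sketch — the field names are illustrative, not the skill's actual report schema:

```python
from dataclasses import dataclass

# Illustrative version of the "what / why / effort / owner" rule: a
# recommendation only counts as actionable when all four fields are filled in.
@dataclass
class Recommendation:
    what: str    # the concrete change
    why: str     # evidence-backed impact
    effort: str  # e.g. "30 min", "2 days"
    owner: str   # "human" or "agent"

    def is_actionable(self) -> bool:
        return all([self.what, self.why, self.effort, self.owner])

rec = Recommendation(
    what="Add retry with backoff to the fetch stage",
    why="3 silent failures/day in logs; data gaps downstream",
    effort="1 hour",
    owner="agent",
)
print(rec.is_actionable())  # → True
print(Recommendation("Fix stuff", "", "", "").is_actionable())  # → False
```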
| Project type | What gets checked |
|---|---|
| Agent systems (Paperclip, CrewAI, AutoGen) | Instructions, heartbeats, coordination, pipeline flow, signal quality |
| Web apps | Routes, API design, auth, DB schema, frontend/backend split |
| Data pipelines | Stage flow, data integrity, scheduling, error handling, throughput |
| CLI tools | Argument handling, error messages, edge cases, docs |
| Monorepos | Package boundaries, dependencies, build system, cross-package consistency |
| Microservices | Service boundaries, API contracts, resilience, observability |
Tested on 3 real projects: a personal media content pipeline (Python, ~56 files, yt-dlp + Whisper + static HTML gallery), a 17-agent Claude-based intelligence pipeline (live Postgres, 3,344 LOC of agent instructions + ~1,500 LOC Python), and a browser-automation tool (~1,950 LOC JavaScript, 4 files). Each project was audited twice — once with the skill active, once with a bare "audit this project" prompt. Both runs used Claude Opus 4.7 for a fair comparison. Output was graded against scripts/validate-output.py (21 structural checks, plus 1 conditional Skill Standards check for audits that cover agent-skill projects).
| Project | With skill | Without skill | Delta |
|---|---|---|---|
| Personal media content pipeline (medium Python) | 22/22 (100%) | 0/21 (0%) | +100% |
| 17-agent intelligence pipeline (large) | 22/22 (100%) | 2/21 (10%) | +90% |
| Browser automation tool (small JS) | 22/22 (100%) | 1/21 (5%) | +95% |
| Average | 100% | 5% | +95% |
The bare prompts find real bugs. They just don't organize them. Without the skill — even on Opus 4.7 — no readiness table, no ratings, no ranked fix list, no uncomfortable question. The findings are in there somewhere, but you'd have to re-read everything to act on them. Full reports in examples/.
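The structural grading works roughly like this — a simplified sketch in the spirit of scripts/validate-output.py, not the actual validator; the section names here are illustrative:

```python
# Simplified structural grading: check that an audit report contains the
# sections that make it actionable. The section list is illustrative.
REQUIRED_SECTIONS = [
    "## Ratings",
    "## Production Readiness",
    "## Top 10 Recommendations",
    "## The Uncomfortable Question",
]

def grade(report: str) -> tuple[int, int]:
    """Return (checks passed, checks total) for one report."""
    passed = sum(1 for section in REQUIRED_SECTIONS if section in report)
    return passed, len(REQUIRED_SECTIONS)

skill_report = "\n".join(REQUIRED_SECTIONS)         # structured output
bare_report = "Found a SQL injection in ingest.py"  # real findings, no structure
print(grade(skill_report), grade(bare_report))  # → (4, 4) (0, 4)
```

This is why the baselines score near zero: the bugs are present in the prose, but the checks look for structure, and structure is what lets you act without re-reading everything.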
The skill itself costs ~3,500 tokens to load (the SKILL.md file). Measured token usage across the 3 reference audits above (Opus 4.7, 1M context):
| Project type | Files read | Tokens | Duration | Example |
|---|---|---|---|---|
| Small CLI/browser tool (~2K LOC) | 15 | ~95K | ~15 min | benchmark-audit-c.md |
| Medium personal pipeline (~56 files + JSONL manifests) | 32 | ~165K | ~45 min | benchmark-audit-a.md |
| Large agent system (17 agents + live Postgres + 4K LOC) | 34 | ~125K | ~32 min | benchmark-audit-b.md |
Token usage scales roughly with source-code size + database query count, not file count — the large agent system had fewer source files than the medium pipeline but hit a live Postgres with 68 tables during Phase 3.
shakedown/
├── SKILL.md # The skill - orchestrator (479 lines)
├── references/
│ ├── architecture-quality.md # Section 4.4: structure, MECE, algorithms, tests
│ ├── error-resilience.md # Section 4.5: crashes, timeouts, edge cases
│ ├── performance-analysis.md # Section 4.6: timing, scaling, cost
│ ├── storage-efficiency.md # Section 4.7: empty files, duplicates, bloat
│ ├── skill-standards.md # Section 4.8: agentskills.io compliance
│ ├── db-diagnostics.md # Phase 3: database-specific queries
│ ├── security-checklist.md # Section 5.1: OWASP-derived — secrets, injection, PII, licensing, LLM/Agentic/MCP risks
│ ├── operational-health.md # Sections 5.2/5.3/5.5: logging, docs, blind spots
│ ├── value-assessment.md # Section 5.10: problem, audience, maturity
│ ├── resilience-testing.md # Phase 6: backup and resilience tests
│ └── gotchas.md # 10 common agent audit mistakes
├── examples/
│ ├── benchmark-audit-a.md # With-skill audit: content pipeline
│ ├── benchmark-audit-b.md # With-skill audit: agent pipeline
│ ├── benchmark-audit-c.md # With-skill audit: browser tool
│ ├── baseline-audit-a.md # Without-skill baseline: content pipeline
│ ├── baseline-audit-b.md # Without-skill baseline: agent pipeline
│ └── baseline-audit-c.md # Without-skill baseline: browser tool
├── evals/
│ ├── evals.json # 7 test cases with 51 assertions
│ └── README.md # How to run and grade evals
├── scripts/
│ └── validate-output.py # 22-check output completeness validator
├── README.md
└── LICENSE # MIT
The main SKILL.md is a clean orchestrator — phases, flow, output templates. All detailed checklists live in references/ (11 files, 3,202 lines) and are loaded on demand. This keeps activation cost low while preserving depth. Each reference file includes scope boundary notes to prevent overlap.
Before you use this on anything serious.
- LLM audits spot patterns, not truth. What comes back is a pile of "this looks suspicious" — still on you to read the code, check whether it's actually a problem, and figure out the fix. Especially for security, or anything users actually touch. If a finding looks important, don't take my word for it. Read the code.
- Don't mistake this for due diligence. Compliance, audit trails, threat models, multiple humans signing off — none of that is here. What you get is a first pass from one LLM. Useful, not definitive.
- The fit is small stuff you care about but haven't cleaned up. Solo tools, prototypes, side experiments, things that have been sitting untouched for months. The idea is to un-slop them before the mess starts dictating decisions, and to walk away with a shortlist you can actually work through. For anything with real stakes, pay a human to look.
If you've run this and found gaps, I'd like to hear about it. Open an issue or PR with:
- What kind of project you audited
- What the skill should have checked but didn't
- What you'd add to fill that gap
MIT. Use it, fork it, ship it — credit appreciated but not required.
Claude Code wrote this. I designed the audit flow and check structure, decided what gets audited and why, and directed the work. Claude did the typing - skill, references, tests, docs. For security and skill-compliance specifically, I reused two established sources instead of inventing my own:
- Agent Skills specification. The skill is built to conform to it. Run npx skills-ref validate ./shakedown to check your install (frontmatter, naming, directory structure).
- OWASP GenAI Security Project. The security checks, questions, and red-team procedures in references/security-checklist.md come straight from OWASP's GenAI publications - LLM Top 10, Agentic Top 10, Secure MCP, GenAI Data Security, Governance Checklist, Red Teaming Guide. If your project touches LLMs or agents, those documents are worth reading directly.
Petr Belousov
- GitHub: @belousov-petr
- LinkedIn: petrbelousov
