
Shakedown

One-command shakedown for any codebase. Architecture, security, performance, resilience, docs - what's broken, what to fix first, and whether you or your agents can handle it.

Works on anything you can point it at: agent systems, pipelines, web apps, CLIs, microservices, infra.

It's a Claude Code Skill - six phases, one report, fixes ranked by impact.

I made this because I kept writing the same audit prompt from scratch every time I wanted to properly review a project. Started with a 12-agent pipeline on Paperclip, turned it into something that works on any project.


Why this exists

I was running a 12-agent intelligence pipeline and every time I wanted to review the whole thing, I'd write a long audit prompt from scratch. Every time, I'd forget something. Sometimes security. Sometimes backup validation. Sometimes I just wouldn't look at how the agents coordinate with each other.

It's never the obvious stuff. It's always the thing you assumed was fine until it wasn't.

So I made it into a skill. Same checklist, same order, every time. Now I run one command and know I'm not skipping anything. It also finds things I wasn't looking for, which surprised me. After using it on a few projects I figured other people might find it useful too.

What it does

One command. Full analysis. Actionable output.

/shakedown

The audit discovers your project structure dynamically, reads everything, queries databases, tests backup integrity, and produces a structured report covering:

  • Architecture and code quality - design patterns, MECE analysis, contradictions, algorithm efficiency, dependency graph, test coverage
  • Error handling and resilience - crash scenarios, timeout coverage, silent failures, data integrity, edge cases, retry patterns, graceful degradation
  • Performance and bottleneck analysis - timing, parallelism, scaling limits, resource waste, cost analysis
  • Code and storage efficiency - empty files, duplicates, dead dependencies, build artifacts, storage bloat
  • Security and data exposure - secrets, injection vulnerabilities, PII, supply chain, workflow security, licensing compliance
  • Logging and observability - structured logs, traceability, alerting, monitoring
  • Documentation quality - accuracy vs codebase, completeness, onboarding readiness
  • Value assessment - problem clarity, target audience, maturity vs claims, differentiation, adoption readiness
  • Agent skill compliance - agentskills.io spec validation (conditional, for skill projects)
  • Production readiness - 10-gate PASS/PARTIAL/FAIL checklist (you call the ship/no-ship yourself)
  • Ranked recommendations - top 10 actions with impact, effort, and who implements

Installation

Claude Code (CLI or Desktop)

Clone the full skill (recommended — includes reference checklists for deeper analysis):

```shell
git clone https://github.com/belousov-petr/shakedown.git
mkdir -p ~/.claude/skills/shakedown
cp -r shakedown/SKILL.md shakedown/references ~/.claude/skills/shakedown/
```

Or grab just the core skill file (works but loses 11 reference checklists — shallower audit):

```shell
mkdir -p ~/.claude/skills/shakedown
curl -o ~/.claude/skills/shakedown/SKILL.md \
  https://raw.githubusercontent.com/belousov-petr/shakedown/master/SKILL.md
```

Restart Claude Code. It shows up as /shakedown.

Any skills-compatible agent (cross-client)

Use the .agents/skills/ path for compatibility with Cursor, Gemini CLI, Copilot, and 30+ other clients:

```shell
git clone https://github.com/belousov-petr/shakedown.git
mkdir -p ~/.agents/skills/shakedown
cp -r shakedown/SKILL.md shakedown/references ~/.agents/skills/shakedown/
```

Or install at project level (travels with the repo):

```shell
mkdir -p .agents/skills/shakedown
git clone https://github.com/belousov-petr/shakedown.git /tmp/shk
cp -r /tmp/shk/SKILL.md /tmp/shk/references .agents/skills/shakedown/
rm -rf /tmp/shk
```

How it works

Phase 1: Discover

Looks at the project before assuming anything. Maps the structure, checks git history, reads the README, detects if it's an agent skill (triggers extra checks in Phase 4), then asks you to confirm scope before burning tokens.

Phase 2: Read (parallel)

Sends 4 agents to read everything at once. One covers config and architecture, one covers execution logic, one reads outputs and docs, one counts files and checks data stores.

Phase 3: Diagnose

If there's a database, it connects and queries it. Checks table sizes, failure rates, data freshness. If there's no database, it skips this.
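As an illustration of what a Phase 3 freshness probe amounts to (the skill's actual queries live in references/db-diagnostics.md; the table and column names below are hypothetical, and the sketch uses SQLite as a stand-in for whatever database the project has):

```python
# Sketch of a Phase 3-style freshness check: flag tables whose newest
# row is older than a threshold. Assumes each table has a created_at
# column holding ISO-8601 timestamps -- an assumption, not the skill's rule.
import sqlite3
from datetime import datetime, timedelta, timezone

def stale_tables(con: sqlite3.Connection, tables: list[str],
                 max_age: timedelta) -> list[str]:
    """Return the tables whose most recent created_at is older than max_age."""
    now = datetime.now(timezone.utc)
    stale = []
    for t in tables:
        row = con.execute(f"SELECT MAX(created_at) FROM {t}").fetchone()
        newest = row[0] and datetime.fromisoformat(row[0])
        if not newest or now - newest > max_age:
            stale.append(t)  # empty table counts as stale too
    return stale
```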

Phase 4: Analyze

This is where the actual opinions come in. Architecture review, error handling audit, performance analysis, storage efficiency, resource waste. If the project is an agent skill, it also gets evaluated against the agentskills.io specification. Everything quantified where possible.

Phase 5: Assess

Security scan, PII check, documentation accuracy, whether the project actually does what it claims to do. Includes a value assessment — does this project solve a real problem, for a clear audience, with measurable value? Ends with the ranked recommendations and the uncomfortable question.

Phase 6: Test resilience

Checks whether backups actually restore (not just whether they exist). Traces what happens when components fail.
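In code terms, the difference between "backups exist" and "backups restore" looks roughly like this (a minimal sketch against SQLite; the skill adapts the check to whatever backup mechanism it actually finds):

```python
# Sketch: don't just stat() the backup file -- open a restored copy
# and run a real query against it.
import os
import shutil
import sqlite3
import tempfile

def backup_restores(backup_path: str) -> bool:
    """Return True only if the backup opens as a database with readable tables."""
    with tempfile.TemporaryDirectory() as tmp:
        restored = os.path.join(tmp, "restored.db")
        shutil.copy(backup_path, restored)  # simulate a restore
        con = sqlite3.connect(restored)
        try:
            tables = con.execute(
                "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
            return len(tables) > 0  # it must hold actual data
        except sqlite3.DatabaseError:
            return False            # file exists but isn't a valid database
        finally:
            con.close()
```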

Usage

Go to your project directory and run:

/shakedown

It also activates from natural language. These phrases trigger the full audit:

audit this project
find the weak spots
how solid is this project
how mature is this project
what would break first
where does this need tightening
assess technical debt
do a project health check

Claude Code matches your request against the skill's description — any phrasing around auditing, health checking, or stress-testing a whole project should trigger it. It won't activate for simple code reviews or single-file analysis.

The report covers

  1. What the project is (derived from reading, not assumed)
  2. What's genuinely good, with evidence
  3. What will break soon, with evidence
  4. Architecture problems, MECE gaps, contradictions, algorithm efficiency
  5. Error handling: crash paths, silent failures, edge cases, retry patterns
  6. Performance: bottlenecks, waste, cost
  7. Code and storage efficiency: empty files, duplicates, dead deps, bloat
  8. Agent skill compliance (if applicable): spec, description, instructions, evals
  9. Security: secrets, injection risks, PII, licensing
  10. Logging and monitoring gaps
  11. Documentation: accuracy, completeness, onboarding
  12. Goal fulfillment: stated vs actual behavior
  13. Blind spots nobody is watching
  14. Ratings (8 dimensions, scored 1-10)
  15. Overall score with justification
  16. Production readiness (10 gates; you call the ship/no-ship yourself)
  17. Top 10 ranked fixes with effort estimates
  18. Value assessment: problem clarity, audience, maturity, differentiation
  19. The uncomfortable question

Focused checks (without running the full audit)

The reference files in references/ are self-contained checklists. You can use them individually for targeted reviews without running the full 6-phase audit. Just point your agent at the specific file:

| If you want to check... | Use this file |
| --- | --- |
| Architecture, MECE gaps, algorithms, test coverage | references/architecture-quality.md |
| Error handling, crash paths, resilience | references/error-resilience.md |
| Performance, bottlenecks, scaling, cost | references/performance-analysis.md |
| File waste, duplicates, dead dependencies | references/storage-efficiency.md |
| Agent skill spec compliance | references/skill-standards.md |
| Security, secrets, injection, PII, licensing | references/security-checklist.md |
| Logging, docs quality, blind spots | references/operational-health.md |
| Problem clarity, audience, maturity, value | references/value-assessment.md |
| Database health, schema, freshness | references/db-diagnostics.md |
| Backup validation, disaster recovery | references/resilience-testing.md |

Example: "Review this project's security using the checklist in references/security-checklist.md"

How it's different from /ultrareview

Anthropic shipped /ultrareview in April 2026 alongside Opus 4.7. It's a PR-time bug hunter that runs in Anthropic's cloud, spawns multiple reviewer agents, and reproduces every finding before reporting. Worth using. But it's a different job from this skill.

| | /ultrareview | /shakedown |
| --- | --- | --- |
| Scope | a branch / PR diff | the whole project |
| Trigger | before merge | any time, not tied to a PR |
| Runtime | Anthropic's cloud sandbox | your local Claude Code |
| Output | reproduced bug list (logic, edge cases, security, perf) | 14-section report: architecture, security (OWASP depth), performance, resilience, docs, value assessment, a readiness checklist, the uncomfortable question |
| Cost | 3 free on Pro/Max, then $5-20 per run | free |
| Works on | git repos with diffs | any project, including non-git, side projects, skills, pipelines |

Use /ultrareview before merging a PR. Use /shakedown when you want to know what's weak across the whole project.

What to do with the report

The report is built so you can act on it immediately. Here's what you can say after the audit finishes:

| Say this | What happens |
| --- | --- |
| Fix them all | Starts implementing all recommendations in priority order |
| Fix the critical ones | Only tackles items rated Critical or FAIL |
| Explain recommendation #3 in detail | Deep-dive into a specific finding with implementation steps |
| Re-run Phase 5 only | Re-checks just one phase after you've made changes |
| Create GitHub issues for each recommendation | Turns findings into trackable issues |
| Prioritize for a solo developer | Filters by what a human needs to do vs what agents can handle |
| Compare with the last audit | If you've run it before, diffs the reports to show progress |

How it thinks

A few rules I keep coming back to when reviewing my own projects:

  1. Look at the project before making assumptions about it. Map first, read second.
  2. Check with the user before going deep. "This is what I found, this is what I'll audit - sound right?" saves everyone's time.
  3. Don't critique what you haven't read.
  4. Put numbers on things. "23% duplicate rate" is useful. "Some duplicates" is not.
  5. Compare what the docs say against what the code does. The gap between those two is where most problems hide.
  6. Every recommendation answers four things: what, why, how much work, who does it. Anything less is just complaining.
  7. The audit should be useful now, not next sprint. If you can say "fix them all" and start working immediately, it did its job.
  8. Test the safety nets. Backups exist? Restore one. Retry logic? Trace what happens when it fires. Don't report that something exists - report whether it works.
  9. Find what's wrong, not just what's right. The point is to make the project better, not to feel good about it.
  10. Surface the constraints nobody talks about: rate limits, daily budgets, peak hour pricing. Those shape what's actually possible more than architecture does.
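Rule 4 in practice: the kind of number the report cites can be computed mechanically rather than eyeballed. A sketch of one way to derive a duplicate rate (the skill's actual method isn't pinned to this):

```python
# Sketch: turn "some duplicates" into a percentage by hashing file contents.
# Counts every file beyond the first copy in each identical-content group.
import hashlib
from collections import Counter

def duplicate_rate(files: dict[str, bytes]) -> float:
    """Percentage of files whose content is an exact copy of another file."""
    hashes = Counter(hashlib.sha256(data).hexdigest()
                     for data in files.values())
    dupes = sum(n - 1 for n in hashes.values() if n > 1)
    return 100.0 * dupes / len(files) if files else 0.0
```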

Works on

| Project type | What gets checked |
| --- | --- |
| Agent systems (Paperclip, CrewAI, AutoGen) | Instructions, heartbeats, coordination, pipeline flow, signal quality |
| Web apps | Routes, API design, auth, DB schema, frontend/backend split |
| Data pipelines | Stage flow, data integrity, scheduling, error handling, throughput |
| CLI tools | Argument handling, error messages, edge cases, docs |
| Monorepos | Package boundaries, dependencies, build system, cross-package consistency |
| Microservices | Service boundaries, API contracts, resilience, observability |

Benchmark results

Tested on 3 real projects: a personal media content pipeline (Python, ~56 files, yt-dlp + Whisper + static HTML gallery), a 17-agent Claude-based intelligence pipeline (live Postgres, 3,344 LOC of agent instructions + ~1,500 LOC Python), and a browser-automation tool (~1,950 LOC JavaScript, 4 files). Each project was audited twice — once with the skill active, once with a bare "audit this project" prompt. Both runs used Claude Opus 4.7 for a fair comparison. Output was graded against scripts/validate-output.py (21 structural checks, plus 1 conditional Skill Standards check for audits that cover agent-skill projects).

| Project | With skill | Without skill | Delta |
| --- | --- | --- | --- |
| Personal media content pipeline (medium Python) | 22/22 (100%) | 0/21 (0%) | +100% |
| 17-agent intelligence pipeline (large) | 22/22 (100%) | 2/21 (10%) | +90% |
| Browser automation tool (small JS) | 22/22 (100%) | 1/21 (5%) | +95% |
| Average | 100% | 5% | +95% |

The bare prompts find real bugs. They just don't organize them. Without the skill — even on Opus 4.7 — no readiness table, no ratings, no ranked fix list, no uncomfortable question. The findings are in there somewhere, but you'd have to re-read everything to act on them. Full reports in examples/.
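For a sense of what a "structural check" means here, a minimal sketch of one (the section names below are illustrative, not the validator's actual list; the real 22 checks are in scripts/validate-output.py):

```python
# Sketch of a structural completeness check over an audit report:
# count how many required section markers actually appear.
# REQUIRED_SECTIONS is a hypothetical subset for illustration.
REQUIRED_SECTIONS = [
    "Ratings", "Production readiness", "Top 10", "Value assessment",
]

def structural_score(report: str) -> tuple[int, int]:
    """Return (sections found, sections required) for a report string."""
    text = report.lower()
    found = sum(1 for s in REQUIRED_SECTIONS if s.lower() in text)
    return found, len(REQUIRED_SECTIONS)
```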

Token usage

The skill itself costs ~3,500 tokens to load (the SKILL.md file). Measured token usage across the 3 reference audits above (Opus 4.7, 1M context):

| Project type | Files read | Tokens | Duration | Example |
| --- | --- | --- | --- | --- |
| Small CLI/browser tool (~2K LOC) | 15 | ~95K | ~15 min | benchmark-audit-c.md |
| Medium personal pipeline (~56 files + JSONL manifests) | 32 | ~165K | ~45 min | benchmark-audit-a.md |
| Large agent system (17 agents + live Postgres + 4K LOC) | 34 | ~125K | ~32 min | benchmark-audit-b.md |

Token usage scales roughly with source-code size + database query count, not file count — the large agent system had fewer source files than the medium pipeline but hit a live Postgres with 68 tables during Phase 3.

Project structure

shakedown/
├── SKILL.md                              # The skill - orchestrator (479 lines)
├── references/
│   ├── architecture-quality.md           # Section 4.4: structure, MECE, algorithms, tests
│   ├── error-resilience.md               # Section 4.5: crashes, timeouts, edge cases
│   ├── performance-analysis.md           # Section 4.6: timing, scaling, cost
│   ├── storage-efficiency.md             # Section 4.7: empty files, duplicates, bloat
│   ├── skill-standards.md                # Section 4.8: agentskills.io compliance
│   ├── db-diagnostics.md                 # Phase 3: database-specific queries
│   ├── security-checklist.md             # Section 5.1: OWASP-derived — secrets, injection, PII, licensing, LLM/Agentic/MCP risks
│   ├── operational-health.md             # Sections 5.2/5.3/5.5: logging, docs, blind spots
│   ├── value-assessment.md               # Section 5.10: problem, audience, maturity
│   ├── resilience-testing.md             # Phase 6: backup and resilience tests
│   └── gotchas.md                        # 10 common agent audit mistakes
├── examples/
│   ├── benchmark-audit-a.md              # With-skill audit: content pipeline
│   ├── benchmark-audit-b.md              # With-skill audit: agent pipeline
│   ├── benchmark-audit-c.md              # With-skill audit: browser tool
│   ├── baseline-audit-a.md               # Without-skill baseline: content pipeline
│   ├── baseline-audit-b.md               # Without-skill baseline: agent pipeline
│   └── baseline-audit-c.md               # Without-skill baseline: browser tool
├── evals/
│   ├── evals.json                        # 7 test cases with 51 assertions
│   └── README.md                         # How to run and grade evals
├── scripts/
│   └── validate-output.py                # 22-check output completeness validator
├── README.md
└── LICENSE                               # MIT

The main SKILL.md is a clean orchestrator — phases, flow, output templates. All detailed checklists live in references/ (11 files, 3,202 lines) and are loaded on demand. This keeps activation cost low while preserving depth. Each reference file includes scope boundary notes to prevent overlap.

A few honest things

Before you use this on anything serious.

  • LLM audits spot patterns, not truth. What comes back is a pile of "this looks suspicious" — still on you to read the code, check whether it's actually a problem, and figure out the fix. Especially for security, or anything users actually touch. If a finding looks important, don't take my word for it. Read the code.

  • Don't mistake this for due diligence. Compliance, audit trails, threat models, multiple humans signing off — none of that is here. What you get is a first pass from one LLM. Useful, not definitive.

  • The fit is small stuff you care about but haven't cleaned up. Solo tools, prototypes, side experiments, things that have been sitting untouched for months. The idea is to un-slop them before the mess starts dictating decisions, and to walk away with a shortlist you can actually work through. For anything with real stakes, pay a human to look.

Contributing

If you've run this and found gaps, I'd like to hear about it. Open an issue or PR with:

  1. What kind of project you audited
  2. What the skill should have checked but didn't
  3. What you'd add to fill that gap

License

MIT. Use it, fork it, ship it — credit appreciated but not required.

Acknowledgments

Claude Code wrote this. I designed the audit flow and check structure, decided what gets audited and why, and directed the work. Claude did the typing - skill, references, tests, docs. For security and skill-compliance specifically, I reused two established sources instead of inventing my own:

  • Agent Skills specification. The skill is built to conform to it. Run npx skills-ref validate ./shakedown to check your install (frontmatter, naming, directory structure).
  • OWASP GenAI Security Project. The security checks, questions, and red-team procedures in references/security-checklist.md come straight from OWASP's GenAI publications - LLM Top 10, Agentic Top 10, Secure MCP, GenAI Data Security, Governance Checklist, Red Teaming Guide. If your project touches LLMs or agents, those documents are worth reading directly.

Author

Petr Belousov
