why agent guess when agent can know
Install • Before/After • How It Works • Quick Start • Parallel Execution • Codex Review • Commands • Examples
Part of the Caveman ecosystem
A Claude Code plugin that turns natural language into specs, specs into parallel build plans, and build plans into working software — with automated iteration, validation, and dual-model adversarial review.
You describe what you want. Cavekit writes the contract. Agents build from the contract. Every line of code traces to a requirement. Every requirement has acceptance criteria. Nothing gets lost, nothing gets guessed.
| Before | After |
|---|---|
| One shot. No validation. No traceability. The agent guessed what you wanted. | Every requirement traced. Every criterion checked. |

Same feature. Zero guesswork. Full traceability.
AI coding agents are powerful, but they fail the same way every time:
| Failure | What Happens |
|---|---|
| Context loss | Agent forgets what it said three steps ago |
| No validation | Code written, never verified against intent |
| No parallelism | One agent, one task, one branch — even when work is independent |
| No iteration | Single pass produces a rough draft, not production code |
Cavekit fixes all four.
Instead of "prompt and pray," Cavekit puts a specification layer between your intent and the code.
┌─── Task 1 ─── Agent A ───┐
│ │
You ── /ck:sketch ──► Kits ── /ck:map ──► Build Site ──┤─── Task 2 ─── Agent B ───┤──► done
│ │
└─── Task 3 ─── Agent C ───┘
Kits are the source of truth. Agents read them, build from them, validate against them. When something breaks, the system traces the failure back to the kit — not the code.
Spec is the product. Code is the derivative.
```sh
git clone https://github.com/JuliusBrussee/cavekit.git ~/.cavekit
cd ~/.cavekit && ./install.sh
```

Registers the plugin with Claude Code, syncs into the Codex marketplace, and installs the `cavekit` CLI. Restart Claude Code after installing.
Requires: Claude Code, git, macOS/Linux.
Optional: Codex (npm install -g @openai/codex) — adds adversarial review. Cavekit works without it. Codex makes it significantly harder to ship flawed specs and broken code.
Four phases. Each one a slash command.
| Phase | What happens | Produces | Codex |
|---|---|---|---|
| RESEARCH (optional) | Multi-agent codebase + web research | Research brief | — |
| DRAFT | "What are we building?" | Kits with R-numbered requirements | Challenges the design |
| ARCHITECT | Break into tasks, map dependencies, organize into tiered build site + dependency graph | Task graph | — |
| BUILD | Auto-parallel: `/ck:make` groups work into adaptive subagent packets, tier by tier | — | Reviews every tier gate |
| INSPECT | Gap analysis: built vs. intended. Peer review. Trace to specs. | Findings report | — |
/ck:research "build a C+ compiler"
Dispatches 2–8 parallel subagents to explore the codebase and search the web for best practices, library landscape, reference implementations, and common pitfalls. A synthesizer agent cross-validates findings and produces a research brief in context/refs/.
/ck:design
Creates or imports a DESIGN.md design system — a cross-cutting constraint layer enforced across the entire pipeline. Every kit references its design tokens, every task carries a Design Ref, every build result is audited for violations.
| Sub-command | What it does |
|---|---|
| `/ck:design create` | Generate new DESIGN.md via guided Q&A |
| `/ck:design import` | Extract DESIGN.md from existing codebase |
| `/ck:design audit` | Check implementation against DESIGN.md |
| `/ck:design update` | Revise DESIGN.md, log to changelog |
/ck:sketch
Describe what you're building in natural language. Cavekit decomposes it into domain kits — structured documents with numbered requirements (R1, R2, ...) and testable acceptance criteria. Stack-independent. Human-readable.
After internal review, kits go to Codex for a design challenge — adversarial review that catches decomposition flaws, missing requirements, and ambiguous criteria before any code is written.
For existing codebases: /ck:sketch --from-code reverse-engineers kits from your code and identifies gaps.
/ck:map
Reads all kits. Breaks requirements into tasks. Maps dependencies. Organizes into a tiered build site — a dependency graph where Tier 0 has no deps, Tier 1 depends only on Tier 0, and so on. Includes a Coverage Matrix mapping every acceptance criterion to its task(s). Nothing specified gets lost in translation.
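The tiering described above is plain topological layering over the task dependency graph: a task's tier is one more than the highest tier among its dependencies. A minimal sketch of the idea (hypothetical data shapes, not Cavekit's internals):

```python
# Hypothetical sketch of tiered layering. Tasks with no dependencies land
# in Tier 0; every other task sits one tier above its deepest dependency.
def assign_tiers(deps: dict[str, set[str]]) -> dict[str, int]:
    tiers: dict[str, int] = {}
    remaining = set(deps)
    while remaining:
        placed = set(tiers)
        # A task is placeable once every dependency already has a tier.
        ready = {t for t in remaining if deps[t] <= placed}
        if not ready:
            raise ValueError("dependency cycle in build site")
        for t in ready:
            tiers[t] = max((tiers[d] for d in deps[t]), default=-1) + 1
        remaining -= ready
    return tiers
```

With three independent tasks and two dependents, this yields the Tier 0 / Tier 1 split shown in the parallel execution example later in this README.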
/ck:make
A pre-flight coverage check confirms every acceptance criterion is covered by a task. Then the loop runs:
┌──────────────────────────────────────────────────────┐
│ │
│ Read build site → Find next unblocked task │
│ │ │
│ ▼ │
│ Load relevant kit + acceptance criteria │
│ │ │
│ ▼ │
│ Implement the task │
│ │ │
│ ▼ │
│ Validate (build + tests + acceptance criteria) │
│ │ │
│ ├── PASS → commit → mark done → next ──┐ │
│ │ │ │
│ └── FAIL → diagnose → fix → revalidate │ │
│ │ │
│ ◄────────────────────────────────────────────┘ │
│ │
│ Loop until: all tasks done OR limit reached │
└──────────────────────────────────────────────────────┘
At every tier boundary, Codex adversarial review gates advancement. P0/P1 findings must be fixed before the next tier starts. With speculative review (default), this adds near-zero latency.
Post-flight verification cross-references what was built against original kits. Gaps get remediation tasks.
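The diagram above can be condensed into pseudocode (hypothetical names and shapes; the real loop runs inside the agent, not in a script):

```python
# Hypothetical sketch of the /ck:make loop: pick the next unblocked task,
# implement, validate against acceptance criteria, retry on failure,
# stop when all tasks are done or the iteration limit is reached.
def build_loop(tasks, implement, validate, max_iterations=20):
    iterations = 0
    done: set[str] = set()
    while len(done) < len(tasks) and iterations < max_iterations:
        iterations += 1
        task = next((t for t in tasks
                     if t.id not in done and t.deps <= done), None)
        if task is None:
            break  # everything remaining is blocked
        implement(task)
        while not validate(task):       # FAIL -> diagnose, fix, revalidate
            implement(task)
            iterations += 1
            if iterations >= max_iterations:
                return done, iterations
        done.add(task.id)               # PASS -> commit, mark done, next
    return done, iterations
```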
/ck:check
Gap analysis: built vs. specified. Peer review: bugs, security, missed requirements. Everything traced back to kit requirements.
Greenfield:
> /ck:sketch
What are you building?
> A REST API for task management. Users, projects, tasks
with priorities and due dates. PostgreSQL.
Created 4 kits (22 requirements, 69 acceptance criteria)
Next: /ck:map
> /ck:map
Generated build site: 34 tasks, 5 tiers
Next: /ck:make
> /ck:make
Loop activated — 34 tasks, 20 max iterations.
...
All tasks done. Build passes. Tests pass.
CAVEKIT COMPLETE — 34 tasks in 18 iterations.
Existing codebase:
> /ck:sketch --from-code
Exploring codebase... Next.js 14, Prisma, NextAuth.
Created 6 kits — 4 requirements are gaps (not yet implemented).
> /ck:map --filter collaboration
Generated build site: 8 tasks, 3 tiers
> /ck:make
CAVEKIT COMPLETE — 8 tasks in 8 iterations.
See example.md for full annotated sessions.
/ck:make parallelizes automatically. Multiple ready tasks get grouped into coherent work packets and dispatched concurrently.
═══ Wave 1 ═══
3 task(s) ready:
T-001: Database schema (tier 0, deps: none)
T-002: Auth middleware (tier 0, deps: none)
T-003: Config loader (tier 0, deps: none)
Dispatching 2 grouped subagents...
All 3 tasks complete. Merging...
═══ Wave 2 ═══
2 task(s) ready:
T-004: User endpoints (tier 1, deps: T-001, T-002)
T-005: Health check (tier 1, deps: T-003)
Dispatching 2 grouped subagents...
All done.
═══ BUILD COMPLETE ═══
Waves: 2 | Tasks: 5/5
| Step | What happens |
|---|---|
| Compute frontier | Find all tasks whose dependencies are complete |
| Group | Bundle frontier into work packets by shared files, subsystem, task size |
| Dispatch | Run packets as parallel subagents |
| Merge | Collect results, compute next frontier |
| Repeat | Wave-by-wave until all tasks done — no manual intervention |
Circuit breakers prevent infinite loops: 3 test failures → task BLOCKED, all blocked → stop and report.
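The frontier computation in the table above is simple set arithmetic. A sketch (hypothetical data shapes, not Cavekit's actual state format):

```python
# Hypothetical frontier/wave computation: a task is ready when all of its
# dependencies are complete; waves repeat until done or everything is blocked.
def waves(deps: dict[str, set[str]]):
    done: set[str] = set()
    while len(done) < len(deps):
        frontier = [t for t in deps if t not in done and deps[t] <= done]
        if not frontier:
            break  # all remaining tasks blocked -> stop and report
        yield sorted(frontier)
        done.update(frontier)
```

Applied to the five-task example above, this reproduces the two waves: T-001/T-002/T-003 first, then T-004/T-005.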
Cavekit uses Codex as an adversarial reviewer — a second model with different training and different blind spots. Catches things Claude cannot see in its own output. Operates at three levels:
After kits are drafted and internally reviewed, the full set goes to Codex:
Claude drafts kits ──► kit set reviewer approves ──► Codex challenges the design ──► user reviews kits + findings
| Finding type | Behavior |
|---|---|
| Critical | Must fix before building. Auto-fix loop, up to 2 cycles |
| Advisory | Presented alongside kits at user review gate |
No implementation feedback allowed. No framework suggestions. Only design-level concerns that would cause real problems during build.
Every completed tier triggers a Codex code review before advancing:
═══ Tier 0 Complete ═══
Codex reviews diff (T-001, T-002, T-003) ...
Review: 2 findings (1 P0, 1 P3)
Gate: BLOCKED → fix cycle 1/2
Fixing P0: nil pointer in auth middleware ...
Re-review ...
Gate: PROCEED
═══ Tier 1 starting ═══
| Severity | Behavior |
|---|---|
| P0 (critical) | Blocks advancement. Auto-generates fix task |
| P1 (high) | Blocks advancement. Auto-generates fix task |
| P2 (medium) | Logged, does not block |
| P3 (low) | Logged, does not block |
Gate modes: severity (default — P0/P1 block), strict (all block), permissive (nothing blocks), off.
Fix cycle runs up to 2 iterations per tier. After that, advances with warning. Never deadlocks.
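The gate decision itself reduces to a severity filter per mode. A sketch (hypothetical function, assuming the four modes and severities listed above):

```python
# Hypothetical tier-gate check: which finding severities block advancement
# under each gate mode. P0/P1 block in the default "severity" mode.
BLOCKING = {
    "severity":   {"P0", "P1"},
    "strict":     {"P0", "P1", "P2", "P3"},
    "permissive": set(),
}

def gate(findings: list[str], mode: str = "severity") -> str:
    if mode == "off":
        return "PROCEED"  # review skipped entirely
    blocked = [f for f in findings if f in BLOCKING[mode]]
    return "BLOCKED" if blocked else "PROCEED"
```

Under the default mode, the Tier 0 example above (one P0, one P3) blocks, triggering the fix cycle; two P2/P3 findings would be logged and let the build advance.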
Codex reviews the previous tier in the background while Claude builds the current tier:
Tier 0 complete ───────────────────────────► Tier 1 complete
│ │
└── Codex reviews Tier 0 (background) ──────►│
│
Results ready ◄───────────┘
before gate runs
Results are already available when the gate checks. Near-zero latency. Falls back to synchronous if needed.
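Conceptually this is a future started when a tier completes and awaited at the gate. A sketch using a thread pool (hypothetical function names; it assumes tier 0 is already built when the loop starts):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Hypothetical speculative review: start reviewing tier N in the background,
# build tier N+1 meanwhile, then collect the (usually finished) verdict.
def build_with_speculative_review(review_tier, build_tier, tiers, timeout=300):
    with ThreadPoolExecutor(max_workers=1) as pool:
        for n in range(len(tiers) - 1):
            future = pool.submit(review_tier, tiers[n])   # background review
            build_tier(tiers[n + 1])                      # build next tier now
            try:
                verdict = future.result(timeout=timeout)  # usually already done
            except FutureTimeout:
                verdict = review_tier(tiers[n])           # synchronous fallback
            if verdict == "BLOCKED":
                return n  # stop: fix cycle needed before advancing
    return None
```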
PreToolUse hook intercepts every Bash command before execution:
Agent runs command
│
▼
Fast-path check ──► allowlist (50+ safe commands) → approve
│ └► blocklist (rm -rf, force push, DROP TABLE) → block
│
▼ (ambiguous)
Codex classifies ──► safe / warn / block
│
▼ (cached)
Verdict cache ──► normalized pattern → reuse verdict
Integrates with Claude Code's permission system. Verdicts are cached per session. Falls back to static rules when Codex is unavailable; a command is never blocked solely because the classifier is unreachable.
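The fast path is ordinary pattern matching before any model call. A sketch (hypothetical lists and a stand-in verdict for the Codex step; the real allowlist has 50+ entries):

```python
import re

# Hypothetical fast-path command gate: allowlist and blocklist are checked
# before falling through to (cached) classification of ambiguous commands.
ALLOW = {"ls", "cat", "git status", "git diff", "pytest"}
BLOCK = [r"\brm\s+-rf\b", r"git\s+push\s+--force", r"\bDROP\s+TABLE\b"]
_cache: dict[str, str] = {}

def classify(cmd: str) -> str:
    norm = " ".join(cmd.split())        # normalize whitespace for caching
    if norm in _cache:
        return _cache[norm]             # verdict cache hit -> reuse
    if any(re.search(p, norm, re.I) for p in BLOCK):
        verdict = "block"
    elif norm in ALLOW or norm.split()[0] in ALLOW:
        verdict = "safe"
    else:
        verdict = "warn"  # stand-in for the Codex call / static fallback
    _cache[norm] = verdict
    return verdict
```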
All Codex features are additive. Without Codex installed:
| Feature | Fallback |
|---|---|
| Design challenge | Skipped — internal reviewer still runs |
| Tier gate | Skipped — build proceeds without review pauses |
| Command gate | Static allowlist/blocklist only |
Cavekit works the same. Codex makes it harder to ship bad specs and bad code.
Settings live in two places:
| Location | Scope |
|---|---|
| Location | Scope |
|---|---|
| `~/.cavekit/config` | User default |
| `.cavekit/config` | Project override (takes precedence) |
| Setting | Values | Default | Purpose |
|---|---|---|---|
| `bp_model_preset` | `expensive` `quality` `balanced` `fast` | `quality` | Model selection for Cavekit commands |
| `codex_review` | `auto` `off` | `auto` | Enable/disable Codex reviews |
| `codex_model` | model string | (Codex default) | Model for Codex calls |
| `tier_gate_mode` | `severity` `strict` `permissive` `off` | `severity` | How findings gate tier advancement |
| `command_gate` | `all` `interactive` `off` | `all` | Which sessions get command gating |
| `command_gate_timeout` | milliseconds | `3000` | Codex safety classification timeout |
| `speculative_review` | `on` `off` | `on` | Background review of previous tier |
| `speculative_review_timeout` | seconds | `300` | Max wait for speculative results |
| `caveman_mode` | `on` `off` | `on` | Token-compressed output (~75% savings) |
| `caveman_phases` | comma-separated | `build,inspect` | Which phases use caveman-speak |
Model presets:
| Preset | Reasoning | Execution | Exploration |
|---|---|---|---|
| `expensive` | opus | opus | opus |
| `quality` | opus | opus | sonnet |
| `balanced` | opus | sonnet | haiku |
| `fast` | sonnet | sonnet | haiku |
/ck:config # show current
/ck:config preset balanced # change preset
/ck:config preset fast --global # change default
| Command | Phase | What it does |
|---|---|---|
| `/ck:research` | Research | Multi-agent codebase + web research, produces brief |
| `/ck:design` | Design | Create, import, audit, or update DESIGN.md |
| `/ck:sketch` | Draft | Decompose requirements into domain kits |
| `/ck:map` | Architect | Generate tiered build site from kits |
| `/ck:make` | Build | Auto-parallel build with validation loop |
| `/ck:check` | Inspect | Gap analysis + peer review against kits |
| `/ck:config` | — | Show or update execution preset |
| `/ck:judge` | — | Standalone Codex adversarial review on diff |
| `/ck:progress` | — | Check build site progress |
| `/ck:scan` | — | Compare built vs. intended |
| `/ck:revise` | — | Trace manual fixes back into kits |
| `/ck:help` | — | Usage guide |
| Command | What it does |
|---|---|
| `cavekit monitor` | Interactive launcher — pick build sites, launch in tmux |
| `cavekit status` | Show build site progress |
| `cavekit kill` | Stop all sessions, clean up worktrees |
| `cavekit version` | Print version |
| `cavekit debug` | Show state file path and version |
| `cavekit reset` | Clear persisted state |
context/
├── kits/ # Domain kits (persist across cycles)
│ ├── kit-overview.md
│ └── kit-{domain}.md
├── designs/ # Design system artifacts
│ ├── DESIGN.md
│ └── design-changelog.md
├── sites/ # Build sites (one per plan)
│ └── build-site-*.md
├── impl/ # Implementation tracking
│ ├── impl-{domain}.md
│ ├── impl-review-findings.md
│ ├── impl-speculative-log.md
│ └── loop-log.md
└── refs/ # Research briefs + raw findings
Cavekit applies the scientific method to AI-generated code. LLMs are non-deterministic. Software engineering doesn't have to be.
| Concept | Role |
|---|---|
| Kits | The hypothesis — what you expect the software to do |
| Validation gates | Controlled conditions — build, tests, acceptance criteria |
| Convergence loops | Repeated trials — iterate until stable |
| Implementation tracking | Lab notebook — what was tried, what worked, what failed |
| Revision | Update the hypothesis — trace bugs back to kits |
Ships with 9 specialized agents (including design-reviewer for UI validation against DESIGN.md), a multi-agent research system, and 16 skills covering the full methodology. With Codex, operates as a dual-model architecture — Claude builds, Codex reviews — catching errors single-model self-review cannot.
All 16 skills
| Skill | What it covers |
|---|---|
| Design System | Create and maintain DESIGN.md |
| UI Craft | Component patterns, animation, accessibility, review checklist |
| Cavekit Writing | Write kits agents can consume |
| Convergence Monitoring | Detect when iterations plateau |
| Peer Review | Six modes for cross-model review |
| Validation-First Design | Every requirement must be verifiable |
| Context Architecture | Progressive disclosure for agent context |
| Revision | Trace bugs upstream to kits |
| Brownfield Adoption | Add Cavekit to existing codebases |
| Speculative Pipeline | Overlap phases for faster builds |
| Prompt Pipeline | Design the prompts driving each phase |
| Implementation Tracking | Living records of build progress |
| Documentation Inversion | Docs for agents, not just humans |
| Peer Review Loop | Combine build loop with cross-model review |
| Core Methodology | The full Hunt lifecycle |
| Caveman | Token-compressed output (~75% savings), built-in for build/inspect phases |
Most AI coding tools treat the agent as a black box. Prompt, generate, hope. Cavekit inverts this.
The spec is the product. The code is a derivative.
When the spec is clear, the code follows. When the code is wrong, the spec tells you why. Without a specification, there's nothing to validate against. Cavekit gives every agent — current and future — a contract to build from and a standard to meet.
Two models disagreeing is a signal. Two models agreeing is confidence.
If cavekit save you mass debug time — leave star.
- Caveman — Claude Code skill that cuts ~75% of output tokens. Same accuracy, way less fluff. Bundled in Cavekit and enabled by default for build/inspect phases. Standalone install: `npx skills add JuliusBrussee/caveman`
- Revu — local-first macOS study app with FSRS spaced repetition, decks, exams, and study guides. revu.cards
MIT
