Releases: Whisker17/claude-codex-loop
v2.3.1
Changes
- Add root `.claude-plugin/plugin.json` manifest required for marketplace submission
- Sync plugin version to 2.3.1 across all manifests
- Fix old `plugins/review-loop/` path references in design specs
- Remove external tooling artifacts (`docs/superpowers/plans/`)
- Update README with version history and marketplace info
v2.3.0: Autooptimize Skill & Test Harness
What's New
Autooptimize Skill (.claude/skills/autooptimize/)
Generalized version of Karpathy's autoresearch methodology for optimizing any project artifact — prompts, configs, orchestration logic, or code. Runs autonomous experiment loops: execute → eval → mutate → keep/discard.
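The execute → eval → mutate → keep/discard cycle can be sketched as a simple hill-climbing loop. This is a minimal illustration, not the skill's actual implementation: the `optimize` function, its parameters, and the toy mutate/eval helpers are all hypothetical names chosen for the example.

```python
import random
from typing import Callable, Tuple

def optimize(
    artifact: str,
    mutate: Callable[[str, random.Random], str],
    evaluate: Callable[[str], float],
    iterations: int = 20,
    seed: int = 0,
) -> Tuple[str, float]:
    """Keep a mutation only when it scores strictly higher than the current best."""
    rng = random.Random(seed)
    best, best_score = artifact, evaluate(artifact)  # execute + eval the baseline
    for _ in range(iterations):
        candidate = mutate(best, rng)                # mutate
        score = evaluate(candidate)                  # execute + eval the candidate
        if score > best_score:                       # keep
            best, best_score = candidate, score
        # otherwise the candidate is discarded
    return best, best_score

# Toy usage (hypothetical): append instructions to a prompt and
# score it by how many checklist terms it covers.
CHECKLIST = ("edge cases", "error handling", "tests")

def toy_mutate(text: str, rng: random.Random) -> str:
    return text + " Check " + rng.choice(CHECKLIST) + "."

def toy_eval(text: str) -> float:
    return sum(term in text for term in CHECKLIST) / len(CHECKLIST)

best, score = optimize("Review this code.", toy_mutate, toy_eval)
```

A real run would replace `toy_eval` with an actual execution plus scoring step; the keep/discard rule is the only part this sketch takes from the description above.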
Test Harness (autooptimize-harness/)
Comprehensive test infrastructure for measuring review-loop plugin effectiveness:
- Single-prompt testing: Test individual prompts (design-review, code-review, code-implement) via `codex exec` with fixtures containing known defects. ~1-5 min per run.
- End-to-end pipeline testing: Run the full review-loop via `claude -p` in isolated temp dirs, then execute produced code and tests automatically.
- LLM-as-judge eval: Binary pass/fail scoring using Claude as evaluator.
- 3 test scenarios: CLI calculator (simple), rate limiter (medium), KV store (complex).
- Fixtures: Design docs, code samples, and specs with known subtle defects for evaluating review quality.
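As a rough sketch of how binary judging against known defects could be turned into a detection rate: the `Fixture` type, the judge signature, and the keyword-based stand-in judge below are assumptions for illustration. The harness itself uses Claude as the evaluator, and its real data layout is not described here.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Fixture:
    name: str                  # hypothetical fixture identifier
    known_defects: List[str]   # short descriptions of the planted defects

# Assumed judge signature: (review_text, defect_description) -> True if the review caught it.
Judge = Callable[[str, str], bool]

def detection_rate(review_text: str, fixture: Fixture, judge: Judge) -> float:
    """Fraction of planted defects the judge marks as detected (binary per defect)."""
    if not fixture.known_defects:
        raise ValueError("fixture has no known defects to score against")
    hits = sum(1 for defect in fixture.known_defects if judge(review_text, defect))
    return hits / len(fixture.known_defects)

def keyword_judge(review_text: str, defect: str) -> bool:
    """Stand-in for the LLM judge: passes if the defect phrase appears in the review."""
    return defect.lower() in review_text.lower()

fixture = Fixture("rate-limiter-design", ["race condition", "clock skew"])
review = "Found a race condition in the token refill path."
rate = detection_rate(review, fixture, keyword_judge)  # one of two defects detected
```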
Baseline Evaluation Results
| Scenario | Duration | Design Rounds | Code Rounds | Tests | Status |
|---|---|---|---|---|---|
| CLI calculator | 6 min | 1 | 1 | 22/22 | PASS |
| Rate limiter | 95 min | 6 | 2 | 32/32 | PASS |
| KV store | ~10 hr | 6 | 2 | 88/88 | PASS |
Key findings:
- All review prompts score 100% on known-issue detection — prompt quality is already strong
- E2E pipeline produces correct, tested code across all complexity levels (142 total tests passing)
- Design stage convergence speed is the primary optimization target for complex tasks
review-loop v2.2.0: Independent Validation Round
What's New
Independent Validation Round
After the existing Claude-Codex review loop converges (or exhausts its 5 rounds), a fresh Codex instance with zero shared review history re-examines the output using blind-spot-focused prompts. If issues are found, they're fed back into the regular loop for fix + re-validation (max 2 cycles per stage).
This addresses the false convergence problem: Claude and Codex agree there are no remaining issues, but independent review still finds entirely new categories of problems — shared blind spots from operating under the same prompt framework.
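The control flow described above might look roughly like the following. The round and cycle limits come from these notes; everything else (function names, the callback signatures, and the choice not to re-validate after the final cycle's fixes) is an assumption made for the sketch.

```python
from typing import Callable, List

MAX_REVIEW_ROUNDS = 5      # regular loop limit, from the release notes
MAX_VALIDATION_CYCLES = 2  # validation limit per stage, from the release notes

def run_stage(
    review_round: Callable[[List[str]], List[str]],
    independent_review: Callable[[], List[str]],
) -> dict:
    """review_round(pending) returns issues still open after a review + fix round;
    independent_review() returns issues found by a reviewer with no shared history."""
    log: List[str] = []

    def regular_loop(pending: List[str]) -> List[str]:
        for round_no in range(1, MAX_REVIEW_ROUNDS + 1):
            pending = review_round(pending)
            log.append(f"round {round_no}: {len(pending)} open")
            if not pending:          # converged
                break
        return pending

    open_issues = regular_loop([])
    validated = False
    for cycle in range(1, MAX_VALIDATION_CYCLES + 1):
        found = independent_review()
        log.append(f"validation c{cycle}: {len(found)} found")
        if not found:
            validated = True
            break
        open_issues = regular_loop(found)  # feed findings back into the regular loop
    return {"validated": validated, "open_issues": open_issues, "log": log}

# Stubbed usage: the first validation finds one issue, the second is clean.
findings = iter([["missing timeout handling"], []])
result = run_stage(
    review_round=lambda pending: [],       # stub: every round resolves everything
    independent_review=lambda: next(findings),
)
```

Note that `validated` stays false if the second cycle still finds issues, so a caller cannot mistake an exhausted validation budget for a clean pass.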
New Modes
| Mode | Role | Purpose |
|---|---|---|
| `independent-design-review` | Codex (READ-ONLY) | Blind-spot review of design docs |
| `independent-code-review` | Codex (READ-ONLY) | Blind-spot review of code changes |
| `validation-design-fix` | Codex (READ-ONLY) | Review design fixes after validation |
| `validation-fix` | Codex (implementer) | Fix code issues found by validation |
Key Details
- Context isolation: prompt-assembly layer injects zero review history; prompt-instruction layer explicitly prohibits reading `specs/reviews/` and `.claude/`
- Composite round tokens: `c1`/`c2` for validation cycles, `c1f1`/`c2f2` for fix rounds, so artifact names never collide
- Failure handling: retry once on timeout/failure, skip + log on double failure, surface in final output (never falsely claim validation passed)
- Ordering: validation runs before verify round (verify remains the terminal pass)
- Role separation: AGENTS.md distinguishes independent reviewers, design fix reviewers, and code fixers with distinct constraints
- 20 new tests (37 total), all passing
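Two of the details above lend themselves to a short sketch: parsing composite round tokens, and the retry-once-then-skip failure policy. The token grammar here (`c<cycle>` with an optional `f<fix>` suffix, cycle limited to 1 or 2) is inferred from the examples and is an assumption; the actual validation lives in `common.sh` and may differ. Function names are hypothetical.

```python
import re
from typing import Callable, List, Optional, Tuple

# Assumed shape: "c<cycle>" for a validation cycle, "c<cycle>f<fix>" for a fix round.
ROUND_TOKEN = re.compile(r"c([12])(?:f([1-9][0-9]*))?")

def parse_round_token(token: str) -> Optional[Tuple[int, Optional[int]]]:
    """Return (cycle, fix_round) or None if the token is not a composite round token."""
    m = ROUND_TOKEN.fullmatch(token)
    if m is None:
        return None
    cycle = int(m.group(1))
    fix = int(m.group(2)) if m.group(2) else None
    return cycle, fix

def run_validation(step: Callable[[], bool], log: List[str]) -> str:
    """Retry once; after two failures, skip and record it so the final output can report it.

    step() returns True when the validation run itself completed (assumed convention)."""
    for attempt in (1, 2):
        try:
            if step():
                return "completed"
        except TimeoutError:
            pass
        log.append(f"validation attempt {attempt} failed")
    log.append("validation skipped after double failure")
    return "skipped"

log: List[str] = []
status = run_validation(lambda: False, log)  # a step that fails both times
```

Returning a distinct `"skipped"` status, rather than a boolean, keeps a skipped validation from being reported as a pass.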
File Changes
- 4 new prompt templates in `review-loop/prompts/`
- Modified: `common.sh` (4 new modes, composite round validation, `append_diff_section` with temp git index), `review-loop.md` (validation + verify + output flows), `AGENTS.md` (role separation), `tests/review-loop.test.sh`
- No changes to `run-review-bg.sh`, `check-review.sh`, `kill-review.sh`, hooks, or state file schema
See specs/design.md for the full implementation design.
v2.1.0
review-loop v2.1.0
Addresses two production issues discovered in v2:
Fresh Independent Reviews
- Every review round now mandates a full audit of the entire document/diff
- Verify rounds strip all prior review context and prepend a "FULL INDEPENDENT REVIEW" header
- Eliminates attention narrowing where successive rounds focused only on previously found issues
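A minimal sketch of the verify-round behavior, assuming prompts are assembled from a template, the document, and optional prior review notes. Only the header text is taken from these notes; the function name and the section headings inside the prompt are hypothetical.

```python
from typing import List

FULL_REVIEW_HEADER = "FULL INDEPENDENT REVIEW"  # header text quoted in the release notes

def build_review_prompt(
    template: str, document: str, prior_reviews: List[str], verify: bool
) -> str:
    """Assemble a reviewer prompt; verify rounds drop prior review context entirely."""
    parts = []
    if verify:
        parts.append(FULL_REVIEW_HEADER)
    parts.append(template)
    parts.append("## Document under review\n" + document)  # hypothetical section title
    if not verify and prior_reviews:
        parts.append("## Prior review notes\n" + "\n".join(prior_reviews))
    return "\n\n".join(parts)

verify_prompt = build_review_prompt(
    "Audit the entire document.", "design text", ["round 1: fix naming"], verify=True
)
regular_prompt = build_review_prompt(
    "Audit the entire document.", "design text", ["round 1: fix naming"], verify=False
)
```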
Optional Brainstorming Stage
- If `superpowers:brainstorming` is available, user is asked whether to brainstorm before designing
- Output saved to `specs/brainstorm.md` as supplementary context (task description stays authoritative)
- Session-scoped via `brainstorm_done` flag, so stale artifacts from previous sessions are ignored
- Brainstorming suppressed in all subsequent Codex prompts
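The session scoping can be illustrated with a small gate on the `brainstorm_done` flag. This sketch assumes the session state is available as a dictionary and the brainstorm file's contents as a string; the real state file format is not described in these notes, and `brainstorm_context` is a hypothetical helper.

```python
from typing import Optional

def brainstorm_context(state: dict, brainstorm_text: Optional[str]) -> Optional[str]:
    """Return brainstorm notes only if this session's state marks brainstorming as done.

    A brainstorm file left on disk by an earlier session has no matching flag
    in the current state, so it is ignored."""
    if state.get("brainstorm_done") is not True:
        return None
    return brainstorm_text

stale = brainstorm_context({}, "notes from last week")                        # ignored
fresh = brainstorm_context({"brainstorm_done": True}, "use a token bucket")   # used
```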
Improved Cancellation
- Records `start_branch` and `start_sha` in state file
- Cancel restores starting branch/commit and deletes session branch
- Handles detached HEAD correctly
- `git reset -- . && git checkout -- . && git clean -fd` clears staged + unstaged + untracked
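One way to picture the cancel sequence is as an ordered list of git commands built from the recorded state. The first three commands are the ones quoted above; the restore and branch-deletion steps, the treatment of an empty `start_branch` as a detached HEAD, and the function and session-branch names are assumptions for illustration. Nothing is executed here.

```python
from typing import List, Optional

def cancel_commands(
    start_branch: Optional[str], start_sha: str, session_branch: str
) -> List[List[str]]:
    """Build the git commands for a cancel, in order, without running them."""
    cmds = [
        ["git", "reset", "--", "."],     # unstage everything
        ["git", "checkout", "--", "."],  # discard unstaged edits to tracked files
        ["git", "clean", "-fd"],         # remove untracked files and directories
    ]
    if start_branch:
        cmds.append(["git", "checkout", start_branch])
    else:
        # Assumed: no recorded branch means the session began on a detached HEAD,
        # so return to the recorded commit instead.
        cmds.append(["git", "checkout", "--detach", start_sha])
    cmds.append(["git", "branch", "-D", session_branch])  # delete the session branch
    return cmds

plan = cancel_commands("main", "6630b26", "review-loop/session")  # hypothetical branch names
```

Cleaning the working tree before switching matters: a checkout can refuse to run, or carry changes across, if staged or unstaged edits are still present.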
Other Changes
- `specs/brainstorm.md` added to code-stage protected paths and staging exclusions
- Code-stage verify is Claude-only (no Codex invocation)
- Updated tests (18 total, all passing)
- Updated AGENTS.md for three-stage workflow
Design Review Audit Trail
The implementation spec went through 5 rounds of Codex design review + a verification pass. All review artifacts are preserved in specs/reviews/design/.
Full Changelog: 6630b26...v2.1.0