harsh-critic

A thorough review skill for Claude Code that finds what's wrong and what's missing. It uses structured gap analysis, multi-perspective investigation, and evidence requirements to catch issues that standard reviews consistently overlook.

Despite the name, this isn't about being harsh for its own sake — it's about being thorough. What works is structured prompting: a fixed investigation protocol, explicit gap analysis, evidence requirements, and adaptive harshness that escalates when serious problems are found. The result is a reviewer that's precise by default and adversarial when warranted.

Visual protocol explainer — interactive diagram of the full investigation protocol.

The problem with standard reviews

LLM-based reviewers default to evaluating what IS present. They scan through code or plans, comment on what they see, and move on. The result is that entire categories of issues — missing error handling, unstated assumptions, absent edge cases — go unreported. Not because the reviewer can't find them, but because it was never prompted to look.

How it works

The skill uses a 5-phase investigation protocol that forces systematic coverage:

Phase 1: Pre-commitment predictions

Before reading the work in detail, the reviewer predicts the 3-5 most likely problem areas based on the type and domain of the work. This activates deliberate search — instead of passively reading and reacting, the reviewer enters the review with specific hypotheses to confirm or reject.

Phase 2: Verification

Every technical claim is verified against the actual codebase. No assertion is taken on trust. The verification approach adapts to the artifact type:

For code: Execution paths are traced, especially error paths and edge cases. Checks include off-by-one errors, race conditions, missing null checks, incorrect type assumptions, and security oversights.
For plans: A structured 6-step plan investigation runs: key assumptions extraction (rated VERIFIED/REASONABLE/FRAGILE), pre-mortem analysis (5-7 failure scenarios), dependency audit, ambiguity scan, feasibility check, and rollback analysis. For each major decision, a devil's advocate challenge tests whether the strongest counter-argument was addressed.
For analysis: Logical leaps, unsupported conclusions, and assumptions stated as facts are identified.

Phase 3: Multi-perspective review

The work is re-examined from distinct angles adapted to the artifact type:

For code:

Security engineer: What trust boundaries are crossed? What input isn't validated? What could be exploited?
New hire: Could someone unfamiliar with this codebase follow this work? What context is assumed but never stated?
Ops engineer: What happens at scale? Under load? When dependencies fail? What's the blast radius?

For plans:

Executor: Can I do each step with only what's written here? Where will I get stuck?
Stakeholder: Does this actually solve the stated problem? Are success criteria meaningful?
Skeptic: What's the strongest argument this approach will fail?

Phase 4: Gap analysis

This is the core differentiator — and the phase that actually makes the skill useful. The reviewer explicitly asks:

"What would break this?"
"What edge case isn't handled?"
"What assumption could be wrong?"
"What was conveniently left out?"

Phase 4.5: Self-audit

A metacognitive check on all findings before finalizing. Each CRITICAL/MAJOR finding is assessed for confidence level (HIGH/MEDIUM/LOW), refutability, and whether it's a genuine flaw vs. a stylistic preference. Low-confidence and easily refutable findings are moved to Open Questions rather than scored sections. This reduces false positives without suppressing real issues.

Phase 4.75: Realist Check

A pragmatic severity calibration for CRITICAL and MAJOR findings. Each high-severity finding is pressure-tested with four questions: What's the realistic worst case? What mitigating factors exist? How quickly would this be detected? Is momentum bias inflating the severity? Findings where real-world impact doesn't match the label get downgraded. Hard guardrail: data loss, security breaches, and financial impact are never downgraded. Any recalibrations are reported in the Verdict Justification.

Adaptive harshness

The review starts in THOROUGH mode (precise, evidence-driven, measured). If during investigation the reviewer discovers any CRITICAL finding, 3+ MAJOR findings, or a pattern suggesting systemic issues, it escalates to ADVERSARIAL mode — actively hunting for hidden problems, challenging every decision, and expanding scope to adjacent areas. The verdict reports which mode was used and why.

Phase 5: Synthesis

Findings are compared against the pre-commitment predictions, calibrated for severity, and assembled into a structured verdict.

Output format

Every review produces a structured report:

Verdict: REJECT / REVISE / ACCEPT-WITH-RESERVATIONS / ACCEPT
Pre-commitment predictions: What was expected vs. what was found
Critical findings: Blocks execution. Must include evidence (file:line for code, backtick-quoted excerpts for plans). Includes confidence rating.
Major findings: Causes significant rework. Must include evidence. Includes confidence rating.
Minor findings: Suboptimal but functional.
What's missing: Gaps, unhandled edge cases, unstated assumptions.
Ambiguity risks (plan reviews only): Statements with multiple valid interpretations, both readings, and the consequence of choosing wrong.
Multi-perspective notes: Security/new-hire/ops (code) or executor/stakeholder/skeptic (plans) concerns not captured above.
Verdict justification: Why this verdict, what would upgrade it, and whether the review escalated to adversarial mode.
Open questions: Low-confidence findings moved here by self-audit, plus speculative follow-ups.

Calibration guidance prevents both failure modes: rubber-stamping (saying "looks good" without verifying) and manufactured outrage (inventing problems to seem thorough). The self-audit phase gates findings by confidence, reducing false positives without suppressing genuine issues.

Comparison with oh-my-claudecode's built-in critic

oh-my-claudecode (OMC) ships with a critic agent in its review lane. Harsh Critic is a specialized replacement that differs in several ways:

	OMC critic	harsh-critic
Approach	General-purpose critical challenge of plans and designs	5-phase structured investigation protocol
Gap analysis	Not explicitly prompted — relies on the reviewer's natural instinct	Dedicated phase with specific questions ("What would break this?", "What's missing?")
Evidence requirements	No formal requirement	CRITICAL/MAJOR findings must cite `file:line` or backtick-quoted code
Multi-perspective	Single reviewer perspective	Forced re-examination from security, new-hire, and ops angles
Pre-commitment	None — reads and reacts	Predicts problems before reading to activate deliberate search
Output format	Unstructured	Fixed contract with exact section headings (parseable by benchmarks)
Calibration	General "be thorough" guidance	Explicit anti-rubber-stamp AND anti-manufactured-outrage guidance
Verdict scale	No formal scale	4-tier: REJECT / REVISE / ACCEPT-WITH-RESERVATIONS / ACCEPT

Interactive benchmark comparison chart — visual breakdown of baseline vs v1 vs v2 performance.

Benchmark results (8 controlled fixtures on plan/code/analysis), direct against OMC critic:

Run	What changed	Model	harsh-critic composite	OMC critic composite	Delta	Win/Loss/Tie
Initial benchmark (2026-03-03 12:49)	Original parser + scorer	`claude-sonnet-4-6`	22.1%	13.8%	+8.4%	5/1/2
Parser-hardening rerun (2026-03-03 23:35)	Improved parsing for markdown variants	`claude-opus-4-6`	55.9%	7.8%	+48.1%	8/0/0
Scorer-calibration rerun (2026-03-03 23:54)	Calibrated keyword match thresholds	`claude-opus-4-6`	24.7%	13.8%	+10.9%	4/1/3

Run	True Positive Rate (harsh vs critic)	False Negative Rate (harsh vs critic)	False Positive Rate (harsh vs critic)	Missing Coverage (harsh vs critic)	Perspective Coverage (harsh vs critic)	Evidence Rate (harsh vs critic)
Initial benchmark	13.4% vs 0.0%	61.6% vs 75.0%	40.5% vs 0.0%	12.5% vs 0.0%	12.5% vs 0.0%	0.0% vs 0.0%
Parser-hardening rerun	56.4% vs 4.7%	18.6% vs 70.3%	87.1% vs 78.1%	52.1% vs 0.0%	39.6% vs 0.0%	39.4% vs 0.0%
Scorer-calibration rerun	21.1% vs 0.0%	53.9% vs 75.0%	40.6% vs 0.0%	18.8% vs 0.0%	18.8% vs 0.0%	5.0% vs 0.0%

The parser and scorer reruns are isolated experiments (not cumulative) and were run in separate branches of the OMC benchmark harness.

When OMC is installed, the /harsh-critic skill automatically routes through OMC's agent infrastructure for isolation. It uses the harsh-critic protocol regardless of routing.

Research background

A/B tested across 8 controlled trials. The key finding: reviewers given a structured output format with an explicit "What's Missing" section found 33 gap items. Reviewers without it found 0. The structured format is the active ingredient — not reviewer skill, but reviewer prompting. Full analysis: oh-my-claudecode#1240.

Install

As a skill (adds `/harsh-critic` command)

git clone https://github.com/zivtech/harsh-critic.git
cp -r harsh-critic/.claude/skills/harsh-critic ~/.claude/skills/

Or with npx claude-skills:

npx claude-skills add https://github.com/zivtech/harsh-critic

As an agent (adds `harsh-critic` to agent list)

cp harsh-critic/.claude/agents/harsh-critic.md ~/.claude/agents/

Both

Install both for maximum flexibility. The skill gives you the /harsh-critic slash command. The agent gives you a read-only reviewer (Write/Edit tools disabled) that can be invoked by other agents or by name.

Usage

Skill (slash command)

/harsh-critic path/to/plan.md
/harsh-critic src/api/handler.ts
/harsh-critic the code that was just written

Agent (via agent picker or other agents)

The agent at ~/.claude/agents/harsh-critic.md is automatically available in Claude Code's agent system. It runs at Opus tier with Write/Edit tools disabled — it reviews but cannot modify.

When to use it

Right before merge/deploy for risky code paths (auth, payments, sessions, state transitions)
As a second-pass thorough check after a normal review said "looks good"
On implementation-heavy work where concrete file:line evidence can be cited
When you suspect another agent's output has gaps or weak reasoning

When not to use it

Quick sanity checks on trivial changes — just review directly
Low-noise acceptance checks on known-clean artifacts — this workflow can over-flag
When you want code changes made — use an implementation agent instead

Drupal-specific reviews

For Drupal-heavy changes (module updates, cache behavior, config sync flows, migration plans, or contrib patch risk), route through a dedicated drupal-critic skill/agent first, then optionally run harsh-critic as a second-pass thorough check.

Compatibility

Claude Code: Works standalone, no plugins required.
oh-my-claudecode: If installed, the skill routes through OMC's review lane for enhanced isolation. Falls back to general-purpose agent otherwise.

What's included

.claude/
  skills/
    harsh-critic/
      SKILL.md          # Skill definition (adds /harsh-critic slash command)
  agents/
    harsh-critic.md     # Agent prompt (read-only reviewer, Opus tier)

License

Apache 2.0

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.claude		.claude
.github/workflows		.github/workflows
docs		docs
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

harsh-critic

The problem with standard reviews

How it works

Phase 1: Pre-commitment predictions

Phase 2: Verification

Phase 3: Multi-perspective review

Phase 4: Gap analysis

Phase 4.5: Self-audit

Phase 4.75: Realist Check

Adaptive harshness

Phase 5: Synthesis

Output format

Comparison with oh-my-claudecode's built-in critic

Research background

Install

As a skill (adds `/harsh-critic` command)

As an agent (adds `harsh-critic` to agent list)

Both

Usage

Skill (slash command)

Agent (via agent picker or other agents)

When to use it

When not to use it

Drupal-specific reviews

Compatibility

What's included

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

harsh-critic

The problem with standard reviews

How it works

Phase 1: Pre-commitment predictions

Phase 2: Verification

Phase 3: Multi-perspective review

Phase 4: Gap analysis

Phase 4.5: Self-audit

Phase 4.75: Realist Check

Adaptive harshness

Phase 5: Synthesis

Output format

Comparison with oh-my-claudecode's built-in critic

Research background

Install

As a skill (adds /harsh-critic command)

As an agent (adds harsh-critic to agent list)

Both

Usage

Skill (slash command)

Agent (via agent picker or other agents)

When to use it

When not to use it

Drupal-specific reviews

Compatibility

What's included

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

As a skill (adds `/harsh-critic` command)

As an agent (adds `harsh-critic` to agent list)

Packages