Phase 0 benchmark results testing the core thesis: does multi-model consensus produce better answers than a single model?
50 questions across 5 categories, 4 methods compared:
| Method | Description | Models |
|---|---|---|
| (A) Direct | Single model, direct answer | Sonnet |
| (B) Self-debate | Same model proposes, critiques itself, synthesizes | Sonnet |
| (C) Consensus | Claude proposes, GPT challenges (forced disagreement), Claude revises | Sonnet + GPT-4o |
| (D) Ensemble | 3 parallel samples synthesized | Sonnet |
Evaluation used blind LLM-as-judge with two independent judges (GPT-4o + Sonnet).
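
A minimal sketch of what blind pairwise judging could look like, assuming each judge is a callable that sees two unlabeled answers in random order and returns `"1"`, `"2"`, or `"tie"`. The `Judge` signature, `blind_compare`, and the stub judge are hypothetical names for illustration; the actual judging code in phase0/ may work differently (for example, with rubric scores instead of pairwise picks).

```python
import random
from collections import Counter
from typing import Callable, Sequence

# Hypothetical judge interface: (question, first_answer, second_answer) -> "1", "2", or "tie".
Judge = Callable[[str, str, str], str]


def blind_compare(
    question: str,
    answer_a: str,
    answer_b: str,
    judges: Sequence[Judge],
    rng: random.Random,
) -> str:
    """Return 'a', 'b', or 'tie' by majority vote; judges never see method names."""
    votes: Counter = Counter()
    for judge in judges:
        # Randomize presentation order per judge to reduce position bias.
        swapped = rng.random() < 0.5
        first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
        verdict = judge(question, first, second)
        if verdict == "tie":
            votes["tie"] += 1
            continue
        picked_first = verdict == "1"
        # Map the positional verdict back to the underlying answer.
        if swapped:
            votes["b" if picked_first else "a"] += 1
        else:
            votes["a" if picked_first else "b"] += 1
    if votes["a"] > votes["b"]:
        return "a"
    if votes["b"] > votes["a"]:
        return "b"
    return "tie"


# Stub judge for demonstration only: prefers the longer answer.
def prefers_longer(question: str, first: str, second: str) -> str:
    return "1" if len(first) > len(second) else "2"


result = blind_compare(
    "Should we shard the database?",
    "Yes.",
    "It depends on write volume, team size, and operational budget.",
    judges=[prefers_longer, prefers_longer],
    rng=random.Random(0),
)
```

In a real run each judge would wrap an API call to a different model; the stub only shows that the verdict is independent of presentation order.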
The consensus method (C) consistently outperformed all other approaches:
- Consensus vs Direct: Consensus produced more thorough, nuanced answers with better coverage of trade-offs and practical considerations
- Consensus vs Self-debate: Cross-model challenge found genuine blind spots that self-critique missed
- Consensus vs Ensemble: Forced disagreement was more effective than synthesizing several similar samples from the same model
- Cross-model challenges work -- GPT finds real flaws in Claude's answers (and vice versa) that self-critique misses
- Forced disagreement is essential -- Without explicit instructions to disagree, challengers tend toward sycophantic agreement
- Revision quality depends on challenge quality -- Genuine challenges produce better revisions than polite suggestions
- Cost is reasonable -- Consensus costs ~3x a direct answer but the quality improvement justifies it for important questions
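
The propose, challenge, revise flow of method (C) can be sketched roughly as follows. This is an illustration under assumptions, not the benchmark's implementation: `Model` is a hypothetical callable that maps a prompt to a completion, `consensus` is a made-up function name, and the wording of `FORCED_DISAGREEMENT` is invented for the example.

```python
from typing import Callable, Dict

# Hypothetical model interface: takes a prompt string, returns the completion text.
Model = Callable[[str], str]

# Illustrative wording only; the benchmark's actual challenger prompt is not shown here.
FORCED_DISAGREEMENT = (
    "You must disagree with the answer below. Identify its weakest claims, "
    "missing trade-offs, and risks. Do not simply agree."
)


def consensus(question: str, proposer: Model, challenger: Model) -> Dict[str, str]:
    """Run one propose -> challenge -> revise round (three model calls)."""
    proposal = proposer(f"Question: {question}\nAnswer directly.")
    challenge = challenger(
        f"{FORCED_DISAGREEMENT}\n\nQuestion: {question}\nAnswer: {proposal}"
    )
    revision = proposer(
        f"Question: {question}\nYour answer: {proposal}\n"
        f"Challenge from another model: {challenge}\n"
        "Revise your answer, addressing every valid point."
    )
    return {"proposal": proposal, "challenge": challenge, "revision": revision}


# Stub models for demonstration; real ones would wrap API clients.
prompts_seen = []


def stub_proposer(prompt: str) -> str:
    prompts_seen.append(("proposer", prompt))
    return "draft answer"


def stub_challenger(prompt: str) -> str:
    prompts_seen.append(("challenger", prompt))
    return "the draft ignores cost"


result = consensus("Should we shard the database?", stub_proposer, stub_challenger)
```

Each question takes three model calls instead of one, which is in line with the roughly 3x cost noted above, although real costs also depend on prompt and response lengths.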
The benchmark defined three possible outcomes, each tied to a threshold:
- PROCEED: Consensus (C) clearly beats Direct (A) on judgment/strategy (>60% win rate)
- ITERATE: Consensus only marginally beats Self-Debate (B) -- refine prompts, re-test
- STOP: Consensus consistently loses -- thesis invalidated
Result: PROCEED -- consensus demonstrated clear advantages, especially on questions requiring nuanced judgment, risk assessment, and multi-perspective analysis.
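
One way to express these thresholds as a decision rule is sketched below. Only the 60% win rate against Direct comes from the criteria above; the `parity` and `margin` values, the order in which the checks are applied, and the function name `decide` are assumptions made for illustration.

```python
def decide(
    win_vs_direct: float,
    win_vs_self_debate: float,
    proceed_threshold: float = 0.60,  # from the PROCEED criterion above
    parity: float = 0.50,             # assumption: below this counts as losing
    margin: float = 0.05,             # assumption: what "marginally" means
) -> str:
    """Map Consensus win rates (0.0 to 1.0) to PROCEED, ITERATE, or STOP."""
    # STOP: Consensus loses against both baselines.
    if win_vs_direct < parity and win_vs_self_debate < parity:
        return "STOP"
    # PROCEED: clear win over Direct and more than a marginal win over Self-debate.
    if win_vs_direct > proceed_threshold and win_vs_self_debate > parity + margin:
        return "PROCEED"
    # Everything in between: refine prompts and re-test.
    return "ITERATE"


# Example inputs are made up, not benchmark results.
outcomes = [decide(0.70, 0.65), decide(0.70, 0.52), decide(0.40, 0.45)]
```

The example inputs are hypothetical win rates chosen to exercise each branch; they are not figures from the benchmark.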
The benchmark suite is in the phase0/ directory:
```bash
uv sync
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...

# Pilot run (5 questions)
uv run python -m phase0.runner --pilot

# Full benchmark (50 questions)
uv run python -m phase0.runner

# Judge results
uv run python -m phase0.judge

# Generate report
uv run python -m phase0.analyze
```

- How Consensus Works -- The protocol built from these findings
- Getting Started -- Try consensus yourself