Competition-judging framework for AI model evaluation.
The BBQ Benchmark applies battle-tested competition BBQ judging mechanics — double-blind evaluation, drop-score outlier removal, weighted multi-dimensional scoring, and diverse judge panels — to AI model evaluation.
It is not a new set of test questions. It is a judging framework that wraps around any evaluation tasks and produces scores that resist the gaming, bias, and saturation plaguing current benchmarks.
The Kansas City Barbeque Society has been running fair evaluations under adversarial conditions since 1986. Their protocol solves exactly the problems AI benchmarks have:
| Problem | Standard Benchmarks | BBQ Benchmark |
|---|---|---|
| Judge bias | Single evaluator or simple majority | 5 diverse judge personas with drop-scoring |
| Gaming/contamination | Static questions, easily memorized | Mutation testing detects memorization |
| Score saturation | GSM8K at 98%+ — can't differentiate | 4-dimension scoring reveals capability profiles |
| Style over substance | Length/formatting bias in LLM judges | Correctness weighted at 44%, Presentation at 11% |
| Single dimension | One number per benchmark | 24-number profile (4 dimensions x 6 categories) |
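The mutation-testing row above works by perturbing surface details of a task so that a memorized answer string stops matching while the underlying problem is unchanged. A minimal sketch (the swap table and scoring helper are illustrative, not the benchmark's actual implementation):

```python
def mutate_prompt(prompt: str) -> str:
    """Surface-level mutation: swap entity names so a memorized answer
    no longer matches, while the underlying problem stays identical.
    The swap table here is illustrative only."""
    swaps = {"Alice": "Priya", "Bob": "Marcus", "apples": "mangoes"}
    for old, new in swaps.items():
        prompt = prompt.replace(old, new)
    return prompt


def contamination_gap(original_score: float, mutated_score: float) -> float:
    """Score drop on the mutated variant; a large positive gap is a
    memorization red flag."""
    return original_score - mutated_score
```

A model that scores 92 on the original but 61 on the mutation has a gap of 31 points, which is strong evidence the original was in its training data.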
```bash
# Clone and run
git clone https://github.com/yourusername/bbq-benchmark.git
cd bbq-benchmark

# Quick evaluation (3 judges, ~15 min)
python bbq_benchmark.py --tasks tasks_sample.json --model sonnet --mode quick

# Standard evaluation (5 judges, ~2 hrs)
python bbq_benchmark.py --tasks tasks_sample.json --model sonnet --judges 5

# Full multi-model comparison with mutations
python bbq_benchmark.py --tasks tasks_sample.json --mode full

# View results from a previous run
python bbq_benchmark.py --report bbq_results/results.json
```

Requirements:

- Python 3.10+
- Claude Code CLI installed and authenticated
- No additional Python dependencies (stdlib only)
Reads evaluation tasks from a JSON file. Each task has a prompt, reference answer, category, and scoring notes.
Sends each task to a model via `claude -p` CLI calls, each with a clean context. Captures the response and extracts a structured answer, reasoning, and confidence.
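This step can be sketched as a subprocess call plus a field extractor. The exact CLI flags and the ANSWER/REASONING/CONFIDENCE response template are assumptions for illustration; check `claude --help` and your prompt template before relying on them:

```python
import subprocess


def query_model(prompt: str, model: str = "sonnet", timeout: int = 300) -> str:
    """One clean-context call through the Claude Code CLI.
    The flag set here is an assumption; verify against your installed version."""
    result = subprocess.run(
        ["claude", "-p", prompt, "--model", model],
        capture_output=True, text=True, timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(f"claude CLI failed: {result.stderr.strip()}")
    return result.stdout.strip()


def extract_structured(response: str) -> dict:
    """Parse ANSWER / REASONING / CONFIDENCE fields from a templated
    response. The field names are an assumed template, not a documented
    format; fields missing from the text come back as None."""
    fields = {"answer": None, "reasoning": None, "confidence": None}
    for line in response.splitlines():
        for key in fields:
            prefix = key.upper() + ":"
            if line.strip().startswith(prefix):
                fields[key] = line.strip()[len(prefix):].strip()
    return fields
```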
Each response is evaluated by 5 independent judge calls (separate clean-context CLI invocations). Each judge scores on 4 dimensions using a different evaluation persona:
- Strict Accuracy — facts-first, skeptical by default
- Process-Oriented — reasoning chain quality over final answer
- Adversarial — hunts for weaknesses and edge cases
- Practical Utility — "would this actually help someone?"
- Balanced Holistic — calibration anchor
Removes the highest and lowest judge scores (outlier trimming). Averages the remaining 3.
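The trimming step is a simple trimmed mean; a minimal sketch (not the benchmark's exact implementation):

```python
def drop_score(judge_scores: list[float]) -> float:
    """BBQ-style trimmed mean: sort, drop the single highest and single
    lowest judge scores, and average the rest. With 5 judges this
    averages the middle 3 scores."""
    if len(judge_scores) < 3:
        raise ValueError("drop-scoring needs at least 3 judge scores")
    trimmed = sorted(judge_scores)[1:-1]
    return sum(trimmed) / len(trimmed)
```

Note how a single adversarial or lenient judge cannot move the result: one score of 1.0 among four 9.0s still yields 9.0.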
Applies BBQ-style dimension weights:
| Dimension | Weight | % of Total | BBQ Analogue |
|---|---|---|---|
| Correctness | 4.0 | 44.4% | Taste |
| Reasoning Quality | 2.5 | 27.8% | Tenderness |
| Robustness | 1.5 | 16.7% | Texture |
| Presentation | 1.0 | 11.1% | Appearance |
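The weighted aggregation above can be sketched as follows; the dimension keys are illustrative names, not necessarily the benchmark's exact field names:

```python
# Weights from the table; total weight = 9.0
WEIGHTS = {
    "correctness": 4.0,    # "Taste"      -> 4.0 / 9.0 = 44.4%
    "reasoning": 2.5,      # "Tenderness" -> 27.8%
    "robustness": 1.5,     # "Texture"    -> 16.7%
    "presentation": 1.0,   # "Appearance" -> 11.1%
}


def weighted_score(dims: dict[str, float]) -> float:
    """Weighted mean over the four dimension scores."""
    total = sum(WEIGHTS.values())
    return sum(dims[d] * w for d, w in WEIGHTS.items()) / total
```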
Outputs BBQ scores alongside public benchmark scores for the same model.
Edit `tasks_sample.json` or create a new task file:

```json
{
  "tasks": [
    {
      "task_id": "MY-001",
      "category": "analytical_reasoning",
      "source": "custom",
      "difficulty": "standard",
      "prompt": "Your evaluation prompt here",
      "reference_answer": "The expected correct answer",
      "scoring_notes": "Special guidance for judges",
      "has_executable_test": false,
      "test_code": null
    }
  ]
}
```

Available categories:

- `analytical_reasoning` — logic, math, causal inference
- `code_generation` — writing and fixing code
- `knowledge_synthesis` — cross-domain insight
- `communication` — clear, audience-adapted writing
- `adversarial_robustness` — trick questions, edge cases
- `calibration` — self-awareness of knowledge limits
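A minimal loader sketch for a task file in this schema. The required-field set is an assumption drawn from the example above; extend it if your tasks rely on more fields:

```python
import json

# Minimal assumption about which fields every task must carry.
REQUIRED_FIELDS = {"task_id", "category", "prompt", "reference_answer"}


def load_tasks(path: str) -> list[dict]:
    """Load a task bank JSON file and fail fast on tasks that are
    missing required fields."""
    with open(path, encoding="utf-8") as f:
        bank = json.load(f)
    for task in bank["tasks"]:
        missing = REQUIRED_FIELDS - task.keys()
        if missing:
            raise ValueError(
                f"task {task.get('task_id', '?')} is missing {sorted(missing)}"
            )
    return bank["tasks"]
```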
For code tasks, set `has_executable_test: true` and provide `test_code` with Python `assert` statements. The benchmark will run the model's code against your tests.
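One way to sketch that check is to execute the model's code and the task's asserts in a shared namespace; a failed assert, or any other exception, counts as a failure. This is an illustration, not the benchmark's exact harness:

```python
def run_executable_test(model_code: str, test_code: str) -> bool:
    """Execute the model's code, then the task's assert statements, in
    one shared namespace. Any exception (including AssertionError)
    counts as a failure. Caution: exec() runs untrusted code, so a real
    harness should sandbox or subprocess this."""
    namespace: dict = {}
    try:
        exec(model_code, namespace)
        exec(test_code, namespace)
        return True
    except Exception:
        return False
```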
```text
python bbq_benchmark.py [OPTIONS]

Options:
  --tasks FILE          Task bank JSON file (default: tasks_sample.json)
  --model MODEL [...]   Models to evaluate: opus, sonnet, haiku
  --mode {quick,standard,full}
                        Preset configuration
  --judges N            Judge calls per response (3-7, default: 5)
  --judge-model MODEL   Model for judge calls (default: claude-sonnet-4-6)
  --mutations           Enable mutation testing for contamination detection
  --seed N              Random seed for reproducibility (default: 42)
  --output-dir DIR      Output directory (default: ./bbq_results)
  --resume FILE         Resume from checkpoint
  --report FILE         Pretty-print previous results
  --verbose             Detailed progress output
```
| Mode | Judges | Default Models | Mutations | Est. Time (15 tasks) |
|---|---|---|---|---|
| `quick` | 3 | sonnet | off | ~15 min |
| `standard` | 5 | sonnet | off | ~1-2 hrs |
| `full` | 5 | opus, sonnet, haiku | on | ~4-6 hrs |
Results are written to `bbq_results/` (configurable):

- `results.json` — Full structured results with per-task scores
- `comparison_table.md` — BBQ scores vs. public benchmark scores
- `anti_gaming_analysis.md` — Drop-score impact and mutation analysis
- `bbq_benchmark_report.md` — Complete narrative report
A normalized 0-100 score derived from weighted multi-dimensional judging with outlier removal. Not directly comparable to standard benchmark percentages — it measures different things.
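The normalization itself is a simple rescaling. The 9-point judging scale below is an assumption borrowed from KCBS-style scorecards, not a documented constant of this benchmark; substitute whatever scale your judges actually score on:

```python
def to_bbq_score(weighted_mean: float, scale_max: float = 9.0) -> float:
    """Map the weighted, drop-scored judge mean onto 0-100.
    scale_max = 9.0 is an assumed judging scale; adjust to taste."""
    return round(100.0 * weighted_mean / scale_max, 1)
```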
The 4-dimension breakdown reveals patterns single scores hide:
- High Correctness + Low Reasoning = possible memorization
- High Presentation + Low Robustness = polished but brittle
- High Calibration scores = the model knows what it doesn't know
If removing the highest/lowest judges changes model rankings, the outlier removal is doing meaningful work.
See docs/methodology.md for the full protocol specification, including:
- Detailed scoring rubrics
- Judge persona definitions
- Drop-score mathematics
- Anti-gaming mechanisms
- Category definitions
MIT License. See LICENSE.
- Add tasks — submit PRs with new task JSON files
- Add categories — extend the category system for domain-specific evaluation
- Add judge personas — define new evaluation perspectives
- Improve analysis — enhance the reporting and visualization pipeline
Built by AB Support. The BBQ Benchmark is part of our work on AI trust infrastructure.