Competition-judging framework for AI model evaluation.
The BBQ Benchmark applies battle-tested competition BBQ judging mechanics — double-blind evaluation, drop-score outlier removal, weighted multi-dimensional scoring, and diverse judge panels — to AI model evaluation.
It is not a new set of test questions. It is a judging framework that wraps around any evaluation tasks and produces scores that resist the gaming, bias, and saturation plaguing current benchmarks.
The Kansas City Barbeque Society has been running fair evaluations under adversarial conditions since 1986. Their protocol solves exactly the problems AI benchmarks have:
| Problem | Standard Benchmarks | BBQ Benchmark |
|---|---|---|
| Judge bias | Single evaluator or simple majority | 5 diverse judge personas with drop-scoring |
| Gaming/contamination | Static questions, easily memorized | Mutation testing detects memorization |
| Score saturation | GSM8K at 98%+ — can't differentiate | 4-dimension scoring reveals capability profiles |
| Style over substance | Length/formatting bias in LLM judges | Correctness weighted at 44%, Presentation at 11% |
| Single dimension | One number per benchmark | 24-number profile (4 dimensions x 6 categories) |
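The mutation-testing row above works by perturbing surface details of a task so that a memorized answer string stops matching while the underlying problem is unchanged. A minimal sketch (the swap table and scoring helper are illustrative, not the benchmark's actual implementation):

```python
def mutate_prompt(prompt: str) -> str:
    """Surface-level mutation: swap entity names so a memorized answer
    no longer matches, while the underlying problem stays identical.
    The swap table here is illustrative only."""
    swaps = {"Alice": "Priya", "Bob": "Marcus", "apples": "mangoes"}
    for old, new in swaps.items():
        prompt = prompt.replace(old, new)
    return prompt


def contamination_gap(original_score: float, mutated_score: float) -> float:
    """Score drop on the mutated variant; a large positive gap is a
    memorization red flag."""
    return original_score - mutated_score
```

A model that scores 92 on the original but 61 on the mutation has a gap of 31 points, which is strong evidence the original was in its training data.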
```bash
# Clone and run
git clone https://github.com/yourusername/bbq-benchmark.git
cd bbq-benchmark

# Quick evaluation (3 judges, ~15 min)
python bbq_benchmark.py --tasks tasks_sample.json --model sonnet --mode quick

# Standard evaluation (5 judges, ~2 hrs)
python bbq_benchmark.py --tasks tasks_sample.json --model sonnet --judges 5

# Full multi-model comparison with mutations
python bbq_benchmark.py --tasks tasks_sample.json --mode full

# View results from a previous run
python bbq_benchmark.py --report bbq_results/results.json
```

Requirements:

- Python 3.10+
- Claude Code CLI installed and authenticated
- No additional Python dependencies (stdlib only)
Reads evaluation tasks from a JSON file. Each task has a prompt, reference answer, category, and scoring notes.
Sends each task to a model via `claude -p` CLI calls, each with a clean context. Captures the response and extracts a structured answer, reasoning, and confidence.
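This step can be sketched as a subprocess call plus a field extractor. The exact CLI flags and the ANSWER/REASONING/CONFIDENCE response template are assumptions for illustration; check `claude --help` and your prompt template before relying on them:

```python
import subprocess


def query_model(prompt: str, model: str = "sonnet", timeout: int = 300) -> str:
    """One clean-context call through the Claude Code CLI.
    The flag set here is an assumption; verify against your installed version."""
    result = subprocess.run(
        ["claude", "-p", prompt, "--model", model],
        capture_output=True, text=True, timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(f"claude CLI failed: {result.stderr.strip()}")
    return result.stdout.strip()


def extract_structured(response: str) -> dict:
    """Parse ANSWER / REASONING / CONFIDENCE fields from a templated
    response. The field names are an assumed template, not a documented
    format; fields missing from the text come back as None."""
    fields = {"answer": None, "reasoning": None, "confidence": None}
    for line in response.splitlines():
        for key in fields:
            prefix = key.upper() + ":"
            if line.strip().startswith(prefix):
                fields[key] = line.strip()[len(prefix):].strip()
    return fields
```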
Each response is evaluated by 5 independent judge calls (separate clean-context CLI invocations). Each judge scores on 4 dimensions using a different evaluation persona:
- Strict Accuracy — facts-first, skeptical by default
- Process-Oriented — reasoning chain quality over final answer
- Adversarial — hunts for weaknesses and edge cases
- Practical Utility — "would this actually help someone?"
- Balanced Holistic — calibration anchor
Removes the highest and lowest judge scores (outlier trimming). Averages the remaining 3.
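The trimming step is a simple trimmed mean; a minimal sketch (not the benchmark's exact implementation):

```python
def drop_score(judge_scores: list[float]) -> float:
    """BBQ-style trimmed mean: sort, drop the single highest and single
    lowest judge scores, and average the rest. With 5 judges this
    averages the middle 3 scores."""
    if len(judge_scores) < 3:
        raise ValueError("drop-scoring needs at least 3 judge scores")
    trimmed = sorted(judge_scores)[1:-1]
    return sum(trimmed) / len(trimmed)
```

Note how a single adversarial or lenient judge cannot move the result: one score of 1.0 among four 9.0s still yields 9.0.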
Applies BBQ-style dimension weights:
| Dimension | Weight | % of Total | BBQ Analogue |
|---|---|---|---|
| Correctness | 4.0 | 44.4% | Taste |
| Reasoning Quality | 2.5 | 27.8% | Tenderness |
| Robustness | 1.5 | 16.7% | Texture |
| Presentation | 1.0 | 11.1% | Appearance |
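The weighted aggregation above can be sketched as follows; the dimension keys are illustrative names, not necessarily the benchmark's exact field names:

```python
# Weights from the table; total weight = 9.0
WEIGHTS = {
    "correctness": 4.0,    # "Taste"      -> 4.0 / 9.0 = 44.4%
    "reasoning": 2.5,      # "Tenderness" -> 27.8%
    "robustness": 1.5,     # "Texture"    -> 16.7%
    "presentation": 1.0,   # "Appearance" -> 11.1%
}


def weighted_score(dims: dict[str, float]) -> float:
    """Weighted mean over the four dimension scores."""
    total = sum(WEIGHTS.values())
    return sum(dims[d] * w for d, w in WEIGHTS.items()) / total
```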
Outputs BBQ scores alongside public benchmark scores for the same model.
Edit `tasks_sample.json` or create a new task file:

```json
{
  "tasks": [
    {
      "task_id": "MY-001",
      "category": "analytical_reasoning",
      "source": "custom",
      "difficulty": "standard",
      "prompt": "Your evaluation prompt here",
      "reference_answer": "The expected correct answer",
      "scoring_notes": "Special guidance for judges",
      "has_executable_test": false,
      "test_code": null
    }
  ]
}
```

Available categories:

- `analytical_reasoning` — logic, math, causal inference
- `code_generation` — writing and fixing code
- `knowledge_synthesis` — cross-domain insight
- `communication` — clear, audience-adapted writing
- `adversarial_robustness` — trick questions, edge cases
- `calibration` — self-awareness of knowledge limits
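A minimal loader sketch for a task file in this schema. The required-field set is an assumption drawn from the example above; extend it if your tasks rely on more fields:

```python
import json

# Minimal assumption about which fields every task must carry.
REQUIRED_FIELDS = {"task_id", "category", "prompt", "reference_answer"}


def load_tasks(path: str) -> list[dict]:
    """Load a task bank JSON file and fail fast on tasks that are
    missing required fields."""
    with open(path, encoding="utf-8") as f:
        bank = json.load(f)
    for task in bank["tasks"]:
        missing = REQUIRED_FIELDS - task.keys()
        if missing:
            raise ValueError(
                f"task {task.get('task_id', '?')} is missing {sorted(missing)}"
            )
    return bank["tasks"]
```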
For code tasks, set `has_executable_test: true` and provide `test_code` with Python `assert` statements. The benchmark will run the model's code against your tests.
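One way to sketch that check is to execute the model's code and the task's asserts in a shared namespace; a failed assert, or any other exception, counts as a failure. This is an illustration, not the benchmark's exact harness:

```python
def run_executable_test(model_code: str, test_code: str) -> bool:
    """Execute the model's code, then the task's assert statements, in
    one shared namespace. Any exception (including AssertionError)
    counts as a failure. Caution: exec() runs untrusted code, so a real
    harness should sandbox or subprocess this."""
    namespace: dict = {}
    try:
        exec(model_code, namespace)
        exec(test_code, namespace)
        return True
    except Exception:
        return False
```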
```text
python bbq_benchmark.py [OPTIONS]

Options:
  --tasks FILE          Task bank JSON file (default: tasks_sample.json)
  --model MODEL [...]   Models to evaluate: opus, sonnet, haiku
  --mode {quick,standard,full}
                        Preset configuration
  --judges N            Judge calls per response (3-7, default: 5)
  --judge-model MODEL   Model for judge calls (default: claude-sonnet-4-6)
  --mutations           Enable mutation testing for contamination detection
  --seed N              Random seed for reproducibility (default: 42)
  --output-dir DIR      Output directory (default: ./bbq_results)
  --resume FILE         Resume from checkpoint
  --report FILE         Pretty-print previous results
  --verbose             Detailed progress output
```
| Mode | Judges | Default Models | Mutations | Est. Time (15 tasks) |
|---|---|---|---|---|
| `quick` | 3 | sonnet | off | ~15 min |
| `standard` | 5 | sonnet | off | ~1-2 hrs |
| `full` | 5 | opus, sonnet, haiku | on | ~4-6 hrs |
Results are written to `bbq_results/` (configurable):

- `results.json` — Full structured results with per-task scores
- `comparison_table.md` — BBQ scores vs. public benchmark scores
- `anti_gaming_analysis.md` — Drop-score impact and mutation analysis
- `bbq_benchmark_report.md` — Complete narrative report
A normalized 0-100 score derived from weighted multi-dimensional judging with outlier removal. Not directly comparable to standard benchmark percentages — it measures different things.
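The normalization itself is a simple rescaling. The 9-point judging scale below is an assumption borrowed from KCBS-style scorecards, not a documented constant of this benchmark; substitute whatever scale your judges actually score on:

```python
def to_bbq_score(weighted_mean: float, scale_max: float = 9.0) -> float:
    """Map the weighted, drop-scored judge mean onto 0-100.
    scale_max = 9.0 is an assumed judging scale; adjust to taste."""
    return round(100.0 * weighted_mean / scale_max, 1)
```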
The 4-dimension breakdown reveals patterns single scores hide:
- High Correctness + Low Reasoning = possible memorization
- High Presentation + Low Robustness = polished but brittle
- High Calibration scores = the model knows what it doesn't know
If removing the highest/lowest judges changes model rankings, the outlier removal is doing meaningful work.
See docs/methodology.md for the full protocol specification, including:
- Detailed scoring rubrics
- Judge persona definitions
- Drop-score mathematics
- Anti-gaming mechanisms
- Category definitions
MIT License. See LICENSE.
- Add tasks — submit PRs with new task JSON files
- Add categories — extend the category system for domain-specific evaluation
- Add judge personas — define new evaluation perspectives
- Improve analysis — enhance the reporting and visualization pipeline
Built by AB Support. The BBQ Benchmark is part of our work on AI trust infrastructure.