BBQ Benchmark

Competition-judging framework for AI model evaluation.

The BBQ Benchmark applies battle-tested competition BBQ judging mechanics — double-blind evaluation, drop-score outlier removal, weighted multi-dimensional scoring, and diverse judge panels — to AI model evaluation.

It is not a new set of test questions. It is a judging framework that wraps around any set of evaluation tasks and produces scores that resist the gaming, bias, and saturation plaguing current benchmarks.

Why BBQ?

The Kansas City Barbeque Society has been running fair evaluations under adversarial conditions since 1986. Their protocol solves exactly the problems AI benchmarks have:

| Problem | Standard Benchmarks | BBQ Benchmark |
|---|---|---|
| Judge bias | Single evaluator or simple majority | 5 diverse judge personas with drop-scoring |
| Gaming/contamination | Static questions, easily memorized | Mutation testing detects memorization |
| Score saturation | GSM8K at 98%+, no room to differentiate | 4-dimension scoring reveals capability profiles |
| Style over substance | Length/formatting bias in LLM judges | Correctness weighted at 44%, Presentation at 11% |
| Single dimension | One number per benchmark | 24-number profile (4 dimensions x 6 categories) |

Quick Start

# Clone and run
git clone https://github.com/alexfleetcommander/smokehouse-eval.git
cd smokehouse-eval

# Quick evaluation (3 judges, ~15 min)
python bbq_benchmark.py --tasks tasks_sample.json --model sonnet --mode quick

# Standard evaluation (5 judges, ~2 hrs)
python bbq_benchmark.py --tasks tasks_sample.json --model sonnet --judges 5

# Full multi-model comparison with mutations
python bbq_benchmark.py --tasks tasks_sample.json --mode full

# View results from a previous run
python bbq_benchmark.py --report bbq_results/results.json

Requirements

  • Python 3.10+
  • Claude Code CLI installed and authenticated
  • No additional Python dependencies (stdlib only)

How It Works

1. Task Loader

Reads evaluation tasks from a JSON file. Each task has a prompt, reference answer, category, and scoring notes.
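A minimal loader sketch. Field names follow the task schema shown under "Adding Your Own Tasks"; the validation here is illustrative, not the benchmark's actual code:

```python
import json

def load_tasks(path="tasks_sample.json"):
    """Load a task bank and check each task has the minimum required fields."""
    with open(path) as f:
        bank = json.load(f)
    required = {"task_id", "category", "prompt", "reference_answer"}
    for task in bank["tasks"]:
        missing = required - task.keys()
        if missing:
            raise ValueError(f"task {task.get('task_id', '?')} missing {missing}")
    return bank["tasks"]
```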

2. Model Runner

Sends each task to a model via claude -p CLI calls with clean context. Captures the response, extracts structured answer/reasoning/confidence.
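A sketch of that dispatch, assuming non-interactive `claude -p` print-mode calls as described above; the exact flags and the timeout value are assumptions, not the benchmark's verbatim code:

```python
import subprocess

def build_model_command(prompt, model="sonnet"):
    # -p runs Claude Code in non-interactive print mode; --model picks the alias
    return ["claude", "-p", prompt, "--model", model]

def run_model(prompt, model="sonnet", timeout=300):
    """One clean-context CLI call per task: no shared session, no history."""
    result = subprocess.run(
        build_model_command(prompt, model),
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout.strip()
```

Each task gets its own process, which is what keeps contexts clean between calls.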

3. Judge Panel

Each response is evaluated by 5 independent judge calls (separate clean-context CLI invocations). Each judge scores on 4 dimensions using a different evaluation persona:

  • Strict Accuracy — facts-first, skeptical by default
  • Process-Oriented — reasoning chain quality over final answer
  • Adversarial — hunts for weaknesses and edge cases
  • Practical Utility — "would this actually help someone?"
  • Balanced Holistic — calibration anchor
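One way those persona calls could be assembled; the persona wording and the JSON response contract here are illustrative assumptions:

```python
JUDGE_PERSONAS = {
    "strict_accuracy": "You are a facts-first evaluator, skeptical by default.",
    "process_oriented": "You value the reasoning chain over the final answer.",
    "adversarial": "You hunt for weaknesses and edge cases.",
    "practical_utility": "You ask: would this actually help someone?",
    "balanced_holistic": "You are a balanced, well-calibrated evaluator.",
}
DIMENSIONS = ["correctness", "reasoning_quality", "robustness", "presentation"]

def build_judge_prompt(persona_key, task_prompt, response, reference):
    """Compose one judge call: persona framing plus the 4-dimension rubric."""
    return (
        f"{JUDGE_PERSONAS[persona_key]}\n\n"
        f"Task: {task_prompt}\n"
        f"Reference answer: {reference}\n"
        f"Model response: {response}\n\n"
        f"Score the response 0-10 on each of: {', '.join(DIMENSIONS)}. "
        "Reply with a single JSON object mapping dimension to score."
    )
```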

4. Drop-Score Calculator

Removes the highest and lowest judge scores (outlier trimming), then averages the rest: the middle 3 under the default 5-judge panel.
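The trim itself is small; a sketch, generalized to any panel size of 3 or more:

```python
def drop_score(scores):
    """Drop the single highest and lowest scores, average the rest."""
    if len(scores) < 3:
        raise ValueError("need at least 3 scores to trim both extremes")
    trimmed = sorted(scores)[1:-1]
    return sum(trimmed) / len(trimmed)
```

One generous and one harsh outlier each lose their leverage: `drop_score([9.5, 7, 8, 7.5, 2])` averages only the middle three, giving 7.5.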

5. Weighted Aggregator

Applies BBQ-style dimension weights:

| Dimension | Weight | % of Total | BBQ Analogue |
|---|---|---|---|
| Correctness | 4.0 | 44.4% | Taste |
| Reasoning Quality | 2.5 | 27.8% | Tenderness |
| Robustness | 1.5 | 16.7% | Texture |
| Presentation | 1.0 | 11.1% | Appearance |
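With weights summing to 9.0, the aggregation reduces to a weighted mean; a sketch assuming each dimension is judged 0-10 and the final score is reported on a 0-100 scale:

```python
WEIGHTS = {
    "correctness": 4.0,        # Taste (44.4%)
    "reasoning_quality": 2.5,  # Tenderness (27.8%)
    "robustness": 1.5,         # Texture (16.7%)
    "presentation": 1.0,       # Appearance (11.1%)
}

def bbq_score(dims):
    """Weighted mean of 0-10 dimension scores, scaled to 0-100."""
    total = sum(WEIGHTS.values())  # 9.0
    weighted = sum(WEIGHTS[d] * dims[d] for d in WEIGHTS)
    return round(10 * weighted / total, 1)
```

A response with perfect correctness but zeros everywhere else tops out at 44.4, which is the point of the weighting.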

6. Comparison Reporter

Outputs BBQ scores alongside public benchmark scores for the same model.

Adding Your Own Tasks

Edit tasks_sample.json or create a new task file:

{
  "tasks": [
    {
      "task_id": "MY-001",
      "category": "analytical_reasoning",
      "source": "custom",
      "difficulty": "standard",
      "prompt": "Your evaluation prompt here",
      "reference_answer": "The expected correct answer",
      "scoring_notes": "Special guidance for judges",
      "has_executable_test": false,
      "test_code": null
    }
  ]
}

Categories

  • analytical_reasoning — logic, math, causal inference
  • code_generation — writing and fixing code
  • knowledge_synthesis — cross-domain insight
  • communication — clear, audience-adapted writing
  • adversarial_robustness — trick questions, edge cases
  • calibration — self-awareness of knowledge limits

Executable Tests

For code tasks, set has_executable_test: true and provide test_code with Python assert statements. The benchmark will run the model's code against your tests.
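For illustration, a hypothetical code task's test_code plus a minimal harness sketch. The actual runner's sandboxing is not specified here; this assumes a plain exec of the model's code and the asserts in one shared namespace:

```python
# Hypothetical test_code for a task asking the model to write `is_prime`.
TEST_CODE = """
assert is_prime(2)
assert is_prime(13)
assert not is_prime(1)
assert not is_prime(15)
"""

def run_executable_test(model_code: str, test_code: str) -> bool:
    """Exec the model's code, then the task's asserts, in one namespace."""
    ns = {}
    try:
        exec(model_code, ns)
        exec(test_code, ns)
    except Exception:
        # any failed assert, or a crash in the model's code, counts as a fail
        return False
    return True
```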

CLI Reference

python bbq_benchmark.py [OPTIONS]

Options:
  --tasks FILE          Task bank JSON file (default: tasks_sample.json)
  --model MODEL [...]   Models to evaluate: opus, sonnet, haiku
  --mode {quick,standard,full}
                        Preset configuration
  --judges N            Judge calls per response (3-7, default: 5)
  --judge-model MODEL   Model for judge calls (default: claude-sonnet-4-6)
  --mutations           Enable mutation testing for contamination detection
  --seed N              Random seed for reproducibility (default: 42)
  --output-dir DIR      Output directory (default: ./bbq_results)
  --resume FILE         Resume from checkpoint
  --report FILE         Pretty-print previous results
  --verbose             Detailed progress output

Mode Presets

| Mode | Judges | Default Models | Mutations | Est. Time (15 tasks) |
|---|---|---|---|---|
| quick | 3 | sonnet | off | ~15 min |
| standard | 5 | sonnet | off | ~1-2 hrs |
| full | 5 | opus, sonnet, haiku | on | ~4-6 hrs |

Output

Results are written to bbq_results/ (configurable):

  • results.json — Full structured results with per-task scores
  • comparison_table.md — BBQ scores vs. public benchmark scores
  • anti_gaming_analysis.md — Drop-score impact and mutation analysis
  • bbq_benchmark_report.md — Complete narrative report

Interpreting Results

BBQ Score

A normalized 0-100 score derived from weighted multi-dimensional judging with outlier removal. Not directly comparable to standard benchmark percentages — it measures different things.

Capability Profiles

The 4-dimension breakdown reveals patterns single scores hide:

  • High Correctness + Low Reasoning = possible memorization
  • High Presentation + Low Robustness = polished but brittle
  • High Calibration scores = the model knows what it doesn't know

Drop-Score Impact

If removing the highest/lowest judges changes model rankings, the outlier removal is doing meaningful work.

Methodology

See docs/methodology.md for the full protocol specification, including:

  • Detailed scoring rubrics
  • Judge persona definitions
  • Drop-score mathematics
  • Anti-gaming mechanisms
  • Category definitions

License

MIT License. See LICENSE.

Contributing

  1. Add tasks — submit PRs with new task JSON files
  2. Add categories — extend the category system for domain-specific evaluation
  3. Add judge personas — define new evaluation perspectives
  4. Improve analysis — enhance the reporting and visualization pipeline

Built by AB Support. The BBQ Benchmark is part of our work on AI trust infrastructure.

About

Smokehouse Eval — Competition-judging framework for AI model evaluation. Multi-judge adversarial scoring with BBQ drop-scoring mechanics.
