Benchmarker

Benchmarker is a simple yet powerful CLI tool for benchmarking LLMs and agentic workflows. It lets you evaluate language models and execution strategies against custom datasets and scoring rules, which makes it well suited to LLM research, product evaluations, and catching performance regressions.

Benchmarker is a Python-based command-line tool that:

Loads benchmark suites defined in YAML
Runs model scenarios using configurable strategies
Scores model responses using a judge model or logic-based scorer
Generates rich output reports and saves detailed traces
NEW: Supports image OCR and handwriting recognition testing

Built with modularity in mind, Benchmarker supports multiple providers, models, scoring strategies, prompt styles, and now multimodal capabilities, making it a flexible backbone for comprehensive LLM evaluations.

💡 Why?

When working with LLMs, we often face a fundamental decision-making problem:

Should I use a powerful model with simple one-shot prompting? Or a lighter-weight model in a multi-agent setup?

This question comes up constantly, whether you’re building small apps, experimenting with new strategies, or running model comparisons. Choosing a model, a prompting strategy, or a way to structure agent interactions often becomes guesswork.

Even flagship models behave differently depending on:

Prompting style (one-shot vs few-shot vs programmatic)
Strategy (direct call vs agent collaboration)
Model type (high-parameter “thinking” models like GPT-4-Turbo vs faster lightweight variants like GPT-4o)

There wasn’t a simple way to experiment systematically, measure tradeoffs, and decide with data.

So I built Benchmarker.

⚙️ Setup

VS Code Dev Container Setup (Recommended)

The easiest way to get started with all dependencies pre-configured:

Open in VS Code: Clone the repository and open it in VS Code
Open as Dev Container: When prompted, select "Reopen in Container". This automatically builds a Docker environment with all dependencies, including Python, poppler for PDF processing, and the required libraries
Configure credentials: Copy the environment template and add your API credentials
```
cp .env.template .env
# Edit .env with your actual Azure OpenAI credentials
```

That's it! The dev container handles all the setup automatically.

⚙️ Run

1. Edit the suite (Q&A suite)

```
# Edit the suite file
nano data/suites/basic.yaml
```

```
suite:
  name: "Simple Q&A Benchmark"
  description: "Configuration for running simple Q&A benchmark with Azure OpenAI"

  dataset: "data/datasets/basic.yaml"

  scenarios:
    - name: "qa_test_azure_oneshot"
      strategy: "direct_prompt"
      provider: "azure_openai"

  scoring:
    name: "scoring judge"
    strategy: "llm_judge"
    provider: "azure_openai"
    max_score: 10
    prompt: |
      You are evaluating the quality of an answer to a factual question.

      Question: {question}
      Expected Answer: {expected_answer}
      Actual Answer: {response}

      Rate the answer on a scale of 0-10 where:
      - 10: Perfect, accurate answer
      - 8-9: Mostly correct with minor issues
      - 6-7: Partially correct
      - 4-5: Some relevant information but mostly incorrect
      - 0-3: Completely wrong or irrelevant

      Respond with a JSON object containing 'score' (number) and 'explanation' (string).
    scoring:
      - "Factual accuracy"
      - "Completeness of answer"
      - "Clarity and coherence"

  output:
    path: "data/results/qa_evaluation.json"

  providers:
    azure_openai:
      type: azure_openai
      endpoint: ${ENDPOINT}
      api_version: ${API_VERSION}
      api_key: ${API_KEY}
      deployment: ${DEPLOYMENT}
      model: ${MODEL}
```
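
The `${ENDPOINT}`-style placeholders in the providers block are presumably filled from the variables you put in `.env`. As a minimal sketch of how that substitution could work (the function name `expand_placeholders` and the sample values are assumptions for illustration, not Benchmarker's actual loader):

```python
import os
import re

# Matches ${NAME} placeholders like the ones in the providers block.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_placeholders(value, env=None):
    """Recursively replace ${NAME} in strings inside dicts and lists.

    Raises KeyError naming the first variable that is not set, so a missing
    credential fails early instead of reaching the provider as literal text.
    """
    env = os.environ if env is None else env
    if isinstance(value, dict):
        return {k: expand_placeholders(v, env) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_placeholders(v, env) for v in value]
    if isinstance(value, str):
        return PLACEHOLDER.sub(lambda m: env[m.group(1)], value)
    return value


# Hypothetical values standing in for what .env would provide.
example_env = {"ENDPOINT": "https://example.invalid", "API_KEY": "secret"}
provider = {"type": "azure_openai", "endpoint": "${ENDPOINT}", "api_key": "${API_KEY}"}
resolved = expand_placeholders(provider, example_env)
print(resolved["endpoint"])
```

Failing on an unset variable is a design choice in this sketch: it surfaces a missing credential before any model call is made.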

```
# Edit the dataset file
nano data/datasets/basic.yaml
```

```
dataset:
  name: "Simple Q&A Dataset"
  description: "Basic factual questions for capability assessment"
  version: "1.0"

  questions:
    - id: "geo_001"
      prompt: "What is the capital of France?"
      expected: "Paris"
      category: "geography"
      difficulty: "easy"

    - id: "sci_001"
      prompt: "What is the chemical symbol for gold?"
      expected: "Au"
      category: "science"
      difficulty: "easy"
```
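
To make the link between the dataset and the `llm_judge` scorer concrete, here is a rough sketch of how a question could be turned into a judge prompt and how the judge's JSON reply could be read back. The mapping of `prompt` and `expected` onto `{question}` and `{expected_answer}` is an assumption, as are the helper names; the judge reply is hand-written rather than produced by a model.

```python
import json

# Shortened version of the judge prompt from the suite file.
JUDGE_PROMPT = (
    "Question: {question}\n"
    "Expected Answer: {expected_answer}\n"
    "Actual Answer: {response}\n"
    "Respond with a JSON object containing 'score' (number) and 'explanation' (string)."
)


def build_judge_prompt(item, response):
    """Map dataset fields (prompt, expected) onto the judge placeholders."""
    return JUDGE_PROMPT.format(
        question=item["prompt"],
        expected_answer=item["expected"],
        response=response,
    )


def parse_judge_reply(reply, max_score=10):
    """Parse the judge's JSON reply and clamp the score to [0, max_score]."""
    data = json.loads(reply)
    score = max(0.0, min(float(data["score"]), float(max_score)))
    return score, str(data.get("explanation", ""))


item = {"id": "geo_001", "prompt": "What is the capital of France?", "expected": "Paris"}
prompt = build_judge_prompt(item, "Paris")
# A hand-written reply standing in for a real judge model call.
score, explanation = parse_judge_reply('{"score": 10, "explanation": "Exact match."}')
print(score, explanation)
```

Clamping to `max_score` guards against a judge that answers outside the requested 0-10 range.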

2. Run the benchmark

```
python -m app run data/suites/basic.yaml
```

3. View the results

Results will be saved to data/results/qa_evaluation.json
Console will display a summary table with scores, latency, and status
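
If you want to post-process the saved results yourself, something along these lines may be a starting point. The record layout below (`score`, `latency_s`, `status`) is purely hypothetical, chosen to mirror the summary columns mentioned above; check the real `qa_evaluation.json` before relying on any field name.

```python
import json
from statistics import mean

# ASSUMPTION: a hypothetical result layout, not Benchmarker's real schema.
sample = json.loads("""
[
  {"id": "geo_001", "score": 10, "latency_s": 0.8, "status": "ok"},
  {"id": "sci_001", "score": 8, "latency_s": 1.2, "status": "ok"}
]
""")


def summarize(records):
    """Return mean score, mean latency, and the count of non-ok records."""
    return {
        "mean_score": mean(r["score"] for r in records),
        "mean_latency_s": mean(r["latency_s"] for r in records),
        "failures": sum(1 for r in records if r["status"] != "ok"),
    }


summary = summarize(sample)
print(summary)
```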

How to Contribute

We welcome contributions from the community! To get started:

1. Fork the repo
2. Create a feature branch: git checkout -b feature/my-feature
3. Make your changes and write tests if applicable
4. Run benchmarks or linters to verify your update
5. Submit a pull request with a clear description

If you’re not sure where to start, check out our Issues page.

Next Steps

Here’s a preview of what’s coming next or areas where you can contribute:

🤖 AutoGen support: Add native support for AutoGen-style multi-agent evaluation
🧠 Include larger & SOTA datasets: Add benchmarks like GSM8K, ARC, MMLU
📚 Support RAG benchmarks: Benchmark retrieval-augmented pipelines via synthetic queries or hybrid eval
🔧 Fine-tuning integration: Evaluate checkpoints pre/post fine-tuning on key tasks

Want to lead one of these features? Open an issue or propose a draft!

License

MIT License. See LICENSE for details.
