OpenBench is a fast, standardized evaluation framework built by @groqinc for reproducible LLM benchmarking.
This guide shows how to evaluate Nebius AI Studio-hosted open models (like Meta Llama 3 and Qwen) on benchmarks like MMLU, using OpenBench and a single terminal command.
Benchmarking lets you measure how well a language model performs on tasks like logic, math, code, or knowledge recall.
It's how we compare models like Llama 3, GPT-4, Claude, or Qwen using standardized tests (e.g. MMLU, GPQA, HumanEval).
We'll run a short evaluation on Llama-3.3-70B-Instruct-fast hosted by Nebius AI Studio:
bench eval mmlu \
--model openai/meta-llama/Llama-3.3-70B-Instruct-fast \
--limit 12 \
--temperature 0.6 \
--timeout 30000 \
--max-connections 40 \
--logfile logs/mmlu_sample.jsonl
You'll get back accuracy, token counts, and logs in under 15 seconds.
Install the uv Python environment manager:
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"
git clone https://github.com/groq/openbench.git
cd openbench
uv venv
source .venv/bin/activate
uv pip install -e .
Next, get a Nebius API key:
- Visit studio.nebius.com
- Sign in with GitHub or Google
- Go to Account Settings → API Keys and generate one
Then set the following environment variables:
export OPENAI_API_KEY=your_nebius_api_key_here
export OPENAI_BASE_URL=https://api.studio.nebius.com/v1
export INSPECT_MAX_CONNECTIONS=40
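Before launching a full eval, you can sanity-check your key against the endpoint. The `/models` route is the standard OpenAI-compatible listing call; this snippet assumes the two variables above are already exported:

```shell
# Build the models URL from the base URL set above (default shown as a fallback)
MODELS_URL="${OPENAI_BASE_URL:-https://api.studio.nebius.com/v1}/models"

# A JSON list of models back means the key and endpoint are good
curl -sf -H "Authorization: Bearer $OPENAI_API_KEY" "$MODELS_URL" \
  || echo "request failed -- check your key and base URL"
```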
⚠️ Note: OpenBench uses the OpenAI-compatible SDK, so the Nebius API works seamlessly with OPENAI_API_KEY.
Run a short MMLU benchmark on Llama-3.3-70B-Instruct-fast:
bench eval mmlu \
--model openai/meta-llama/Llama-3.3-70B-Instruct-fast \
--limit 12 \
--temperature 0.6 \
--timeout 30000 \
--max-connections 40 \
--logfile logs/mmlu_sample.jsonl
This evaluates the model on 12 academic-style questions (from philosophy to physics).
You can view results via log file:
cat logs/mmlu_sample.jsonl
Or launch the local results viewer:
bench view
Then visit http://localhost:7575 in your browser (if not blocked by firewall settings).
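The log file is JSON Lines: one JSON object per line. Exact field names depend on your OpenBench version, so treat the keys below as illustrative; the parsing pattern is the same either way:

```shell
# Illustrative: pull a single field out of a JSONL record with python3.
# We fabricate a sample line here; check your real log for the actual schema.
sample='{"task": "mmlu", "accuracy": 0.75, "samples": 12}'
echo "$sample" | python3 -c 'import json,sys; print(json.load(sys.stdin)["accuracy"])'
```

Swap `echo "$sample"` for `head -n 1 logs/mmlu_sample.jsonl` to read your own results the same way.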
Any Nebius-hosted model available in AI Studio will work.
You can try:
- openai/meta-llama/Meta-Llama-3.1-70B-Instruct
- openai/meta-llama/Llama-3.3-70B-Instruct-fast
- and others…
Just make sure the model ID you pass matches Nebius's naming format.
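To compare several models on the same slice, a small shell loop does the job. The model IDs below are examples; substitute whatever is listed in your AI Studio account:

```shell
# Run the same 12-question MMLU slice against several Nebius-hosted models
for model in \
  meta-llama/Meta-Llama-3.1-70B-Instruct \
  meta-llama/Llama-3.3-70B-Instruct-fast
do
  # one log file per model, named after the last path segment of the model ID
  bench eval mmlu \
    --model "openai/$model" \
    --limit 12 \
    --logfile "logs/mmlu_$(basename "$model").jsonl" \
    || echo "eval failed for $model"
done
```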
To list all available tests:
bench list
Some great quick ones:
- humaneval: code generation
- openbookqa: elementary science
- gpqa_diamond: graduate-level biology/chem/physics
- simpleqa: short factual answers
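Any of these slots into the same command shape; for example, a five-problem HumanEval smoke test (the limit and log path here are arbitrary choices):

```shell
bench eval humaneval \
  --model openai/meta-llama/Llama-3.3-70B-Instruct-fast \
  --limit 5 \
  --logfile logs/humaneval_sample.jsonl \
  || echo "eval failed -- is bench on your PATH and your API key set?"
```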
Huge shoutout to:
- @AarushSah_ and the Groq team for building OpenBench
- Inspect from the UK AI Safety Institute, which powers OpenBench's adapter layer
Running evaluations directly against production models, using the exact same APIs your apps will call, is the only way to know how your model will behave in the real world.
This is invaluable for:
- Comparing model variants
- Tracking regressions over time
- Validating fine-tuned versions
- Reporting scores externally
Nebius AI Studio provides hosted inference for top OSS models, fast startup, and zero-retention API usage, all served from Europe.