OpenBench is a fast, standardized evaluation framework built by @groqinc for reproducible LLM benchmarking.
This guide shows how to evaluate Nebius AI Studio-hosted open models (like Meta Llama 3 and Qwen) on benchmarks like MMLU, using OpenBench and a single terminal command.
Benchmarking lets you measure how well a language model performs on tasks like logic, math, code, or knowledge recall.
It's how we compare models like Llama 3, GPT-4, Claude, or Qwen using standardized tests (e.g. MMLU, GPQA, HumanEval).
We'll run a short evaluation on Llama-3.3-70B-Instruct-fast hosted by Nebius AI Studio:
bench eval mmlu \
--model openai/meta-llama/Llama-3.3-70B-Instruct-fast \
--limit 12 \
--temperature 0.6 \
--timeout 30000 \
--max-connections 40 \
--logfile logs/mmlu_sample.jsonl
You'll get back accuracy, token counts, and logs in under 15 seconds.
Install the uv Python environment manager:
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"
git clone https://github.com/groq/openbench.git
cd openbench
uv venv
source .venv/bin/activate
uv pip install -e .
Next, get a Nebius API key:
- Visit studio.nebius.com
- Sign in with GitHub or Google
- Go to Account Settings → API Keys and generate one
Then set the following environment variables:
export OPENAI_API_KEY=your_nebius_api_key_here
export OPENAI_BASE_URL=https://api.studio.nebius.com/v1
export INSPECT_MAX_CONNECTIONS=40
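Before launching a full eval, you can sanity-check your key against the endpoint. The `/models` route is the standard OpenAI-compatible listing call; this snippet assumes the two variables above are already exported:

```shell
# Build the models URL from the base URL set above (default shown as a fallback)
MODELS_URL="${OPENAI_BASE_URL:-https://api.studio.nebius.com/v1}/models"

# A JSON list of models back means the key and endpoint are good
curl -sf -H "Authorization: Bearer $OPENAI_API_KEY" "$MODELS_URL" \
  || echo "request failed -- check your key and base URL"
```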
⚠️ Note: OpenBench uses the OpenAI-compatible SDK, so the Nebius API works seamlessly with OPENAI_API_KEY.
Run a short MMLU benchmark on Llama-3.3-70B-Instruct-fast:
bench eval mmlu \
--model openai/meta-llama/Llama-3.3-70B-Instruct-fast \
--limit 12 \
--temperature 0.6 \
--timeout 30000 \
--max-connections 40 \
--logfile logs/mmlu_sample.jsonl
This evaluates the model on 12 academic-style questions (from philosophy to physics).
You can view results via log file:
cat logs/mmlu_sample.jsonl
Or launch the local results viewer:
bench view
Then visit http://localhost:7575 in your browser (if not blocked by firewall settings).
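The log file is JSON Lines: one JSON object per line. Exact field names depend on your OpenBench version, so treat the keys below as illustrative; the parsing pattern is the same either way:

```shell
# Illustrative: pull a single field out of a JSONL record with python3.
# We fabricate a sample line here; check your real log for the actual schema.
sample='{"task": "mmlu", "accuracy": 0.75, "samples": 12}'
echo "$sample" | python3 -c 'import json,sys; print(json.load(sys.stdin)["accuracy"])'
```

Swap `echo "$sample"` for `head -n 1 logs/mmlu_sample.jsonl` to read your own results the same way.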
Any Nebius-hosted model available in AI Studio will work.
You can try:
- openai/meta-llama/Meta-Llama-3.1-70B-Instruct
- openai/meta-llama/Llama-3.3-70B-Instruct-fast
- and others…
Just make sure the model ID you pass matches Nebius's naming format.
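To compare several models on the same slice, a small shell loop does the job. The model IDs below are examples; substitute whatever is listed in your AI Studio account:

```shell
# Run the same 12-question MMLU slice against several Nebius-hosted models
for model in \
  meta-llama/Meta-Llama-3.1-70B-Instruct \
  meta-llama/Llama-3.3-70B-Instruct-fast
do
  # one log file per model, named after the last path segment of the model ID
  bench eval mmlu \
    --model "openai/$model" \
    --limit 12 \
    --logfile "logs/mmlu_$(basename "$model").jsonl" \
    || echo "eval failed for $model"
done
```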
To list all available tests:
bench list
Some great quick ones:
- humaneval: code generation
- openbookqa: elementary science
- gpqa_diamond: graduate-level biology/chem/physics
- simpleqa: short factual answers
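Any of these slots into the same command shape; for example, a five-problem HumanEval smoke test (the limit and log path here are arbitrary choices):

```shell
bench eval humaneval \
  --model openai/meta-llama/Llama-3.3-70B-Instruct-fast \
  --limit 5 \
  --logfile logs/humaneval_sample.jsonl \
  || echo "eval failed -- is bench on your PATH and your API key set?"
```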
Huge shoutout to:
- @AarushSah_ and the Groq team for building OpenBench
- Inspect from the UK AI Safety Institute, which powers OpenBench's adapter layer
Running evaluations directly against production models, using the exact same APIs your apps will call, is the only way to know how your model will behave in the real world.
This is invaluable for:
- Comparing model variants
- Tracking regressions over time
- Validating fine-tuned versions
- Reporting scores externally
Nebius AI Studio provides hosted inference for top OSS models, fast startup, and zero-retention API usage, all served from Europe.