demianarc/openbench-nebius-guide

🧪 End-to-End Guide ▸ Evaluating Nebius AI Studio Models with OpenBench

OpenBench is a fast, standardized evaluation framework built by @groqinc for reproducible LLM benchmarking.

This guide shows how to evaluate Nebius AI Studio-hosted open models (like Meta Llama 3 and Qwen) on benchmarks like MMLU, using OpenBench and a single terminal command.


🧠 What is Model Benchmarking?

Benchmarking lets you measure how well a language model performs on tasks like logic, math, code, or knowledge recall.
It's how we compare models like Llama 3, GPT-4, Claude, or Qwen using standardized tests (e.g. MMLU, GPQA, HumanEval).
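At its core, a multiple-choice benchmark like MMLU reduces to exact-match scoring over a fixed question set. A minimal sketch of that idea (the questions and answers below are made up for illustration; this is not OpenBench's internal scoring code):

```python
# Minimal sketch of multiple-choice benchmark scoring.
# Real benchmarks like MMLU ship thousands of vetted items;
# the three-question answer key here is purely illustrative.

def score(predictions, answer_key):
    """Return exact-match accuracy of predictions against an answer key."""
    correct = sum(1 for qid, gold in answer_key.items()
                  if predictions.get(qid) == gold)
    return correct / len(answer_key)

answer_key = {"q1": "B", "q2": "D", "q3": "A"}
predictions = {"q1": "B", "q2": "C", "q3": "A"}

print(score(predictions, answer_key))  # 2 of 3 correct
```

Frameworks like OpenBench add the hard parts around this loop: prompt templating, answer extraction, parallel API calls, and logging.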


⚡ Quick Preview

We'll run a short evaluation on Llama-3.3-70B-Instruct-fast hosted by Nebius AI Studio:

bench eval mmlu \
  --model openai/meta-llama/Llama-3.3-70B-Instruct-fast \
  --limit 12 \
  --temperature 0.6 \
  --timeout 30000 \
  --max-connections 40 \
  --logfile logs/mmlu_sample.jsonl

You'll get back accuracy, token counts, and logs in under 15 seconds.


πŸ› οΈ Full Setup (Step-by-Step)

1. Install uv

Install the uv Python environment manager:

curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"

2. Clone and install OpenBench

git clone https://github.com/groq/openbench.git
cd openbench
uv venv
source .venv/bin/activate
uv pip install -e .

3. Get your Nebius API key

  • Visit studio.nebius.com
  • Sign in with GitHub or Google
  • Go to Account Settings → API Keys and generate one

Then set the following environment variables:

export OPENAI_API_KEY=your_nebius_api_key_here
export OPENAI_BASE_URL=https://api.studio.nebius.com/v1
export INSPECT_MAX_CONNECTIONS=40

⚠️ Note: OpenBench talks to models through the OpenAI-compatible SDK, so the Nebius endpoint works seamlessly once OPENAI_API_KEY and OPENAI_BASE_URL are set as above.
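Before launching a full eval, it's worth a quick smoke test that the key and base URL actually reach a model. A stdlib-only sketch against the OpenAI-compatible chat endpoint (the model ID is the one used throughout this guide; any Nebius-hosted model ID works, and the network call only fires when OPENAI_API_KEY is set):

```python
import json
import os
import urllib.request

def build_chat_request(base_url, api_key, model, prompt):
    """Build a POST request for the OpenAI-compatible /chat/completions endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 16,
    }).encode()
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )

if __name__ == "__main__":
    key = os.environ.get("OPENAI_API_KEY")
    base = os.environ.get("OPENAI_BASE_URL", "https://api.studio.nebius.com/v1")
    if key:  # only hit the network when a key is actually configured
        req = build_chat_request(base, key,
                                 "meta-llama/Llama-3.3-70B-Instruct-fast",
                                 "Reply with the word: ok")
        with urllib.request.urlopen(req, timeout=60) as resp:
            print(json.load(resp)["choices"][0]["message"]["content"])
```

If this prints a completion, `bench eval` should work with the same environment variables.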


✅ Run the Benchmark (Example: MMLU)

Run a short MMLU benchmark on Llama-3.3-70B-Instruct-fast:

bench eval mmlu \
  --model openai/meta-llama/Llama-3.3-70B-Instruct-fast \
  --limit 12 \
  --temperature 0.6 \
  --timeout 30000 \
  --max-connections 40 \
  --logfile logs/mmlu_sample.jsonl

This evaluates the model on 12 academic-style questions (from philosophy to physics).


📊 View Results

You can inspect the raw results directly from the log file:

cat logs/mmlu_sample.jsonl
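Because the log is newline-delimited JSON, it's easy to post-process. The exact field names depend on the OpenBench/Inspect log schema, so treat this as a generic JSONL reader rather than a schema reference:

```python
import json
from pathlib import Path

def read_jsonl(path):
    """Yield one parsed record per non-empty line of a JSONL file."""
    for line in Path(path).read_text().splitlines():
        if line.strip():
            yield json.loads(line)

# Hypothetical usage: count records and peek at the first record's keys.
log = Path("logs/mmlu_sample.jsonl")
if log.exists():
    records = list(read_jsonl(log))
    print(len(records), "records; first record keys:", sorted(records[0]))
```

From there you can pull out whatever fields the schema exposes (scores, token usage, per-sample transcripts) with ordinary dict access.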

Or launch the local results viewer:

bench view

Then visit http://localhost:7575 in your browser (if not blocked by firewall settings).


🤔 What Models Can I Test?

Any Nebius-hosted model available in AI Studio will work.
You can try:

  • openai/meta-llama/Meta-Llama-3.1-70B-Instruct
  • openai/meta-llama/Llama-3.3-70B-Instruct-fast
  • and others…

Just make sure the model ID you pass matches Nebius's naming format (the leading openai/ is OpenBench's provider selector, not part of the Nebius model name).
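You don't have to guess at IDs: the OpenAI-compatible API exposes a model-listing endpoint at /v1/models, which Nebius mirrors. A stdlib-only sketch that prints every available model ID (the network call runs only when OPENAI_API_KEY is set):

```python
import json
import os
import urllib.request

def models_url(base_url):
    """The OpenAI-compatible model-listing endpoint for a given base URL."""
    return base_url.rstrip("/") + "/models"

base = os.environ.get("OPENAI_BASE_URL", "https://api.studio.nebius.com/v1")
key = os.environ.get("OPENAI_API_KEY")
if key:  # avoid a network call when no key is configured
    req = urllib.request.Request(models_url(base),
                                 headers={"Authorization": f"Bearer {key}"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        for m in json.load(resp).get("data", []):
            print(m["id"])
```

Prefix whichever ID you choose with openai/ when passing it to `bench eval --model`.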


🧪 Try Other Benchmarks

To list all available tests:

bench list

Some great quick ones:

  • humaneval – for code generation
  • openbookqa – elementary science
  • gpqa_diamond – graduate-level biology/chem/physics
  • simpleqa – short factual answers

🙌 Credits

Huge shoutout to:

  • @AarushSah_ and the Groq team for building OpenBench
  • Inspect from the UK AI Safety Institute, which powers OpenBench's adapter layer

💡 Why This Matters

Running evaluations directly against production models, using the exact same APIs your apps will call, is the most reliable way to learn how a model will behave in the real world.

This is invaluable for:

  • Comparing model variants
  • Tracking regressions over time
  • Validating fine-tuned versions
  • Reporting scores externally

🔗 Nebius AI Studio

Nebius AI Studio provides hosted inference for top OSS models, fast startup, and zero-retention API usage, all from Europe.
