BioMedArena

Each cell shows Baseline → Ours on that benchmark; the SOTA row lists a single reference score per benchmark.

| Model | HLE-Verified-Gold (Biomed+Chem) | BixBench | LAB-Bench 2 | Super Chemistry |
| --- | --- | --- | --- | --- |
| SOTA | 46.8 | 80.5 | 80.0 | 38.5 |
| Trinity-Large-Thinking | -- | -- | -- | -- |
| NVIDIA Nemotron-3 Super 120B | 10.7 → 29.5 | -- | -- | -- |
| INTELLECT-3.1 | 17.4 → 24.2 | 5.4 → 20.0 | -- | -- |
| GLM-4.5 | -- | -- | -- | -- |
| Qwen3.5-397B-A17B | -- | -- | -- | -- |
| Claude Sonnet 4.5 | 20.8 → 41.6 | 17.1 → 42.4 | 48.1 → 71.3 | -- |
| Claude Opus 4.5 | -- | -- | -- | -- |
| Claude Sonnet 4.6 | 23.5 → 43.6 | 40.5 → 48.3 | -- | -- |
| Claude Opus 4.6 | -- | -- | -- | -- |
| GPT-5.4 | -- | -- | -- | -- |
| Gemini 3 Flash | 38.26 → 50.34 | 38.54 → 69.27 | -- | -- |
| Gemini 3.1 Pro | -- | 43.41 → 85.85 | -- | -- |

BioMedArena is a biomedical agent evaluation harness for comparing LLM backbones, tool-use modes, scorers, and datasets behind one CLI. It currently has 147 registered benchmarks, 75 tools, 4 modes, and 8 registered model IDs.

The project is designed as a practical research surface: add a dataset, choose a harness mode, expose a tool pack, run a matrix, and compare whether agentic behavior actually improves performance on biomedical, medical, chemistry, biology, protein, genomics, DNA/RNA, and healthcare tasks.

Quick Check

After installing dependencies, run the offline smoke suite:

python3 scripts/run_quick_suite.py

Expected healthy output:

  • 147 registered benchmarks
  • 75 registered tools
  • 4 registered modes
  • 20/20 scorer checks passed

For the stricter offline release gate:

python3 scripts/release_gate.py --strict

Installation

git clone https://github.com/AI-in-Health/BioMedArena.git
cd BioMedArena

python3.11 -m venv .venv
source .venv/bin/activate

python -m pip install -e ".[dev,eval,provider-gemini]"

cp .env.example .env

Fill in at least one model provider key in .env:

OPENAI_API_KEY=<your-openai-api-key>
ANTHROPIC_API_KEY=<your-anthropic-api-key>
GEMINI_API_KEY=<your-gemini-api-key>
HF_TOKEN=<your-huggingface-token-for-gated-benchmarks>

Gated HuggingFace datasets also require accepting the dataset terms in the browser before HF_TOKEN can load them. See .env.example for optional domain-specific keys such as NCBI, OMIM, Serper, and Jina.
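
To sanity-check that the token works, you can try loading one gated benchmark directly with the datasets library. A minimal sketch; the dataset ID below is a placeholder for a gated benchmark whose terms you have accepted (older datasets versions take use_auth_token= instead of token=):

import os
from datasets import load_dataset

# Placeholder ID: substitute a gated benchmark you have accepted terms for.
ds = load_dataset(
    "some-org/some-gated-benchmark",
    split="test",
    token=os.environ["HF_TOKEN"],  # read from your filled-in .env
)
print(ds)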

Basic Usage

List available resources:

bioagent list-benchmarks
bioagent list-backbones
bioagent list-modes

The package name is biomedarena; the command-line entry point remains bioagent for compatibility. Environment variables use the BIOAGENT_ prefix for the same reason.

Run one benchmark cell:

bioagent run \
  --benchmark medcalc \
  --backbone gemini-2.5-flash \
  --tools biomed --reasoning-mode light \
  --limit 5 \
  --output result.json
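
To inspect the run programmatically, load the JSON output. The exact schema of result.json is not documented here, so the keys below are assumptions; open the file once and adapt:

import json

with open("result.json") as f:
    result = json.load(f)

# Hypothetical keys: check the actual file and adjust.
print("benchmark:", result.get("benchmark"))
print("accuracy:", result.get("accuracy"))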

Run a small matrix cell:

python3 scripts/run_matrix.py \
  --config configs/matrix_default.yaml \
  --only medcalc,gemini,simple_llm \
  --limit-override 1
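
Conceptually, a matrix run expands benchmarks × backbones × modes into individual cells like the single run above; --only filters that cross-product and --limit-override caps examples per cell. A rough sketch of the expansion (the lists are illustrative, not the real configs/matrix_default.yaml schema):

import itertools

benchmarks = ["medcalc"]
backbones = ["gemini-2.5-flash"]
modes = ["simple_llm", "light", "heavy"]

# Each (benchmark, backbone, mode) triple is one matrix cell.
for benchmark, backbone, mode in itertools.product(benchmarks, backbones, modes):
    print(f"cell: {benchmark} / {backbone} / {mode}")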

Check that official benchmark sources are reachable before spending model budget:

python3 scripts/verify_benchmark_sources.py --benchmarks all

Execution Modes

The public CLI exposes four modes:

| Mode | Purpose |
| --- | --- |
| simple_llm | Pure model baseline, no tools. |
| deep_think | Native model reasoning/thinking path where supported. |
| light | Single-turn function/tool calling. |
| heavy | Multi-turn ReAct loop with tool retrieval (sketched below). |
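
For intuition, heavy mode follows the usual ReAct pattern: the model proposes a tool call, the harness executes it, and the observation is fed back until the model commits to an answer. A conceptual sketch with hypothetical call_model and run_tool helpers, not the harness's actual internals:

def react_loop(question, call_model, run_tool, max_turns=8):
    """Alternate model steps and tool observations until a final answer."""
    transcript = [("question", question)]
    for _ in range(max_turns):
        step = call_model(transcript)  # proposes a tool call or a final answer
        if step["type"] == "final_answer":
            return step["answer"]
        observation = run_tool(step["tool"], step["arguments"])
        transcript.append(("observation", observation))
    return None  # no answer within the turn budget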

The same modes can also be selected through the unified --tools / --reasoning-mode / --enable-thinking flags, which map as follows:

| --tools | --reasoning-mode | Internal mode | Thinking |
| --- | --- | --- | --- |
| off | (n/a) | deep_think | ON (default) |
| off + --enable-thinking 0 | (n/a) | simple_llm | OFF |
| biomed / search / all | light | light | OFF |
| biomed / search / all | heavy | heavy | ON |
The legacy --mode / --web-tools flags remain supported for backward compatibility. Add --self-consistency to wrap any mode with majority voting.
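
Majority voting here means sampling the same question several times and keeping the most frequent answer. A minimal sketch of the aggregation step, where ask_once is a hypothetical stand-in for one run of the chosen mode:

from collections import Counter

def majority_vote(question, ask_once, n_samples=5):
    """Sample n answers and return the most common one."""
    answers = [ask_once(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]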

Documentation

The root README stays short on purpose. Detailed release information lives in docs/.

Security

python_exec can execute model-supplied Python with a timeout and basic denylist checks. Treat this as a convenience guard, not a hardened sandbox. Run untrusted workloads in an isolated container or VM, keep secrets out of the working directory, and disable code-execution or web-search tools for private data unless you have reviewed the policy.
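
For illustration only (not the harness's actual implementation), a timeout-plus-denylist guard looks roughly like this:

import subprocess
import sys

DENYLIST = ("import os", "shutil", "subprocess")  # crude substring checks

def run_untrusted(code: str, timeout_s: int = 10) -> str:
    """Run model-supplied Python in a subprocess with a timeout and a naive denylist."""
    if any(token in code for token in DENYLIST):
        raise ValueError("code rejected by denylist")
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout_s,
    )
    return proc.stdout

Substring checks like these are trivially evaded (for example via getattr or encoded strings), which is exactly why untrusted workloads belong in a container or VM.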

External tools may call third-party APIs and public databases. Review the benchmark and tool inventories before running sensitive workloads.

Testing

python3 scripts/run_quick_suite.py
python3 scripts/release_gate.py --strict
python3 -m pytest tests/unit -q
python3 -m pytest tests/smoke -q -m "not slow"

License

See LICENSE. Ported life-science skill attribution is tracked in harness/tools/openai_ported/NOTICE.md.
