| Model | HLE-Verified-Gold (Biomed+Chem) Baseline | HLE-Verified-Gold (Biomed+Chem) Ours | BixBench Baseline | BixBench Ours | LAB-Bench 2 Baseline | LAB-Bench 2 Ours | Super Chemistry Baseline | Super Chemistry Ours |
|---|---|---|---|---|---|---|---|---|
| SOTA | 46.8 | | 80.5 | | 80.0 | | 38.5 | |
| Trinity-Large-Thinking | - | - | - | - | - | - | - | - |
| NVIDIA Nemotron-3 Super 120B | 10.7 | 29.5 | - | - | - | - | - | - |
| INTELLECT-3.1 | 17.4 | 24.2 | 5.4 | 20.0 | - | - | - | - |
| GLM-4.5 | - | - | - | - | - | - | - | - |
| Qwen3.5-397B-A17B | - | - | - | - | - | - | - | - |
| Claude Sonnet 4.5 | 20.8 | 41.6 | 17.1 | 42.4 | 48.1 | 71.3 | - | - |
| Claude Opus 4.5 | - | - | - | - | - | - | - | - |
| Claude Sonnet 4.6 | 23.5 | 43.6 | 40.5 | 48.3 | - | - | - | - |
| Claude Opus 4.6 | - | - | - | - | - | - | - | - |
| GPT-5.4 | - | - | - | - | - | - | - | - |
| Gemini 3 Flash | 38.26 | 50.34 | 38.54 | 69.27 | - | - | - | - |
| Gemini 3.1 Pro | - | - | 43.41 | 85.85 | - | - | - | - |
BioMedArena is a biomedical agent evaluation harness for comparing LLM backbones, tool-use modes, scorers, and datasets behind one CLI. It currently has 147 registered benchmarks, 75 tools, 4 modes, and 8 registered model IDs.
The project is designed as a practical research surface: add a dataset, choose a harness mode, expose a tool pack, run a matrix, and compare whether agentic behavior actually improves performance on biomedical, medical, chemistry, biology, protein, genomics, DNA/RNA, and healthcare tasks.
After installing dependencies, run the offline smoke suite:

```bash
python3 scripts/run_quick_suite.py
```

Expected healthy output:
- 147 registered benchmarks
- 75 registered tools
- 4 registered modes
- 20/20 scorer checks passed
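
For CI or scripted use, the same counts can be asserted programmatically. A minimal sketch, assuming the suite prints the lines above to stdout:

```python
import subprocess

# Run the offline smoke suite and capture its report.
proc = subprocess.run(
    ["python3", "scripts/run_quick_suite.py"],
    capture_output=True, text=True, check=True,
)

# Fail loudly if any expected healthy line is missing.
for expected in (
    "147 registered benchmarks",
    "75 registered tools",
    "4 registered modes",
    "20/20 scorer checks passed",
):
    assert expected in proc.stdout, f"missing: {expected}"
print("smoke suite healthy")
```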
For the stricter offline release gate:

```bash
python3 scripts/release_gate.py --strict
```

To set up from source:

```bash
git clone https://github.com/AI-in-Health/BioMedArena.git
cd BioMedArena
python3.11 -m venv .venv
source .venv/bin/activate
python -m pip install -e ".[dev,eval,provider-gemini]"
cp .env.example .env
```

Fill at least one model provider key in `.env`:
```
OPENAI_API_KEY=<your-openai-api-key>
ANTHROPIC_API_KEY=<your-anthropic-api-key>
GEMINI_API_KEY=<your-gemini-api-key>
HF_TOKEN=<your-huggingface-token-for-gated-benchmarks>
```

Gated HuggingFace datasets also require accepting the dataset terms in the browser before `HF_TOKEN` can load them. See `.env.example` for optional domain-specific keys such as NCBI, OMIM, Serper, and Jina.
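
Before a run, it can be worth confirming that at least one provider key is actually visible to the process. A small sketch, assuming the `python-dotenv` package (not necessarily a dependency of this repo):

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # read .env from the current working directory

PROVIDER_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GEMINI_API_KEY", "HF_TOKEN")
configured = [key for key in PROVIDER_KEYS if os.getenv(key)]
if not configured:
    raise SystemExit("No provider key set; fill at least one in .env")
print("Configured:", ", ".join(configured))
```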
List available resources:

```bash
bioagent list-benchmarks
bioagent list-backbones
bioagent list-modes
```

The package name is `biomedarena`; the command-line entry point remains `bioagent` for compatibility. Environment variables use the `BIOAGENT_` prefix for the same reason.
Run one benchmark cell:

```bash
bioagent run \
  --benchmark medcalc \
  --backbone gemini-2.5-flash \
  --tools biomed --reasoning-mode light \
  --limit 5 \
  --output result.json
```
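
The run writes its results to `result.json`. A quick inspection sketch; the field names (`accuracy`, `records`) are assumptions for illustration, not a documented schema:

```python
import json

with open("result.json") as f:
    result = json.load(f)

# Field names below are hypothetical; adapt to the actual output schema.
print("accuracy:", result.get("accuracy"))
print("items scored:", len(result.get("records", [])))
```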
Run a small matrix cell:

```bash
python3 scripts/run_matrix.py \
  --config configs/matrix_default.yaml \
  --only medcalc,gemini,simple_llm \
  --limit-override 1
```
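
Conceptually, a matrix run is the cross product of benchmarks, backbones, and modes, with one `bioagent run` per cell. A rough sketch of that idea, not the actual `run_matrix.py` implementation; it assumes the legacy `--mode` flag accepts the internal mode names:

```python
import itertools
import os
import subprocess

benchmarks = ["medcalc"]
backbones = ["gemini-2.5-flash"]
modes = ["simple_llm", "light"]

os.makedirs("results", exist_ok=True)

# One CLI invocation per (benchmark, backbone, mode) cell.
for bench, backbone, mode in itertools.product(benchmarks, backbones, modes):
    out = f"results/{bench}.{backbone}.{mode}.json"
    subprocess.run(
        ["bioagent", "run",
         "--benchmark", bench,
         "--backbone", backbone,
         "--mode", mode,   # legacy flag, assumed to accept internal mode names
         "--limit", "1",
         "--output", out],
        check=True,
    )
```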
Check official source accessibility before spending model budget:

```bash
python3 scripts/verify_benchmark_sources.py --benchmarks all
```
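
The idea behind source verification is simply to probe each official dataset URL before any model calls are made. An illustrative sketch, not the script itself; the source map here is hypothetical:

```python
import requests

# Hypothetical source map; the real script reads registered benchmark metadata.
SOURCES = {
    "medcalc": "https://example.org/dataset-landing-page",
}

for name, url in SOURCES.items():
    try:
        resp = requests.head(url, timeout=10, allow_redirects=True)
        status = "ok" if resp.status_code < 400 else f"HTTP {resp.status_code}"
    except requests.RequestException as exc:
        status = f"unreachable ({exc.__class__.__name__})"
    print(f"{name}: {status}")
```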
The public CLI exposes four modes:

| Mode | Purpose |
|---|---|
| `simple_llm` | Pure model baseline, no tools. |
| `deep_think` | Native model reasoning/thinking path where supported. |
| `light` | Single-turn function/tool calling. |
| `heavy` | Multi-turn ReAct loop with tool retrieval. |
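
For intuition, the `heavy` mode's multi-turn ReAct loop alternates model turns with tool calls until the model stops requesting tools. A minimal provider-agnostic sketch, with hypothetical `call_model` / `call_tool` callables rather than the harness's real interfaces:

```python
def react_loop(task, call_model, call_tool, max_turns=8):
    """Minimal ReAct-style loop: reason, act on a tool, observe, repeat."""
    transcript = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = call_model(transcript)               # model reasons; may request a tool
        transcript.append({"role": "assistant", "content": reply["text"]})
        if not reply.get("tool_call"):               # no tool requested: final answer
            return reply["text"]
        observation = call_tool(reply["tool_call"])  # act, then feed the observation back
        transcript.append({"role": "tool", "content": observation})
    return transcript[-1]["content"]                 # turn budget exhausted
```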
A unified set of flags is also available: `--tools` / `--reasoning-mode` / `--enable-thinking`, which map to the modes above:
| `--tools` | `--reasoning-mode` | Internal mode | Thinking |
|---|---|---|---|
| `off` | (n/a) | `deep_think` | ON (default) |
| `off` + `--enable-thinking 0` | (n/a) | `simple_llm` | OFF |
| `biomed` / `search` / `all` | `light` | `light` | OFF |
| `biomed` / `search` / `all` | `heavy` | `heavy` | ON |
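
The table above reads as a small pure function. A sketch that mirrors it (not the harness's actual flag resolver):

```python
def resolve_mode(tools, reasoning_mode=None, enable_thinking=True):
    """Map --tools / --reasoning-mode / --enable-thinking to an internal mode."""
    if tools == "off":
        # No tool packs: thinking toggles between deep_think and simple_llm.
        return "deep_think" if enable_thinking else "simple_llm"
    # Tool packs enabled (biomed / search / all): reasoning mode picks the loop.
    return "heavy" if reasoning_mode == "heavy" else "light"
```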
The legacy `--mode` / `--web-tools` flags remain supported for backward compatibility. Add `--self-consistency` to wrap any mode with majority voting.
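
Self-consistency amounts to running the same cell several times and majority-voting the final answers. A minimal sketch of such a wrapper, with a hypothetical `run_once` callable:

```python
from collections import Counter

def self_consistency(run_once, n=5):
    """Run the same cell n times and return the majority answer."""
    answers = [run_once() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```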
The root README stays short on purpose. Detailed release information lives in `docs/`.
`python_exec` can execute model-supplied Python with a timeout and basic
denylist checks. Treat this as a convenience guard, not a hardened
sandbox. Run untrusted workloads in an isolated container or VM, keep
secrets out of the working directory, and disable code-execution or
web-search tools for private data unless you have reviewed the policy.
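
For intuition, a convenience guard of this kind usually combines a substring denylist with a subprocess timeout, both of which are easy to defeat by design. An illustrative sketch, not the actual `python_exec` implementation:

```python
import subprocess

# Crude substring denylist; trivially bypassable, hence "convenience guard".
DENYLIST = ("import os", "import subprocess", "open(", "__import__")

def guarded_exec(code, timeout=10.0):
    """Best-effort guard: denylist scan plus timeout. NOT a security boundary."""
    if any(pattern in code for pattern in DENYLIST):
        raise ValueError("denylisted pattern in model-supplied code")
    # Raises subprocess.TimeoutExpired if the snippet runs too long.
    proc = subprocess.run(
        ["python3", "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout
```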
External tools may call third-party APIs and public databases. Review the benchmark and tool inventories before running sensitive workloads.
```bash
python3 scripts/run_quick_suite.py
python3 scripts/release_gate.py --strict
python3 -m pytest tests/unit -q
python3 -m pytest tests/smoke -q -m "not slow"
```

See LICENSE. Ported life-science skill attribution is tracked in `harness/tools/openai_ported/NOTICE.md`.