| Model | HLE-Verified-Gold (Biomed+Chem) Baseline | HLE-Verified-Gold (Biomed+Chem) Ours | BixBench Baseline | BixBench Ours | LAB-Bench 2 Baseline | LAB-Bench 2 Ours | Super Chemistry Baseline | Super Chemistry Ours |
|---|---|---|---|---|---|---|---|---|
| SOTA | 46.8 | | 80.5 | | 80.0 | | 38.5 | |
| Trinity-Large-Thinking | - | - | - | - | - | - | - | - |
| NVIDIA Nemotron-3 Super 120B | 10.7 | 29.5 | - | - | - | - | - | - |
| INTELLECT-3.1 | 17.4 | 24.2 | 5.4 | 20.0 | - | - | - | - |
| GLM-4.5 | - | - | - | - | - | - | - | - |
| Qwen3.5-397B-A17B | - | - | - | - | - | - | - | - |
| Claude Sonnet 4.5 | 20.8 | 41.6 | 17.1 | 42.4 | 48.1 | 71.3 | - | - |
| Claude Opus 4.5 | - | - | - | - | - | - | - | - |
| Claude Sonnet 4.6 | 23.5 | 43.6 | 40.5 | 48.3 | - | - | - | - |
| Claude Opus 4.6 | - | - | - | - | - | - | - | - |
| GPT-5.4 | - | - | - | - | - | - | - | - |
| Gemini 3 Flash | 38.26 | 50.34 | 38.54 | 69.27 | - | - | - | - |
| Gemini 3.1 Pro | - | - | 43.41 | 85.85 | - | - | - | - |
BioMedArena is a biomedical agent evaluation harness for comparing LLM backbones, tool-use modes, scorers, and datasets behind one CLI. It currently has 147 registered benchmarks, 75 tools, 4 modes, and 8 registered model IDs.
The project is designed as a practical research surface: add a dataset, choose a harness mode, expose a tool pack, run a matrix, and compare whether agentic behavior actually improves performance on biomedical, medical, chemistry, biology, protein, genomics, DNA/RNA, and healthcare tasks.
After installing dependencies, run the offline smoke suite:

```bash
python3 scripts/run_quick_suite.py
```

Expected healthy output:
- 147 registered benchmarks
- 75 registered tools
- 4 registered modes
- 20/20 scorer checks passed
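
For CI or scripted use, the same counts can be asserted programmatically. A minimal sketch, assuming the suite prints the lines above to stdout:

```python
import subprocess

# Run the offline smoke suite and capture its report.
proc = subprocess.run(
    ["python3", "scripts/run_quick_suite.py"],
    capture_output=True, text=True, check=True,
)

# Fail loudly if any expected healthy line is missing.
for expected in (
    "147 registered benchmarks",
    "75 registered tools",
    "4 registered modes",
    "20/20 scorer checks passed",
):
    assert expected in proc.stdout, f"missing: {expected}"
print("smoke suite healthy")
```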
For the stricter offline release gate:

```bash
python3 scripts/release_gate.py --strict
```

To set up from source:

```bash
git clone https://github.com/AI-in-Health/BioMedArena.git
cd BioMedArena
python3.11 -m venv .venv
source .venv/bin/activate
python -m pip install -e ".[dev,eval,provider-gemini]"
cp .env.example .env
```

Fill at least one model provider key in `.env`:
```
OPENAI_API_KEY=<your-openai-api-key>
ANTHROPIC_API_KEY=<your-anthropic-api-key>
GEMINI_API_KEY=<your-gemini-api-key>
HF_TOKEN=<your-huggingface-token-for-gated-benchmarks>
```

Gated HuggingFace datasets also require accepting the dataset terms in the browser before `HF_TOKEN` can load them. See `.env.example` for optional domain-specific keys such as NCBI, OMIM, Serper, and Jina.
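
Before a run, it can be worth confirming that at least one provider key is actually visible to the process. A small sketch, assuming the `python-dotenv` package (not necessarily a dependency of this repo):

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # read .env from the current working directory

PROVIDER_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GEMINI_API_KEY", "HF_TOKEN")
configured = [key for key in PROVIDER_KEYS if os.getenv(key)]
if not configured:
    raise SystemExit("No provider key set; fill at least one in .env")
print("Configured:", ", ".join(configured))
```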
List available resources:

```bash
bioagent list-benchmarks
bioagent list-backbones
bioagent list-modes
```

The package name is `biomedarena`; the command-line entry point remains `bioagent` for compatibility. Environment variables use the `BIOAGENT_` prefix for the same reason.
Run one benchmark cell:

```bash
bioagent run \
  --benchmark medcalc \
  --backbone gemini-2.5-flash \
  --tools biomed --reasoning-mode light \
  --limit 5 \
  --output result.json
```
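
The run writes its results to `result.json`. A quick inspection sketch; the field names (`accuracy`, `records`) are assumptions for illustration, not a documented schema:

```python
import json

with open("result.json") as f:
    result = json.load(f)

# Field names below are hypothetical; adapt to the actual output schema.
print("accuracy:", result.get("accuracy"))
print("items scored:", len(result.get("records", [])))
```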
Run a small matrix cell:

```bash
python3 scripts/run_matrix.py \
  --config configs/matrix_default.yaml \
  --only medcalc,gemini,simple_llm \
  --limit-override 1
```
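
Conceptually, a matrix run is the cross product of benchmarks, backbones, and modes, with one `bioagent run` per cell. A rough sketch of that idea, not the actual `run_matrix.py` implementation; it assumes the legacy `--mode` flag accepts the internal mode names:

```python
import itertools
import os
import subprocess

benchmarks = ["medcalc"]
backbones = ["gemini-2.5-flash"]
modes = ["simple_llm", "light"]

os.makedirs("results", exist_ok=True)

# One CLI invocation per (benchmark, backbone, mode) cell.
for bench, backbone, mode in itertools.product(benchmarks, backbones, modes):
    out = f"results/{bench}.{backbone}.{mode}.json"
    subprocess.run(
        ["bioagent", "run",
         "--benchmark", bench,
         "--backbone", backbone,
         "--mode", mode,   # legacy flag, assumed to accept internal mode names
         "--limit", "1",
         "--output", out],
        check=True,
    )
```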
Check official source accessibility before spending model budget:

```bash
python3 scripts/verify_benchmark_sources.py --benchmarks all
```
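
The idea behind source verification is simply to probe each official dataset URL before any model calls are made. An illustrative sketch, not the script itself; the source map here is hypothetical:

```python
import requests

# Hypothetical source map; the real script reads registered benchmark metadata.
SOURCES = {
    "medcalc": "https://example.org/dataset-landing-page",
}

for name, url in SOURCES.items():
    try:
        resp = requests.head(url, timeout=10, allow_redirects=True)
        status = "ok" if resp.status_code < 400 else f"HTTP {resp.status_code}"
    except requests.RequestException as exc:
        status = f"unreachable ({exc.__class__.__name__})"
    print(f"{name}: {status}")
```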
The public CLI exposes four modes:

| Mode | Purpose |
|---|---|
| `simple_llm` | Pure model baseline, no tools. |
| `deep_think` | Native model reasoning/thinking path where supported. |
| `light` | Single-turn function/tool calling. |
| `heavy` | Multi-turn ReAct loop with tool retrieval. |
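
For intuition, the `heavy` mode's multi-turn ReAct loop alternates model turns with tool calls until the model stops requesting tools. A minimal provider-agnostic sketch, with hypothetical `call_model` / `call_tool` callables rather than the harness's real interfaces:

```python
def react_loop(task, call_model, call_tool, max_turns=8):
    """Minimal ReAct-style loop: reason, act on a tool, observe, repeat."""
    transcript = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = call_model(transcript)               # model reasons; may request a tool
        transcript.append({"role": "assistant", "content": reply["text"]})
        if not reply.get("tool_call"):               # no tool requested: final answer
            return reply["text"]
        observation = call_tool(reply["tool_call"])  # act, then feed the observation back
        transcript.append({"role": "tool", "content": observation})
    return transcript[-1]["content"]                 # turn budget exhausted
```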
A unified set of flags is also available: `--tools` / `--reasoning-mode` / `--enable-thinking`, which map to the modes above:
| `--tools` | `--reasoning-mode` | Internal mode | Thinking |
|---|---|---|---|
| `off` | (n/a) | `deep_think` | ON (default) |
| `off` + `--enable-thinking 0` | (n/a) | `simple_llm` | OFF |
| `biomed` / `search` / `all` | `light` | `light` | OFF |
| `biomed` / `search` / `all` | `heavy` | `heavy` | ON |
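
The table above reads as a small pure function. A sketch that mirrors it (not the harness's actual flag resolver):

```python
def resolve_mode(tools, reasoning_mode=None, enable_thinking=True):
    """Map --tools / --reasoning-mode / --enable-thinking to an internal mode."""
    if tools == "off":
        # No tool packs: thinking toggles between deep_think and simple_llm.
        return "deep_think" if enable_thinking else "simple_llm"
    # Tool packs enabled (biomed / search / all): reasoning mode picks the loop.
    return "heavy" if reasoning_mode == "heavy" else "light"
```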
The legacy `--mode` / `--web-tools` flags remain supported for backward compatibility. Add `--self-consistency` to wrap any mode with majority voting.
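
Self-consistency amounts to running the same cell several times and majority-voting the final answers. A minimal sketch of such a wrapper, with a hypothetical `run_once` callable:

```python
from collections import Counter

def self_consistency(run_once, n=5):
    """Run the same cell n times and return the majority answer."""
    answers = [run_once() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```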
The root README stays short on purpose. Detailed release information lives in `docs/`.
`python_exec` can execute model-supplied Python with a timeout and basic
denylist checks. Treat this as a convenience guard, not a hardened
sandbox. Run untrusted workloads in an isolated container or VM, keep
secrets out of the working directory, and disable code-execution or
web-search tools for private data unless you have reviewed the policy.
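
For intuition, a convenience guard of this kind usually combines a substring denylist with a subprocess timeout, both of which are easy to defeat by design. An illustrative sketch, not the actual `python_exec` implementation:

```python
import subprocess

# Crude substring denylist; trivially bypassable, hence "convenience guard".
DENYLIST = ("import os", "import subprocess", "open(", "__import__")

def guarded_exec(code, timeout=10.0):
    """Best-effort guard: denylist scan plus timeout. NOT a security boundary."""
    if any(pattern in code for pattern in DENYLIST):
        raise ValueError("denylisted pattern in model-supplied code")
    # Raises subprocess.TimeoutExpired if the snippet runs too long.
    proc = subprocess.run(
        ["python3", "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout
```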
External tools may call third-party APIs and public databases. Review the benchmark and tool inventories before running sensitive workloads.
```bash
python3 scripts/run_quick_suite.py
python3 scripts/release_gate.py --strict
python3 -m pytest tests/unit -q
python3 -m pytest tests/smoke -q -m "not slow"
```

See LICENSE. Ported life-science skill attribution is tracked in `harness/tools/openai_ported/NOTICE.md`.