WIP: this repository is still in active development.
BioTaskBench is a benchmark suite for evaluating LLM/agent bioinformatics capability, plus a unified harness for running BioTaskBench and external benchmark suites under one CLI.
- A BioTaskBench test corpus across 10 domains under tests/ (currently 33 tests).
- Deterministic grading harness with weighted multi-criteria scoring:
  - file_check, column_check, exact_match, range_check
  - set_overlap, numeric_correlation, code_executes (plus llm_judge stub)
- Coverage and score reporting:
  - coverage = attempted / total
  - completion_rate = tests with score > 0
  - score = mean across attempted tests
  - score_overall = mean across all tests
  - domain and difficulty breakdowns
- Multi-suite runner:
  - biotaskbench, bioagent-bench, biocoder, bixbench, all
- External adapter scaffolds with:
  - setup/listing
  - optional command execution hooks
  - optional precomputed results JSON ingestion
  - score normalization
- Utilities for benchmark health:
  - data-size audit (<10MB/test, <500MB total)
  - flakiness audit across repeated run artifacts
  - adapter readiness/status diagnostics
- Automation scripts for Phase 4/5 calibration/baseline workflows.
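
The reporting metrics listed above can be illustrated with a minimal sketch. This is not the harness's reporter: the function name `summarize`, the input shape (a mapping from test id to score, with `None` for unattempted tests), and the example test ids are all hypothetical. It also assumes `completion_rate` is expressed as a fraction of all tests, which the list above does not specify.

```python
from typing import Optional


def summarize(scores: dict[str, Optional[float]]) -> dict[str, float]:
    """Compute the four headline metrics from per-test scores.

    `scores` maps a test id to a score in [0, 1], or None when the test
    was not attempted (an assumed input shape, for illustration only).
    """
    total = len(scores)
    if total == 0:
        return {"coverage": 0.0, "completion_rate": 0.0,
                "score": 0.0, "score_overall": 0.0}
    attempted = [s for s in scores.values() if s is not None]
    return {
        "coverage": len(attempted) / total,
        # Assumption: denominator is all tests, not only attempted ones.
        "completion_rate": sum(1 for s in attempted if s > 0) / total,
        "score": sum(attempted) / len(attempted) if attempted else 0.0,
        # Unattempted tests contribute 0 to the overall mean.
        "score_overall": sum(attempted) / total,
    }


# Hypothetical test ids; one test was never attempted.
example = {"chip-seq/t1": 1.0, "chip-seq/t2": 0.5,
           "chip-seq/t3": 0.0, "chip-seq/t4": None}
print(summarize(example))
```

Keeping `score` and `score_overall` separate lets a partial run be compared on quality (mean over attempted tests) without hiding how much of the suite was skipped.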
- harness/: CLI, runner, grader, reporter, schemas, adapter integrations, audit utilities
- tests/: BioTaskBench domain manifests + task definitions + data + expected outputs
- tests_harness/: harness unit/integration tests
- mock_outputs/: deterministic mock outputs for regression runs
- reference/: writing guide + benchmark survey
- scripts/: calibration and baseline run scripts
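
To make the grader's weighted multi-criteria scoring concrete, here is a rough sketch of how two criterion types might be combined. The function names mirror the criterion names, but their signatures, the use of the Jaccard index for `set_overlap`, and the weighted-mean aggregation are assumptions, not the actual code in harness/.

```python
def range_check(value: float, low: float, high: float) -> float:
    """Return 1.0 if value lies in the closed interval [low, high], else 0.0."""
    return 1.0 if low <= value <= high else 0.0


def set_overlap(found: set, expected: set) -> float:
    """Jaccard index of two sets (an assumed definition of overlap)."""
    if not found and not expected:
        return 1.0
    return len(found & expected) / len(found | expected)


def weighted_score(results: list[tuple[float, float]]) -> float:
    """Combine (weight, criterion_score) pairs into a weighted mean."""
    total_weight = sum(weight for weight, _ in results)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weight * score for weight, score in results) / total_weight


# Hypothetical task: one numeric criterion and one gene-set criterion.
criteria = [
    (1.0, range_check(0.42, 0.4, 0.5)),
    (3.0, set_overlap({"TP53", "MYC"}, {"TP53", "MYC", "EGFR"})),
]
print(round(weighted_score(criteria), 3))
```

Because every criterion is a pure function of the produced outputs and the expected values, repeated grading of the same workspace gives the same score, which is the property the deterministic harness relies on.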
# Validate suite/task schemas and expected artifacts
benchmarkAgentBfx validate --tests-root tests
# Run BioTaskBench using precomputed workspace outputs
benchmarkAgentBfx run --suite biotaskbench --tests-root tests --workspace-root mock_outputs
# Run BioTaskBench by executing an agent command per test
benchmarkAgentBfx run --suite biotaskbench --domain chip-seq --agent-cmd "python agent.py"
# Run all suites (BioTaskBench + external adapters)
benchmarkAgentBfx run --suite all --tests-root tests --workspace-root mock_outputs
# Compare and report
benchmarkAgentBfx compare results/run-a/run.json results/run-b/run.json
benchmarkAgentBfx report results/run-a/run.json results/run-b/run.json --output results/report.md
# Benchmark health audits
benchmarkAgentBfx audit-data --tests-root tests
benchmarkAgentBfx audit-flaky results/run-1/run.json results/run-2/run.json results/run-3/run.json --threshold 0.3
# External adapter readiness snapshot
benchmarkAgentBfx adapter-status

- Results ingestion:
  - BIOAGENT_BENCH_RESULTS_JSON
  - BIOCODER_RESULTS_JSON
  - BIXBENCH_RESULTS_JSON
- Optional execution commands:
  - BIOAGENT_BENCH_RUN_CMD
  - BIOCODER_RUN_CMD
  - BIXBENCH_RUN_CMD
- Optional roots/catalogs:
  - BIOAGENT_BENCH_ROOT, BIOAGENT_BENCH_TESTS_JSON
  - BIOCODER_ROOT, BIOCODER_TESTS_JSON
  - BIXBENCH_ROOT, BIXBENCH_TESTS_JSON
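
As an illustration of how these variables could drive an external adapter, the sketch below decides between ingesting precomputed results and running a command. The function `resolve_adapter_mode`, its return values, and the precedence (results JSON before run command) are assumptions made for this example; only the variable names come from the lists above.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_adapter_mode(
    prefix: str, env: Mapping[str, str] = os.environ
) -> Tuple[str, Optional[str]]:
    """Pick an adapter mode from <PREFIX>_RESULTS_JSON / <PREFIX>_RUN_CMD.

    Assumed precedence: precomputed results win over an execution command.
    """
    results_json = env.get(f"{prefix}_RESULTS_JSON")
    if results_json:
        return ("ingest", results_json)
    run_cmd = env.get(f"{prefix}_RUN_CMD")
    if run_cmd:
        return ("execute", run_cmd)
    return ("unconfigured", None)


# Report the mode each external suite would use in the current environment.
for suite_prefix in ("BIOAGENT_BENCH", "BIOCODER", "BIXBENCH"):
    print(suite_prefix, resolve_adapter_mode(suite_prefix))
```

Passing the environment as a mapping keeps the function easy to test without mutating `os.environ`.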
# Replica calibration + flakiness/data audits
./scripts/run_phase4_calibration.sh
# Baseline matrix (without/with skills when BIO_SKILLS_PATH is set)
BIO_SKILLS_PATH=~/.claude/skills ./scripts/run_phase5_baselines.sh
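
The flakiness audit compares scores for the same test across repeated runs. A minimal sketch of one way to do that follows. It assumes the threshold (0.3 in the audit-flaky example) is applied to the max-minus-min score spread per test, and that each run has already been reduced to a mapping from test id to score; the real run.json schema and the harness's actual flakiness definition may differ.

```python
def flaky_tests(runs: list[dict[str, float]], threshold: float = 0.3) -> dict[str, float]:
    """Return {test_id: spread} for tests whose score spread exceeds threshold.

    Spread is max(score) - min(score) over the runs that include the test
    (an assumed definition of flakiness, for illustration only).
    """
    flagged: dict[str, float] = {}
    test_ids = set().union(*runs)  # iterating a dict yields its keys
    for test_id in sorted(test_ids):
        scores = [run[test_id] for run in runs if test_id in run]
        if len(scores) < 2:
            continue  # a single observation cannot show variation
        spread = max(scores) - min(scores)
        if spread > threshold:
            flagged[test_id] = spread
    return flagged


# Hypothetical scores from three repeated runs of two tests.
runs = [
    {"t1": 1.0, "t2": 0.5},
    {"t1": 1.0, "t2": 0.0},
    {"t1": 0.9, "t2": 0.5},
]
print(flaky_tests(runs))
```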