This guide provides detailed instructions for installing and configuring BOND.

Requirements:
- Python: 3.11 or higher
- Operating System: Linux, macOS, or Windows
- Memory: 8GB RAM minimum (16GB+ recommended)
- Disk Space: ~5GB for ontology database and FAISS indices
- Optional: GPU for faster embedding inference (CUDA-compatible)
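The Python version floor can be checked up front. A minimal stdlib snippet (nothing BOND-specific is assumed here):

```python
import sys

def meets_floor(min_version=(3, 11)):
    """True if the running interpreter satisfies BOND's minimum Python."""
    return sys.version_info >= min_version

print("Python OK" if meets_floor() else "Please upgrade to Python 3.11+")
```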
```bash
# Clone the repository
git clone https://github.com/Aronow-Lab/BOND.git
cd BOND

# Create virtual environment
python3.11 -m venv bond_venv

# Activate (Linux/macOS)
source bond_venv/bin/activate
# Activate (Windows)
bond_venv\Scripts\activate

# Upgrade pip
pip install --upgrade pip

# Install BOND package (editable mode)
pip install -e .
```

Or install with development dependencies:

```bash
pip install -e ".[dev]"
```

You need an SQLite database containing ontology terms. Detailed information on how to create, update, and manage the ontology database can be found in assets/README.md.
You have two options:

Option 1: If you have access to a pre-built ontologies.sqlite file:

```bash
mkdir -p assets
cp /path/to/ontologies.sqlite assets/ontologies.sqlite
```

Option 2: Generate the SQLite database from OBO/OWL files:

```bash
bond-generate-sqlite \
    --input_dir /path/to/ontology/files \
    --output_path assets/ontologies.sqlite
```

The script supports:

- OBO format files (.obo)
- OWL format files (.owl)
- JSON-LD format
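Ontology formats like these are typically distinguished by file extension. A hypothetical helper illustrating that routing (the real detection lives inside bond-generate-sqlite and may differ):

```python
from pathlib import Path

# Hypothetical extension-to-format map mirroring the formats listed above.
FORMATS = {".obo": "OBO", ".owl": "OWL", ".json": "JSON-LD", ".jsonld": "JSON-LD"}

def classify(paths):
    """Map each file name to its ontology format, or 'unsupported'."""
    return {Path(p).name: FORMATS.get(Path(p).suffix.lower(), "unsupported")
            for p in paths}

print(classify(["cl.obo", "uberon.owl", "mondo.json", "readme.txt"]))
```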
Required ontologies for full functionality:
- Cell Ontology (CL)
- UBERON
- MONDO Disease Ontology
- Experimental Factor Ontology (EFO)
- PATO
- HANCESTRO
- NCBI Taxonomy
- Organism-specific development stage ontologies (HsapDv, MmusDv, etc.)
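To confirm those ontologies actually made it into the database, you can count terms per CURIE prefix. The table and column names below (`terms`, `id`) are assumptions — check assets/README.md for the real schema; the demo runs against a tiny in-memory stand-in rather than assets/ontologies.sqlite:

```python
import sqlite3

def count_terms_by_prefix(conn, table="terms", id_col="id"):
    """Count ontology terms grouped by CURIE prefix (e.g. 'CL', 'UBERON').

    NOTE: table/column names are assumptions; adjust to the actual schema.
    """
    counts = {}
    for (term_id,) in conn.execute(f"SELECT {id_col} FROM {table}"):
        prefix = term_id.split(":", 1)[0]
        counts[prefix] = counts.get(prefix, 0) + 1
    return counts

# Demo with an in-memory database standing in for assets/ontologies.sqlite
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE terms (id TEXT PRIMARY KEY, label TEXT)")
conn.executemany("INSERT INTO terms VALUES (?, ?)", [
    ("CL:0000084", "T cell"),
    ("CL:0000623", "natural killer cell"),
    ("UBERON:0000178", "blood"),
])
print(count_terms_by_prefix(conn))
```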
Create or update the abbreviations dictionary at assets/abbreviations.json to improve query matching for abbreviated terms. This file is optional but recommended:
```bash
# Create abbreviations file
mkdir -p assets
cat > assets/abbreviations.json << 'EOF'
{
  "cell_type": {
    "t": "t cell",
    "nk": "natural killer cell",
    "dc": "dendritic cell",
    "b": "b cell",
    "mono": "monocyte",
    "mφ": "macrophage",
    "neu": "neutrophil"
  },
  "tissue": {
    "bm": "bone marrow",
    "ln": "lymph node",
    "spl": "spleen"
  }
}
EOF
```

Before building the FAISS index, you need to configure your embedding model. The FAISS index must be built with the same embedding model you'll use at runtime.
Important: Configure your embedding model in the Environment Configuration step below before building the FAISS index.
See the Selecting Your Encoder section below for detailed options.
Build the FAISS index for dense semantic search:
```bash
bond-build-faiss \
    --sqlite_path assets/ontologies.sqlite \
    --assets_path assets \
    --embed_model st:all-MiniLM-L6-v2
```

Note: This step requires:

- An embedding model configured in the .env file (see the Environment Configuration step below)
- Several hours for large ontology databases
- Sufficient disk space (~2-5GB)

Important: Make sure you've configured your embedding model in the .env file before running this command, as the FAISS index must match your runtime embedding model.
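For intuition, here is what the dense index provides in miniature: nearest-neighbour search over L2-normalized term embeddings. The 3-dimensional vectors below are toy stand-ins for real sentence-embedding output; FAISS performs the same lookup at scale:

```python
import numpy as np

# Toy "index": three ontology term labels with made-up embedding vectors.
terms = ["t cell", "natural killer cell", "bone marrow"]
emb = np.array([[1.0, 0.1, 0.0],
                [0.9, 0.4, 0.1],
                [0.0, 0.1, 1.0]])
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize rows

def search(query_vec, k=2):
    """Return the k nearest terms by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = emb @ q                      # cosine similarity per term
    top = np.argsort(-scores)[:k]
    return [(terms[i], float(scores[i])) for i in top]

print(search(np.array([1.0, 0.0, 0.0])))
```

This is why index and runtime models must match: the query vector and the stored vectors only live in the same space if they come from the same encoder.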
Create a .env file in the project root:
```bash
# Embedding Model Configuration
# Options:
#   - st:all-MiniLM-L6-v2 (Sentence Transformers, default)
#   - st:sentence-transformers/all-mpnet-base-v2
#   - litellm/http://your-embedding-service
BOND_EMBED_MODEL=ollama/rajdeopankaj/bond-embed-v1-fp16:latest

# LLM Providers for Expansion and Disambiguation
# You need at least one configured

# Option 1: Anthropic Claude
BOND_EXPANSION_LLM=anthropic/claude-3-5-sonnet-20241022
BOND_DISAMBIGUATION_LLM=anthropic/claude-3-5-sonnet-20241022
ANTHROPIC_API_KEY=your-anthropic-api-key

# Option 2: OpenAI GPT
# BOND_EXPANSION_LLM=openai/gpt-4o
# BOND_DISAMBIGUATION_LLM=openai/gpt-4o
# OPENAI_API_KEY=your-openai-api-key

# Option 3: Other LiteLLM-compatible providers
# BOND_EXPANSION_LLM=cohere/command-r-plus
# BOND_DISAMBIGUATION_LLM=cohere/command-r-plus
# COHERE_API_KEY=your-cohere-api-key

# Paths (defaults shown)
BOND_ASSETS_PATH=assets
BOND_SQLITE_PATH=assets/ontologies.sqlite
BOND_RERANKER_PATH=reranker-model/

# Optional: Retrieval-only mode (skip LLM stages)
# BOND_RETRIEVAL_ONLY=1

# Optional: API Authentication
# BOND_API_KEY=your-secret-api-key
# BOND_ALLOW_ANON=1  # Allow anonymous access (development only)
```

The reranker model improves accuracy by 10-15%. It's optional but recommended:
- Download from Hugging Face: https://huggingface.co/AronowLab/BOND-reranker
- Extract model files to the reranker-model/ directory:

  ```bash
  mkdir -p reranker-model
  # Download and extract model files to reranker-model/
  ```

- Verify files: The directory should contain:
  - config.json
  - model.safetensors (or pytorch_model.bin)
  - tokenizer_config.json
  - vocab.txt
  - Other tokenizer files
Note: The BOND_RERANKER_PATH in your .env file should point to this directory (default: reranker-model/). See reranker-model/README.md for detailed instructions.
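A small script can verify the directory before wiring it into .env. The required-file list mirrors the one above; the demo checks a temporary stand-in directory rather than reranker-model/ itself:

```python
import tempfile
from pathlib import Path

REQUIRED = ["config.json", "tokenizer_config.json", "vocab.txt"]

def missing_reranker_files(model_dir):
    """List required reranker files absent from model_dir."""
    d = Path(model_dir)
    missing = [name for name in REQUIRED if not (d / name).exists()]
    # Weights ship as either safetensors or a PyTorch checkpoint.
    if not ((d / "model.safetensors").exists()
            or (d / "pytorch_model.bin").exists()):
        missing.append("model.safetensors (or pytorch_model.bin)")
    return missing

# Demo against a temporary stand-in for reranker-model/
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    result = missing_reranker_files(tmp)
print(result)
```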
Verify that all components are properly installed:
```bash
# Check SQLite database exists
ls -lh assets/ontologies.sqlite

# Check FAISS index exists
ls -lh assets/faiss_store/embeddings.faiss
ls -lh assets/faiss_store/id_map.npy

# Check abbreviations file (optional)
ls -lh assets/abbreviations.json

# Check reranker model (optional)
ls -lh reranker-model/config.json

# Check CLI works
bond-query --help

# Test query (requires database and FAISS index)
bond-query \
    --query "T-cell" \
    --field cell_type \
    --organism "Homo sapiens" \
    --tissue "blood"
```

You can also test the Python API directly:

```python
from bond import BondMatcher
from bond.config import BondSettings

# Initialize matcher
settings = BondSettings()
matcher = BondMatcher(settings)

# Test query
result = matcher.query(
    query="T-cell",
    field_name="cell_type",
    organism="Homo sapiens",
    tissue="blood",
)

print(f"Matched: {result['chosen']['label']}")
print(f"Ontology ID: {result['chosen']['id']}")
```
Start the API server and check its health endpoint:

```bash
# Start server (if API key is set)
bond-serve

# In another terminal:
curl http://localhost:8000/health
```

A Dockerfile is provided for containerized deployment:
```bash
# Build image
docker build -t bond:latest .

# Run container
docker run -p 8000:8000 \
    -v $(pwd)/assets:/app/assets \
    -e BOND_API_KEY=your-key \
    -e ANTHROPIC_API_KEY=your-key \
    bond:latest
```

If the ontology database cannot be found, ensure it exists at the specified path:

```bash
ls -lh assets/ontologies.sqlite
```

If the FAISS index is missing, build it:

```bash
bond-build-faiss --sqlite_path assets/ontologies.sqlite --assets_path assets
```

If LLM calls fail:
- Verify API keys are set correctly
- Check API key permissions (write access required)
- Ensure sufficient API credits/quota
- Try a different LLM provider
If you run out of memory while building the FAISS index:
- Build index with smaller batch size
- Use CPU-only FAISS (faiss-cpu) instead of GPU version
- Process ontologies in chunks
If you hit import errors, ensure the virtual environment is activated and dependencies are installed:

```bash
source bond_venv/bin/activate
pip install -e .
```

Next steps:

- Read the README.md for usage examples
- Explore Hybrid Search Guide for advanced features
- Review the Reranker Training Guide for custom model training; see notebooks/ for training code and example notebooks
For questions and support, see the resources below.
- Benchmark Dataset: HuggingFace Dataset
- Paper: Multi-agent AI System for High Quality Metadata Curation at Scale (a related multi-agent curation system)
- Issues: GitHub Issues
Important: Configure your embedding model before building the FAISS index. The FAISS index must be built with the same embedding model you'll use at runtime.

You can use the published BOND encoders with BOND.
Option 1: Ollama

- Pull the model:

  ```bash
  ollama pull rajdeopankaj/bond-embed-v1-fp16
  ```

- Set the env var (e.g., in .env):

  ```bash
  BOND_EMBED_MODEL=ollama:rajdeopankaj/bond-embed-v1-fp16
  # OLLAMA_API_BASE=http://localhost:11434  # if remote, set your host
  ```

- Build FAISS:

  ```bash
  bond-build-faiss --sqlite_path assets/ontologies.sqlite --assets_path assets
  ```

Option 2: LiteLLM-compatible endpoint

- Deploy pankajrajdeo/bond-embed-v1-fp16 behind a LiteLLM-compatible endpoint (e.g., TEI + gateway).
- Set the env var to the routed model name, for example:

  ```bash
  BOND_EMBED_MODEL=litellm:huggingface/teimodel
  ```

- Build FAISS as usual.
References: