Local biomedical retrieval and grounded question answering over PubMed.
Companion repository for the paper: Integrating AI and IR Paradigms for Sustainable and Trustworthy Accurate Access to Large Scale Biomedical Information — presented at ECIR 2026, Delft, The Netherlands.
- Overview
- What This Repository Provides
- Architecture
- Data: PubMed Processing and Qdrant Indexing
- Models
- Installation and Setup
- Running the Application
- Demo
- Repository Structure
- Citation
- License
Accessing reliable biomedical information at scale is a critical challenge. Researchers and clinicians need precise, evidence-grounded answers to their questions, drawn from the vast and continuously growing body of published literature. General-purpose LLMs can hallucinate facts; purely keyword-based search engines return documents but do not extract or synthesise the answer.
LocalBioRAG bridges this gap. It is a fully local, end-to-end Retrieval-Augmented Generation (RAG) system designed for biomedical question answering over the entire PubMed corpus (~38 million abstracts). Unlike cloud-based solutions, the complete pipeline — retrieval, passage extraction, and answer generation — runs on your own infrastructure with no external API calls, ensuring data privacy, reproducibility, and full control over every component.
| What | Why it matters |
|---|---|
| Complete application code | A working Flask web app that you can deploy on a GPU server and use immediately. |
| Hybrid retrieval pipeline | Combines BM25 sparse search with BGE-M3 dense re-ranking for high-recall, high-precision document retrieval. |
| LLM-based snippet extraction | A fine-tuned LoRA adapter on Llama 3.1 8B extracts exact supporting passages from each retrieved abstract. |
| Evidence-grounded answer generation | A second LoRA adapter generates concise answers using a context-windowing strategy that preserves snippet provenance. |
| Indexing guidance | Step-by-step instructions to reproduce the full PubMed index in Qdrant, so you can rebuild the system from scratch. |
| Web UI | A clean, responsive interface with example queries, Excel export, and a loading experience designed for live demos. |
The pipeline consists of three stages executed sequentially for each user query:

**Stage 1 — Hybrid retrieval**
- The user query is sent to Qdrant, which performs a BM25 sparse search and returns the top 100 candidate documents.
- Each candidate already carries a pre-computed BGE-M3 dense vector. The query is encoded with the same model, and cosine similarity is computed against all 100 candidates.
- A hybrid score (BM25_normalised × cosine_similarity) is calculated, and the top 10 documents are selected.

**Stage 2 — Snippet extraction**
- For each of the 10 documents, both title and abstract are passed to Llama 3.1 8B Instruct equipped with a LoRA adapter fine-tuned for snippet extraction (trained on BioASQ data).
- The model returns exact text spans enclosed in `[BS]`/`[ES]` tags. Documents without any extracted snippet are filtered out.

**Stage 3 — Answer generation**
- The user can request an evidence-grounded answer. The top 3 documents are formatted using a context-windowing strategy: for each snippet, up to 20 words before and 10 words after are kept from the original abstract.
- A second LoRA adapter (multi-task, also trained on BioASQ) generates a concise (≤ 200 words) answer grounded in the extracted evidence.
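As an illustration of the hybrid scoring step above, here is a minimal NumPy sketch. The function name and the exact BM25 normalisation are assumptions for illustration; the repository's `retrieval_logic.py` is the authoritative implementation.

```python
import numpy as np

def hybrid_rank(bm25_scores, query_vec, doc_vecs, top_k=10):
    """Illustrative re-ranking: normalised BM25 x cosine similarity.

    bm25_scores: BM25 scores for the candidate documents (length n)
    query_vec:   dense query vector (e.g. BGE-M3, 1024-dim)
    doc_vecs:    pre-computed document vectors, shape (n, dim)
    Returns the indices of the top_k documents by hybrid score.
    """
    bm25 = np.asarray(bm25_scores, dtype=float)
    bm25_norm = bm25 / bm25.max()  # assumed max-normalisation into [0, 1]

    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)
    cos = (d @ q) / (np.linalg.norm(d, axis=1) * np.linalg.norm(q))

    hybrid = bm25_norm * cos  # BM25_normalised x cosine_similarity
    return np.argsort(hybrid)[::-1][:top_k]
```

In the full system this re-ranking runs over the 100 BM25 candidates returned by Qdrant and keeps the top 10.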
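The snippet tags and the context-windowing strategy described above can be sketched as follows. The helper names are hypothetical; the `[BS]`/`[ES]` tag format and the 20-before/10-after window sizes follow the description in this section.

```python
import re

def parse_snippets(model_output: str) -> list:
    """Extract all spans enclosed in [BS]...[ES] tags from the model output."""
    return [m.strip() for m in re.findall(r"\[BS\](.*?)\[ES\]", model_output, re.S)]

def context_window(abstract: str, snippet: str, before: int = 20, after: int = 10) -> str:
    """Keep up to `before` words before and `after` words after the snippet,
    taken from the original abstract (word-level matching, illustrative only)."""
    words = abstract.split()
    snip = snippet.split()
    n = len(snip)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == snip:
            start = max(0, i - before)
            end = min(len(words), i + n + after)
            return " ".join(words[start:end])
    return snippet  # fallback: snippet not found verbatim in the abstract
```

Documents whose model output contains no `[BS]`/`[ES]` pair yield an empty list and are filtered out.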
To reproduce the full system you need to index PubMed into a Qdrant collection. Below are the detailed steps.
PubMed distributes its complete set of abstracts as XML files via the NLM FTP server:
- Baseline files (annual full snapshot): https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
- Update files (daily incremental): https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/
- Documentation: https://pubmed.ncbi.nlm.nih.gov/download/
Download all `.xml.gz` files from the baseline directory. Each file contains thousands of `<PubmedArticle>` records.
From each record, extract the following fields:
| Field name | Source in XML | Description |
|---|---|---|
| `pubmed_id` | `<PMID>` | Unique PubMed identifier |
| `article_title` | `<ArticleTitle>` | Title of the article |
| `abstract_text` | `<AbstractText>` | Full abstract (concatenate all `<AbstractText>` elements) |
| `year` | `<PubDate><Year>` | Publication year |
Title and abstract must be encoded separately with BGE-M3 and then combined into a single 1024-dimensional vector using a weighted average (0.2 × title + 0.8 × abstract):
```python
import numpy as np
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

title_emb = model.encode([article_title], return_dense=True,
                         return_sparse=False, return_colbert_vecs=False)
abstract_emb = model.encode([abstract_text], return_dense=True,
                            return_sparse=False, return_colbert_vecs=False)

dense_vector = (
    0.2 * np.array(title_emb["dense_vecs"][0])
    + 0.8 * np.array(abstract_emb["dense_vecs"][0])
).tolist()  # list of 1024 floats
```

Why a weighted average? Biomedical abstracts carry the bulk of the factual content, while titles are shorter and more general. The 80/20 split reflects this asymmetry and was found to yield better retrieval quality on BioASQ evaluation data.
Install Qdrant following the official documentation. A Docker deployment is the easiest path:
```shell
docker pull qdrant/qdrant
docker run -p 6333:6333 -p 6334:6334 \
  -v $(pwd)/qdrant_storage:/qdrant/storage qdrant/qdrant
```

Create a collection with the required vector configuration.
The collection must be named `pubmed_BGE` (or update the `COLLECTION_NAME` constant in `retrieval_logic.py`):
```python
from qdrant_client import QdrantClient
from qdrant_client.http import models

client = QdrantClient(host="localhost", port=6333)

client.create_collection(
    collection_name="pubmed_BGE",
    vectors_config={
        "dense_vector_BGE": models.VectorParams(
            size=1024,
            distance=models.Distance.COSINE,
            on_disk=True,  # recommended for ~38M vectors
        ),
    },
    sparse_vectors_config={
        "sparse_vector": models.SparseVectorParams(
            modifier=models.Modifier.IDF,
        ),
    },
)
```

Upload each record as a Qdrant point containing:

- The pre-computed dense vector under `"dense_vector_BGE"`.
- A sparse BM25 vector under `"sparse_vector"`, computed server-side by Qdrant. Pass the concatenated title + abstract as a `models.Document` with `model="Qdrant/bm25"`. The `avg_len` option sets the average document length (in characters) used by BM25 length normalisation; the value `792.69` was measured on the full PubMed corpus.
- The payload fields needed by the application.
```python
import uuid

AVG_DOCUMENT_LENGTH = 792.69  # avg chars of title + abstract across PubMed

full_text = f"{article_title} {abstract_text}"

client.upsert(
    collection_name="pubmed_BGE",
    points=[
        models.PointStruct(
            id=str(uuid.uuid4()),
            vector={
                "dense_vector_BGE": dense_vector,
                "sparse_vector": models.Document(
                    text=full_text,
                    model="Qdrant/bm25",
                    options={"avg_len": AVG_DOCUMENT_LENGTH},
                ),
            },
            payload={
                "pubmed_id": pubmed_id,
                "article_title": article_title,
                "abstract_text": abstract_text,
                "year": year,
            },
        )
    ],
)
```

Tip: Process files in batches (e.g. 100 points per `upsert` call) and keep a checkpoint of processed files to allow resuming after interruptions. The full PubMed baseline (~38 M records) can take several hours to index.
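The batching and checkpointing tip can be sketched as follows. The checkpoint file name and helper names are illustrative assumptions, not part of the repository:

```python
import json
from pathlib import Path

CHECKPOINT = Path("indexed_files.json")  # hypothetical checkpoint file

def load_done() -> set:
    """Return the set of baseline files already fully indexed."""
    if CHECKPOINT.exists():
        return set(json.loads(CHECKPOINT.read_text()))
    return set()

def mark_done(done: set, filename: str) -> None:
    """Record a finished file so indexing can resume after an interruption."""
    done.add(filename)
    CHECKPOINT.write_text(json.dumps(sorted(done)))

def chunked(points, size=100):
    """Yield successive batches of `size` points, one upsert call each."""
    for i in range(0, len(points), size):
        yield points[i:i + size]
```

A driver loop would then skip files in `load_done()`, call `client.upsert` once per batch from `chunked(...)`, and call `mark_done` after each file completes.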
Summary of required Qdrant schema:

| Type | Name | Details |
|---|---|---|
| Named dense vector | `dense_vector_BGE` | 1024-dim, cosine distance, on-disk (BGE-M3 weighted avg) |
| Named sparse vector | `sparse_vector` | BM25 with IDF modifier (`avg_len` = 792.69) |
| Payload field | `pubmed_id` | string |
| Payload field | `article_title` | string |
| Payload field | `abstract_text` | string |
| Payload field | `year` | string |
| Model | Link |
|---|---|
| Llama 3.1 8B Instruct | unsloth/Llama-3.1-8B-Instruct |
| Model | Link |
|---|---|
| BGE-M3 | BAAI/bge-m3 |
| Adapter | Purpose | Link |
|---|---|---|
| Snippet Extraction LoRA | Extracts relevant passages from titles and abstracts | sag-uniroma2/bioasq-snippet-extraction-lora |
| Answer Generation LoRA | Generates evidence-grounded answers from extracted passages | sag-uniroma2/bioasq-answer-generation-lora |
Note: Replace the placeholder links above with the actual Hugging Face repository URLs once the adapters are published.
- Python 3.10+
- A CUDA-capable GPU (recommended: ≥ 24 GB VRAM for Llama 3.1 8B with LoRA)
- A running Qdrant instance with the indexed PubMed collection (see above)
```shell
git clone https://github.com/your-org/LocalBioRAG.git
cd LocalBioRAG
pip install -r requirements.txt
```

Open `retrieval_logic.py` and update the configuration block at the top of the file:
```python
QDRANT_HOST = "localhost"  # your Qdrant server
QDRANT_PORT = 6333
MODEL_PATH = "/path/to/Llama-3.1-8B-Instruct"  # or HuggingFace repo ID
LORA_PATH_EXTR = "/path/to/snippet-extraction-lora"
LORA_PATH_ANSW = "/path/to/answer-generation-lora"
EMBEDDING_MODEL_PATH = "/path/to/bge-m3"
```

Set the web UI credentials via environment variables:

```shell
export APP_USERNAME="your_username"
export APP_PASSWORD="your_password"
```

If not set, the defaults (`sagdemo` / `Demo2026!`) are used.
```shell
python app.py
```

The server starts on `http://0.0.0.0:5000`. Open it in your browser and authenticate with the configured credentials.
demo1.mp4
demo2.mp4
```
LocalBioRAG/
├── app.py                 # Flask web server (routes, auth)
├── retrieval_logic.py     # Hybrid retrieval, snippet extraction, answer generation
├── requirements.txt       # Python dependencies
├── templates/
│   └── index.html         # Main UI template
├── static/
│   ├── style.css          # Stylesheet
│   └── script.js          # Client-side logic (search, modal, Excel export)
├── assets/
│   └── architecture.png   # Architecture diagram
└── README.md              # This file
```
If you use this code or system in your research, please cite:
```bibtex
@InProceedings{10.1007/978-3-032-21324-2_31,
  author="Borazio, Federico
  and Labbate, Francesco
  and Croce, Danilo
  and Basili, Roberto",
  editor="Campos, Ricardo
  and Jatowt, Adam
  and Lan, Yanyan
  and Aliannejadi, Mohammad
  and Bauer, Christine
  and MacAvaney, Sean
  and Anand, Avishek
  and Ren, Zhaochun
  and Verberne, Suzan
  and Bai, Nan
  and Mansoury, Masoud",
  title="Integrating AI and IR Paradigms for Sustainable and Trustworthy Accurate Access to Large Scale Biomedical Information",
  booktitle="Advances in Information Retrieval",
  year="2026",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="398--412",
  isbn="978-3-032-21324-2"
}
```

This project is released under the MIT License.
Developed by SAG · University of Rome Tor Vergata
