LocalBioRAG

Local biomedical retrieval and grounded question answering over PubMed.

Companion repository for the paper: Integrating AI and IR Paradigms for Sustainable and Trustworthy Accurate Access to Large Scale Biomedical Information — presented at ECIR 2026, Delft, The Netherlands.

![LocalBioRAG architecture diagram](assets/architecture.png)


Table of Contents

  1. Overview
  2. What This Repository Provides
  3. Architecture
  4. Data: PubMed Processing and Qdrant Indexing
  5. Models
  6. Installation and Setup
  7. Running the Application
  8. Demo
  9. Repository Structure
  10. Citation
  11. License

Overview

Accessing reliable biomedical information at scale is a critical challenge. Researchers and clinicians need precise, evidence-grounded answers to their questions, drawn from the vast and continuously growing body of published literature. General-purpose LLMs can hallucinate facts; purely keyword-based search engines return documents but do not extract or synthesise the answer.

LocalBioRAG bridges this gap. It is a fully local, end-to-end Retrieval-Augmented Generation (RAG) system designed for biomedical question answering over the entire PubMed corpus (~38 million abstracts). Unlike cloud-based solutions, the complete pipeline — retrieval, passage extraction, and answer generation — runs on your own infrastructure with no external API calls, ensuring data privacy, reproducibility, and full control over every component.


What This Repository Provides

| What | Why it matters |
| --- | --- |
| Complete application code | A working Flask web app that you can deploy on a GPU server and use immediately. |
| Hybrid retrieval pipeline | Combines BM25 sparse search with BGE-M3 dense re-ranking for high-recall, high-precision document retrieval. |
| LLM-based snippet extraction | A fine-tuned LoRA adapter on Llama 3.1 8B extracts exact supporting passages from each retrieved abstract. |
| Evidence-grounded answer generation | A second LoRA adapter generates concise answers using a context-windowing strategy that preserves snippet provenance. |
| Indexing guidance | Step-by-step instructions to reproduce the full PubMed index in Qdrant, so you can rebuild the system from scratch. |
| Web UI | A clean, responsive interface with example queries, Excel export, and a loading experience designed for live demos. |

Architecture

The pipeline consists of three stages executed sequentially for each user query:

Stage 1 — Hybrid Retrieval

  1. The user query is sent to Qdrant, which performs a BM25 sparse search and returns the top 100 candidate documents.
  2. Each candidate already carries a pre-computed BGE-M3 dense vector. The query is encoded with the same model and cosine similarity is computed against all 100 candidates.
  3. A hybrid score (BM25_normalised × cosine_similarity) is calculated. The top 10 documents are selected.
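The re-ranking arithmetic above can be sketched in a few lines. The function name and the min-max normalisation are illustrative assumptions; only the hybrid score formula (normalised BM25 × cosine similarity) and the top-10 cut come from the pipeline description:

```python
import numpy as np

def hybrid_rerank(bm25_scores, query_vec, doc_vecs, top_k=10):
    """Re-rank BM25 candidates by hybrid score (sketch).

    bm25_scores: BM25 scores for the candidates (higher = better)
    query_vec:   BGE-M3 dense embedding of the query, shape (1024,)
    doc_vecs:    pre-computed candidate embeddings, shape (n, 1024)
    Returns the indices of the top_k candidates by hybrid score.
    """
    bm25 = np.asarray(bm25_scores, dtype=float)
    # Min-max normalise BM25 scores into [0, 1] (assumed normalisation scheme)
    bm25_norm = (bm25 - bm25.min()) / (bm25.max() - bm25.min() + 1e-9)
    # Cosine similarity between the query and every candidate
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    hybrid = bm25_norm * (d @ q)
    return np.argsort(hybrid)[::-1][:top_k]
```

In the real pipeline the candidate vectors come back from Qdrant alongside the BM25 results, so no extra document encoding is needed at query time.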

Stage 2 — Snippet Extraction

  1. For each of the 10 documents, both title and abstract are passed to Llama 3.1 8B Instruct equipped with a LoRA adapter fine-tuned for snippet extraction (trained on BioASQ data).
  2. The model returns exact text spans enclosed in [BS] / [ES] tags. Documents without any extracted snippet are filtered out.
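A minimal parser for the tagged model output might look like this; the helper name is hypothetical, and only the `[BS]`/`[ES]` tag format comes from the system description:

```python
import re

def extract_snippets(model_output: str) -> list[str]:
    """Return all text spans enclosed in [BS] ... [ES] tags."""
    return [m.strip()
            for m in re.findall(r"\[BS\](.*?)\[ES\]", model_output, re.DOTALL)]
```

Documents for which this list comes back empty are the ones filtered out at the end of Stage 2.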

Stage 3 — Answer Generation (on demand)

  1. The user can request an evidence-grounded answer. The top 3 documents are formatted using a context-windowing strategy: for each snippet, up to 20 words before and 10 words after are kept from the original abstract.
  2. A second LoRA adapter (multi-task, also trained on BioASQ) generates a concise (≤ 200 words) answer grounded in the extracted evidence.
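The context-windowing rule can be sketched as follows, assuming whitespace-token matching; the actual implementation may instead work on character offsets:

```python
def context_window(abstract: str, snippet: str,
                   before: int = 20, after: int = 10) -> str:
    """Keep up to `before` words preceding and `after` words following
    the snippet, taken from the original abstract (sketch)."""
    words = abstract.split()
    snip_words = snippet.split()
    n = len(snip_words)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == snip_words:
            start = max(0, i - before)
            end = min(len(words), i + n + after)
            return " ".join(words[start:end])
    return snippet  # fallback: snippet not found verbatim in the abstract
```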

Data: PubMed Processing and Qdrant Indexing

To reproduce the full system you need to index PubMed into a Qdrant collection. Below are the detailed steps.

1. Download PubMed

PubMed distributes its complete set of abstracts as XML files via the NLM FTP server:

Download all `.xml.gz` files from the baseline directory (https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/). Each file contains thousands of `<PubmedArticle>` records.

2. Parse and extract fields

From each record, extract the following fields:

| Field name | Source in XML | Description |
| --- | --- | --- |
| `pubmed_id` | `<PMID>` | Unique PubMed identifier |
| `article_title` | `<ArticleTitle>` | Title of the article |
| `abstract_text` | `<AbstractText>` | Full abstract (concatenate all `<AbstractText>` elements) |
| `year` | `<PubDate><Year>` | Publication year |
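A minimal parsing sketch using only the standard library; real baseline records are messier (structured abstracts carry section labels, and some records use `<MedlineDate>` instead of `<Year>`), so treat this as a starting point:

```python
import xml.etree.ElementTree as ET

def parse_pubmed_records(xml_bytes: bytes):
    """Yield one dict per <PubmedArticle> record (sketch)."""
    root = ET.fromstring(xml_bytes)
    for article in root.iter("PubmedArticle"):
        # Abstracts may be split into several <AbstractText> sections
        abstract = " ".join(
            (el.text or "").strip() for el in article.iter("AbstractText")
        ).strip()
        yield {
            "pubmed_id": article.findtext(".//PMID"),
            "article_title": article.findtext(".//ArticleTitle") or "",
            "abstract_text": abstract,
            "year": article.findtext(".//PubDate/Year") or "",
        }
```

For a baseline file, decompress first, e.g. `parse_pubmed_records(gzip.open(path, "rb").read())`.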

3. Compute dense embeddings

Title and abstract must be encoded separately with BGE-M3 and then combined into a single 1024-dimensional vector using a weighted average (0.2 × title + 0.8 × abstract):

```python
import numpy as np
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

title_emb = model.encode([article_title], return_dense=True,
                         return_sparse=False, return_colbert_vecs=False)
abstract_emb = model.encode([abstract_text], return_dense=True,
                            return_sparse=False, return_colbert_vecs=False)

dense_vector = (
    0.2 * np.array(title_emb["dense_vecs"][0])
  + 0.8 * np.array(abstract_emb["dense_vecs"][0])
).tolist()   # list of 1024 floats
```

Why a weighted average? Biomedical abstracts carry the bulk of the factual content, while titles are shorter and more general. The 80/20 split reflects this asymmetry and was found to yield better retrieval quality on BioASQ evaluation data.

4. Create the Qdrant collection

Install Qdrant following the official documentation. A Docker deployment is the easiest path:

```shell
docker pull qdrant/qdrant
docker run -p 6333:6333 -p 6334:6334 \
    -v $(pwd)/qdrant_storage:/qdrant/storage qdrant/qdrant
```

Create a collection with the required vector configuration. The collection must be named `pubmed_BGE` (or update the `COLLECTION_NAME` constant in `retrieval_logic.py`):

```python
from qdrant_client import QdrantClient
from qdrant_client.http import models

client = QdrantClient(host="localhost", port=6333)

client.create_collection(
    collection_name="pubmed_BGE",
    vectors_config={
        "dense_vector_BGE": models.VectorParams(
            size=1024,
            distance=models.Distance.COSINE,
            on_disk=True,            # recommended for ~38M vectors
        ),
    },
    sparse_vectors_config={
        "sparse_vector": models.SparseVectorParams(
            modifier=models.Modifier.IDF,
        ),
    },
)
```

5. Index documents

Upload each record as a Qdrant point containing:

  • The pre-computed dense vector under "dense_vector_BGE".
  • A sparse BM25 vector under `"sparse_vector"`, computed server-side by Qdrant. Pass the concatenated title + abstract as a `models.Document` with `model="Qdrant/bm25"`. The `avg_len` option sets the average document length (in characters) used by BM25 length normalisation; the value 792.69 was measured on the full PubMed corpus.
  • The payload fields needed by the application.

```python
import uuid

AVG_DOCUMENT_LENGTH = 792.69   # avg chars of title + abstract across PubMed

full_text = f"{article_title} {abstract_text}"

client.upsert(
    collection_name="pubmed_BGE",
    points=[
        models.PointStruct(
            id=str(uuid.uuid4()),
            vector={
                "dense_vector_BGE": dense_vector,
                "sparse_vector": models.Document(
                    text=full_text,
                    model="Qdrant/bm25",
                    options={"avg_len": AVG_DOCUMENT_LENGTH},
                ),
            },
            payload={
                "pubmed_id": pubmed_id,
                "article_title": article_title,
                "abstract_text": abstract_text,
                "year": year,
            },
        )
    ],
)
```

Tip: Process files in batches (e.g. 100 points per upsert call) and keep a checkpoint of processed files to allow resuming after interruptions. The full PubMed baseline (~38 M records) can take several hours to index.
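The batching tip above can be implemented with a small helper; the indexing loop in the comment is only a sketch, and `make_point` / `save_checkpoint` are hypothetical names:

```python
def batched(iterable, size=100):
    """Yield successive lists of at most `size` items from `iterable`."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

# Sketch of the indexing loop (client, make_point and save_checkpoint
# are assumed to exist; they are not part of this helper):
# for chunk in batched(all_records, size=100):
#     client.upsert(collection_name="pubmed_BGE",
#                   points=[make_point(r) for r in chunk])
#     save_checkpoint(chunk[-1]["pubmed_id"])   # resume point after a crash
```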

Summary of required Qdrant schema:

| Type | Name | Details |
| --- | --- | --- |
| Named dense vector | `dense_vector_BGE` | 1024-dim, cosine distance, on-disk (BGE-M3 weighted avg) |
| Named sparse vector | `sparse_vector` | BM25 with IDF modifier (`avg_len` = 792.69) |
| Payload field | `pubmed_id` | string |
| Payload field | `article_title` | string |
| Payload field | `abstract_text` | string |
| Payload field | `year` | string |

Models

Base model

| Model | Link |
| --- | --- |
| Llama 3.1 8B Instruct | unsloth/Llama-3.1-8B-Instruct |

Embedding model

| Model | Link |
| --- | --- |
| BGE-M3 | BAAI/bge-m3 |

LoRA adapters (fine-tuned on BioASQ)

| Adapter | Purpose | Link |
| --- | --- | --- |
| Snippet Extraction LoRA | Extracts relevant passages from titles and abstracts | sag-uniroma2/bioasq-snippet-extraction-lora |
| Answer Generation LoRA | Generates evidence-grounded answers from extracted passages | sag-uniroma2/bioasq-answer-generation-lora |

Note: Replace the placeholder links above with the actual Hugging Face repository URLs once the adapters are published.


Installation and Setup

Prerequisites

  • Python 3.10+
  • A CUDA-capable GPU (recommended: ≥ 24 GB VRAM for Llama 3.1 8B with LoRA)
  • A running Qdrant instance with the indexed PubMed collection (see above)

Install dependencies

```shell
git clone https://github.com/your-org/LocalBioRAG.git
cd LocalBioRAG
pip install -r requirements.txt
```

Configure paths

Open `retrieval_logic.py` and update the configuration block at the top of the file:

```python
QDRANT_HOST = "localhost"                          # your Qdrant server
QDRANT_PORT = 6333
MODEL_PATH  = "/path/to/Llama-3.1-8B-Instruct"    # or HuggingFace repo ID
LORA_PATH_EXTR = "/path/to/snippet-extraction-lora"
LORA_PATH_ANSW = "/path/to/answer-generation-lora"
EMBEDDING_MODEL_PATH = "/path/to/bge-m3"
```

(Optional) Set authentication credentials

```shell
export APP_USERNAME="your_username"
export APP_PASSWORD="your_password"
```

If not set, the defaults (`sagdemo` / `Demo2026!`) are used.


Running the Application

```shell
python app.py
```

The server starts on `http://0.0.0.0:5000`. Open it in your browser and authenticate with the configured credentials.


Demo

demo1.mp4
demo2.mp4

Repository Structure

```
LocalBioRAG/
├── app.py                  # Flask web server (routes, auth)
├── retrieval_logic.py      # Hybrid retrieval, snippet extraction, answer generation
├── requirements.txt        # Python dependencies
├── templates/
│   └── index.html          # Main UI template
├── static/
│   ├── style.css           # Stylesheet
│   └── script.js           # Client-side logic (search, modal, Excel export)
├── assets/
│   └── architecture.png    # Architecture diagram
└── README.md               # This file
```

Citation

If you use this code or system in your research, please cite:

```bibtex
@InProceedings{10.1007/978-3-032-21324-2_31,
author="Borazio, Federico
and Labbate, Francesco
and Croce, Danilo
and Basili, Roberto",
editor="Campos, Ricardo
and Jatowt, Adam
and Lan, Yanyan
and Aliannejadi, Mohammad
and Bauer, Christine
and MacAvaney, Sean
and Anand, Avishek
and Ren, Zhaochun
and Verberne, Suzan
and Bai, Nan
and Mansoury, Masoud",
title="Integrating AI and IR Paradigms for Sustainable and Trustworthy Accurate Access to Large Scale Biomedical Information",
booktitle="Advances in Information Retrieval",
year="2026",
publisher="Springer Nature Switzerland",
address="Cham",
pages="398--412",
isbn="978-3-032-21324-2"
}
```

License

This project is released under the MIT License.


Developed by SAG · University of Rome Tor Vergata
