A lightweight benchmark to compare three multimodal RAG retrieval patterns on the same corpus.
This repo distills ideas from small open-source multimodal RAG demos and turns them into a reproducible benchmark:
- Strategy A (`raw_multimodal`): retrieve directly with text, table, and image-summary signals.
- Strategy B (`summary_first`): convert all modalities into text-like summaries, then retrieve as pure text.
- Strategy C (`hybrid_rerank`): summary-first coarse retrieval, then a multimodal rerank (a minimal sketch follows this list).
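
As a sketch of Strategy C's two-stage control flow, assuming hypothetical `coarse_search` and `multimodal_score` helpers (the repo's actual function names may differ):

```python
from typing import Callable

Doc = dict  # one corpus record: id / text / table / image_summary / image_tags

def hybrid_rerank(
    query: str,
    coarse_search: Callable[[str, int], list[Doc]],
    multimodal_score: Callable[[str, Doc], float],
    top_k: int = 3,
    coarse_k: int = 20,
) -> list[Doc]:
    # Stage 1: cheap summary-only retrieval, deliberately over-fetching.
    candidates = coarse_search(query, coarse_k)
    # Stage 2: rescore the small candidate pool with text + table + image signals.
    candidates.sort(key=lambda doc: multimodal_score(query, doc), reverse=True)
    return candidates[:top_k]
```

Over-fetching with a cheap first pass and reranking a small pool is what keeps the hybrid strategy practical: the expensive multimodal scoring only touches `coarse_k` documents per query.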
This mirrors common industry choices when building practical multimodal RAG systems for PDFs/reports.
Features:

- Pure Python, no heavy model dependencies.
- Supports JSONL corpora with text/table/image_summary fields.
- Built-in retrieval metrics: Recall@K and MRR (a reference sketch follows this list).
- CLI for search and evaluation.
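
A minimal sketch of the two metrics for a single query, assuming a ranked list of retrieved IDs and a gold set (illustrative names, not necessarily the package's internals):

```python
def recall_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int) -> float:
    """Fraction of gold documents that appear in the top-k results."""
    if not gold_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & gold_ids)
    return hits / len(gold_ids)

def mrr(retrieved_ids: list[str], gold_ids: set[str]) -> float:
    """Reciprocal rank of the first gold document; 0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0
```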
Install:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -e .
```

Each line of the corpus JSONL is a single JSON object with the fields below (pretty-printed here for readability):

```json
{
"id": "doc-1",
"source": "paper-A.pdf#p2",
"text": "...",
"table": "...",
"image_summary": "...",
"image_tags": ["chart", "accuracy"]
}
```
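
For reference, a corpus in this format can be read with a few lines of plain Python (an illustrative sketch, not the package's loader):

```python
import json

def load_corpus(path: str) -> list[dict]:
    """Read one JSON document per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

docs = load_corpus("data/sample_corpus.jsonl")
print(docs[0]["id"], docs[0].get("image_tags"))
```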
QA file (JSON list):

```json
[
{
"question": "Which model gets the best OCR F1 in 2025 benchmark?",
"gold_doc_ids": ["doc-2"]
}
]
```
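
Putting the two files together, an evaluation loop over the QA set might look like the following sketch. It reuses the `recall_at_k` and `mrr` helpers from the metrics sketch above and assumes a `search(question, k)` function that returns ranked doc IDs (an assumption, not the repo's API):

```python
import json
from typing import Callable

def evaluate(search: Callable[[str, int], list[str]],
             qa_path: str, k: int = 3) -> tuple[float, float]:
    """Average Recall@k and MRR over all QA items."""
    with open(qa_path, encoding="utf-8") as f:
        qa_items = json.load(f)
    if not qa_items:
        return 0.0, 0.0
    recalls, rrs = [], []
    for item in qa_items:
        ranked = search(item["question"], k)
        gold = set(item["gold_doc_ids"])
        recalls.append(recall_at_k(ranked, gold, k))
        rrs.append(mrr(ranked, gold))
    return sum(recalls) / len(recalls), sum(rrs) / len(rrs)
```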
Run a search:

```bash
mmrag-bench search \
--corpus data/sample_corpus.jsonl \
--query "Which model has the best OCR F1?" \
--strategy hybrid_rerank \
--top-k 3
```

Run evaluation:

```bash
mmrag-bench eval \
--corpus data/sample_corpus.jsonl \
--qa data/sample_qa.json \
--top-k 3
```

Example output:

```
strategy=raw_multimodal recall@3=1.0000 mrr=1.0000
strategy=summary_first recall@3=1.0000 mrr=0.8333
strategy=hybrid_rerank recall@3=1.0000 mrr=1.0000
```

Project layout:

- `src/mmrag_strategy_bench/`: core package
- `data/`: sample corpus and sample QA sets
- `tests/`: sanity tests
Conceptually inspired by small open-source multimodal RAG demos; this implementation is original and intentionally lightweight for educational benchmarking.
License: MIT