
QA eval pipeline for retrieval#1754

Open
KyleZheng1284 wants to merge 5 commits intoNVIDIA:mainfrom
KyleZheng1284:feature/qa-harness-fullpage-pipeline

Conversation

@KyleZheng1284
Member

@KyleZheng1284 KyleZheng1284 commented Mar 30, 2026

Description

  • Adds a pluggable QA evaluation harness for measuring Retrieval quality end-to-end using multi-tier scoring.

Capabilities:

  • Multi-tier scoring -- Tier 1 retrieval recall (answer-in-context), Tier 2 programmatic (exact match + token F1), and Tier 3 LLM-as-judge (1-5 rubric) run together in a single pass at zero extra retrieval cost.
  • Full-page markdown retrieval -- Reconstructs complete document pages from NeMo Retriever extraction records via to_markdown_by_page()
  • Pluggable retrieval -- Any retrieval system (vector search, agentic, hybrid, BM25) plugs in by producing a standard JSON (queries → chunks); no harness code changes required.
  • Pluggable datasets -- Any CSV with query/answer columns loads via csv:path/to/file.csv; default ground truth is data/bo767_annotations.csv (1007 Q&A pairs, all modalities).
  • Pluggable LLMs -- Generator and judge models swap via env var or YAML config using litellm prefix routing (nvidia_nim/, openai/, huggingface/).
  • Multi-model sweep -- Set GEN_MODELS to evaluate multiple generators in a single run with side-by-side score comparisons.
  • Failure classification -- Per-query categorization into correct, partial, retrieval_miss, generation_miss, no_context, thinking_truncated to pinpoint exactly where the pipeline fails.
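As a point of reference for the Tier 2 programmatic scoring mentioned above, token F1 is typically computed over the multiset of answer tokens. The sketch below is a minimal illustration (plain whitespace tokenization, lowercasing), not the harness's actual implementation:

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a candidate answer and a gold answer.

    Precision/recall over the multiset of lowercased,
    whitespace-split tokens -- a standard programmatic QA metric.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty is a miss.
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Exact match is then simply `prediction.strip().lower() == reference.strip().lower()`, with F1 catching partially correct answers that exact match misses.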

Note - the CSV containing the Q&A pairs is a subset of the existing https://github.com/NVIDIA/NeMo-Retriever/blob/main/data/digital_corpora_10k_annotations.csv. There is currently a separate PR up with subset annotations for only the bo767-specific files - #1730

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured those are mimicked in the Helm values.yaml file?

@KyleZheng1284 KyleZheng1284 requested review from a team as code owners March 30, 2026 21:26
@KyleZheng1284 KyleZheng1284 requested a review from nkmcalli March 30, 2026 21:26
@copy-pr-bot

copy-pr-bot bot commented Mar 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


ingestor = create_ingestor(run_mode="batch")
ingestor = (
ingestor.files(documents)
Collaborator


How is this different from the normal batch_pipeline run? Why do we need this particular dataset and script here? The harness is not supposed to require more scripts to run; it should use batch_pipeline.py to run the graphs, or we could run the graphs directly.

Member Author


Thank you for pointing this out. Yes, batch_pipeline.py does seem to cover the exact same functionality as this script, and then some.


from nemo_retriever import create_ingestor

ingestor = create_ingestor(run_mode="batch")
Collaborator


Again, same question as for the other script: why do we need this? It could be created dynamically.

print(f" Page index key check: {matched}/{len(sampled)} sampled source_ids found")


def main() -> int:
Collaborator


Why not make this a tool we can call via import, instead of a main function?
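The refactor this comment asks for is the usual pattern of exposing the logic as an importable function and keeping `main()` as a thin CLI wrapper. The names below (`run_eval` and its parameters) are illustrative, not taken from the PR:

```python
import argparse


def run_eval(queries_path: str, output_path: str) -> dict:
    """Importable entry point: callers pass paths (or, later,
    dataframes) and get results back with no CLI involvement."""
    # The actual evaluation logic would live here; this stub just
    # echoes its inputs so the shape of the API is visible.
    return {"queries": queries_path, "output": output_path}


def main() -> int:
    # Thin wrapper: parse arguments, delegate to the importable function.
    parser = argparse.ArgumentParser()
    parser.add_argument("--queries", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()
    run_eval(args.queries, args.output)
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```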

BATCH_SIZE = 50


def load_queries(csv_path: str) -> list[dict]:
Collaborator


I feel like these tools would be good to have even outside the harness, no? If I run an evaluation without the harness, I will need all of these functions to actually get results, right?
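For context, a standalone `load_queries` matching the signature shown above might look like the following. This is a sketch assuming `query`/`answer` column names (consistent with the pluggable-datasets description in the PR); the PR's actual column handling may differ:

```python
import csv


def load_queries(csv_path: str) -> list[dict]:
    """Load query/answer pairs from a CSV into a list of dicts,
    so the loader is reusable outside the harness as well."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        return [
            {"query": row["query"], "answer": row["answer"]}
            for row in reader
        ]
```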

_HERE = os.path.dirname(os.path.abspath(__file__))


def main() -> int:
Collaborator


Seems like this is also a tool: I send it a parquet file (could be a dataframe) and it creates the page-level markdown. That is definitely useful outside the harness.
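The tool the reviewer describes, records in and page-level markdown out, could reduce to grouping extraction records by source and page. The pure-Python sketch below assumes each record carries `source_id`, `page`, and `text` fields; the real `to_markdown_by_page` operates on NeMo Retriever extraction records and its schema may differ:

```python
from collections import defaultdict


def to_markdown_by_page(records: list[dict]) -> dict[tuple[str, int], str]:
    """Group extraction records into one markdown string per
    (source_id, page), preserving record order within a page."""
    pages: dict[tuple[str, int], list[str]] = defaultdict(list)
    for rec in records:
        pages[(rec["source_id"], rec["page"])].append(rec["text"])
    # Join chunks with blank lines so each page reads as one document.
    return {key: "\n\n".join(chunks) for key, chunks in pages.items()}
```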

@@ -0,0 +1,117 @@
"""
Protocol definitions and dataclasses for the QA evaluation pipeline.
Collaborator


I think tools/harness is not where we want this code; it should live under the nv-ingest/nemo_retriever/ subfolder.

)


class QAEvalPipeline:
Collaborator


We need to refactor this to work as an AbstractOperator type. Inherit from that and put everything this does in the appropriate life cycle hook.
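The refactor suggested here and in the comments below follows a common operator pattern: a base class defining lifecycle hooks, with subclasses putting their core work in `process`. The `AbstractOperator` below is sketched from the review comments, not from the actual nemo_retriever code, and the hook names are assumptions:

```python
from abc import ABC, abstractmethod


class AbstractOperator(ABC):
    """Hypothetical operator base with a setup/process/teardown lifecycle."""

    def setup(self) -> None:
        """Acquire clients, load config. No-op by default."""

    @abstractmethod
    def process(self, records: list[dict]) -> list[dict]:
        """Transform a batch of records; the operator's core work."""

    def teardown(self) -> None:
        """Release resources. No-op by default."""


class QAEvalOperator(AbstractOperator):
    """QAEvalPipeline recast as an operator: scoring runs in process()."""

    def process(self, records: list[dict]) -> list[dict]:
        # Placeholder: mark each record as scored. The real operator
        # would run the multi-tier evaluation here.
        return [{**rec, "scored": True} for rec in records]
```

Under this pattern, LLMJudge and LiteLLMClient (discussed below) would be further subclasses whose `process` wraps their judge and generate calls respectively.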

Candidate answer: {candidate}"""


class LLMJudge:
Collaborator


This should be an AbstractOperator, with the judge method as its process portion. I should be able to pass a list of retrievals (records in a dataframe with query, golden_text, and candidates) and have the judge provide a response for each record.

]


class LiteLLMClient:
Collaborator


This should be an AbstractOperator class, and it should be able to run alone: if I pass it a dataframe with query and context, it should provide an answer for each row (basically your generate call, plumbed through the process lifecycle hook).
