
QA eval pipeline for retrieval#1754

Open
KyleZheng1284 wants to merge 5 commits intoNVIDIA:mainfrom
KyleZheng1284:feature/qa-harness-fullpage-pipeline

Conversation

@KyleZheng1284
Member

@KyleZheng1284 KyleZheng1284 commented Mar 30, 2026

Description

  • Adds a pluggable QA evaluation harness for measuring Retrieval quality end-to-end using multi-tier scoring.

Capabilities:

  • Multi-tier scoring -- Tier 1 retrieval recall (answer-in-context), Tier 2 programmatic (exact match + token F1), and Tier 3 LLM-as-judge (1-5 rubric) run together in a single pass at zero extra retrieval cost.
  • Full-page markdown retrieval -- Reconstructs complete document pages from NeMo Retriever extraction records via to_markdown_by_page()
  • Pluggable retrieval -- Any retrieval system (vector search, agentic, hybrid, BM25) plugs in by producing a standard JSON (queries → chunks); no harness code changes required.
  • Pluggable datasets -- Any CSV with query/answer columns loads via csv:path/to/file.csv; default ground truth is data/bo767_annotations.csv (1007 Q&A pairs, all modalities).
  • Pluggable LLMs -- Generator and judge models swap via env var or YAML config using litellm prefix routing (nvidia_nim/, openai/, huggingface/).
  • Multi-model sweep -- Set GEN_MODELS to evaluate multiple generators in a single run with side-by-side score comparisons.
  • Failure classification -- Per-query categorization into correct, partial, retrieval_miss, generation_miss, no_context, thinking_truncated to pinpoint exactly where the pipeline fails.
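As a point of reference for the Tier 2 programmatic scoring mentioned above, token F1 is typically computed over the multiset of answer tokens. The sketch below is a minimal illustration (plain whitespace tokenization, lowercasing), not the harness's actual implementation:

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a candidate answer and a gold answer.

    Precision/recall over the multiset of lowercased,
    whitespace-split tokens -- a standard programmatic QA metric.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty is a miss.
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Exact match is then simply `prediction.strip().lower() == reference.strip().lower()`, with F1 catching partially correct answers that exact match misses.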

Note - the CSV containing the Q&A pairs is a subset of the existing https://github.com/NVIDIA/NeMo-Retriever/blob/main/data/digital_corpora_10k_annotations.csv. There is currently a separate PR up with subset annotations for only the bo767-specific files - #1730

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured those are mimicked in the Helm values.yaml file?

@KyleZheng1284 KyleZheng1284 requested review from a team as code owners March 30, 2026 21:26
@KyleZheng1284 KyleZheng1284 requested a review from nkmcalli March 30, 2026 21:26
@copy-pr-bot

copy-pr-bot bot commented Mar 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


ingestor = create_ingestor(run_mode="batch")
ingestor = (
ingestor.files(documents)
Collaborator


How is this different from the normal batch_pipeline run? Why do we need this particular dataset and script here? The harness is not supposed to require more scripts to run; it should use batch_pipeline.py to run the graphs, or we could run the graphs directly.

Member Author


Thank you for pointing this out. Yes, batch_pipeline.py does seem to cover the exact same functionality as this script, and then some.


from nemo_retriever import create_ingestor

ingestor = create_ingestor(run_mode="batch")
Collaborator


Again, same question as for the other script: why do we need this? It could be created dynamically.

print(f" Page index key check: {matched}/{len(sampled)} sampled source_ids found")


def main() -> int:
Collaborator


Why not make this a tool we can call via import, instead of a main function?
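The refactor this comment asks for is the usual pattern of exposing the logic as an importable function and keeping `main()` as a thin CLI wrapper. The names below (`run_eval` and its parameters) are illustrative, not taken from the PR:

```python
import argparse


def run_eval(queries_path: str, output_path: str) -> dict:
    """Importable entry point: callers pass paths (or, later,
    dataframes) and get results back with no CLI involvement."""
    # The actual evaluation logic would live here; this stub just
    # echoes its inputs so the shape of the API is visible.
    return {"queries": queries_path, "output": output_path}


def main() -> int:
    # Thin wrapper: parse arguments, delegate to the importable function.
    parser = argparse.ArgumentParser()
    parser.add_argument("--queries", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()
    run_eval(args.queries, args.output)
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```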

BATCH_SIZE = 50


def load_queries(csv_path: str) -> list[dict]:
Collaborator


I feel like these tools would be good to have even outside the harness, no? If I run an evaluation without the harness, I will need all of these functions to actually get results, right?
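For context, a standalone `load_queries` matching the signature shown above might look like the following. This is a sketch assuming `query`/`answer` column names (consistent with the pluggable-datasets description in the PR); the PR's actual column handling may differ:

```python
import csv


def load_queries(csv_path: str) -> list[dict]:
    """Load query/answer pairs from a CSV into a list of dicts,
    so the loader is reusable outside the harness as well."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        return [
            {"query": row["query"], "answer": row["answer"]}
            for row in reader
        ]
```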

_HERE = os.path.dirname(os.path.abspath(__file__))


def main() -> int:
Collaborator


Seems like this is also a tool: I send it a parquet file (could be a dataframe) and it creates the page-level markdown. That is definitely useful outside the harness.
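The tool the reviewer describes, records in and page-level markdown out, could reduce to grouping extraction records by source and page. The pure-Python sketch below assumes each record carries `source_id`, `page`, and `text` fields; the real `to_markdown_by_page` operates on NeMo Retriever extraction records and its schema may differ:

```python
from collections import defaultdict


def to_markdown_by_page(records: list[dict]) -> dict[tuple[str, int], str]:
    """Group extraction records into one markdown string per
    (source_id, page), preserving record order within a page."""
    pages: dict[tuple[str, int], list[str]] = defaultdict(list)
    for rec in records:
        pages[(rec["source_id"], rec["page"])].append(rec["text"])
    # Join chunks with blank lines so each page reads as one document.
    return {key: "\n\n".join(chunks) for key, chunks in pages.items()}
```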

@@ -0,0 +1,117 @@
"""
Protocol definitions and dataclasses for the QA evaluation pipeline.
Collaborator


I think tools/harness is not where we want this code; it should live under the nv-ingest/nemo_retriever/ subfolder.

)


class QAEvalPipeline:
Collaborator


We need to refactor this to work as an AbstractOperator type. Inherit from that and put everything this does in the appropriate life cycle hook.
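The refactor suggested here and in the comments below follows a common operator pattern: a base class defining lifecycle hooks, with subclasses putting their core work in `process`. The `AbstractOperator` below is sketched from the review comments, not from the actual nemo_retriever code, and the hook names are assumptions:

```python
from abc import ABC, abstractmethod


class AbstractOperator(ABC):
    """Hypothetical operator base with a setup/process/teardown lifecycle."""

    def setup(self) -> None:
        """Acquire clients, load config. No-op by default."""

    @abstractmethod
    def process(self, records: list[dict]) -> list[dict]:
        """Transform a batch of records; the operator's core work."""

    def teardown(self) -> None:
        """Release resources. No-op by default."""


class QAEvalOperator(AbstractOperator):
    """QAEvalPipeline recast as an operator: scoring runs in process()."""

    def process(self, records: list[dict]) -> list[dict]:
        # Placeholder: mark each record as scored. The real operator
        # would run the multi-tier evaluation here.
        return [{**rec, "scored": True} for rec in records]
```

Under this pattern, LLMJudge and LiteLLMClient (discussed below) would be further subclasses whose `process` wraps their judge and generate calls respectively.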

Candidate answer: {candidate}"""


class LLMJudge:
Collaborator


This should be an AbstractOperator, with the judge method as its process portion. I should be able to pass a list of retrievals (records in a dataframe with query, golden_text, and candidates) and have the judge provide a response for each record.

]


class LiteLLMClient:
Collaborator


This should be an AbstractOperator class, and it should be able to run alone: if I pass it a dataframe with query and context, it should provide an answer for each row (basically your generate call, plumbed through the process lifecycle hook).
