QA eval pipeline for retrieval #1754
Open
KyleZheng1284 wants to merge 5 commits into NVIDIA:main from KyleZheng1284:feature/qa-harness-fullpage-pipeline
build_page_markdown_index.py (new file, +119 lines):

```python
"""Build a page-level markdown index from extracted Parquet files.

Loads extraction results saved by extract_bo767_parquet.py, groups records
by (source document, page number), renders each page via to_markdown_by_page,
and writes a JSON index mapping source_id -> page_number -> markdown.

Usage:
    python build_page_markdown_index.py

Env vars:
    PARQUET_DIR   Directory containing Parquet files (default: data/bo767_extracted)
    OUTPUT_FILE   Where to write the JSON index (default: data/bo767_page_markdown.json)
"""

import json
import os
import sys
import time
from collections import defaultdict
from pathlib import Path

import numpy as np

_HERE = os.path.dirname(os.path.abspath(__file__))


def main() -> int:
    parquet_dir = os.environ.get(
        "PARQUET_DIR",
        os.path.join(_HERE, "data", "bo767_extracted"),
    )
    output_file = os.environ.get(
        "OUTPUT_FILE",
        os.path.join(_HERE, "data", "bo767_page_markdown.json"),
    )

    print("=" * 60)
    print("Build Page Markdown Index")
    print("=" * 60)
    print(f"Parquet dir: {parquet_dir}")
    print(f"Output file: {output_file}")

    if not os.path.isdir(parquet_dir):
        print(f"ERROR: Parquet directory not found: {parquet_dir}", file=sys.stderr)
        return 1

    import pandas as pd
    from nemo_retriever.io.markdown import to_markdown_by_page

    parquet_files = sorted(Path(parquet_dir).rglob("*.parquet"))
    if not parquet_files:
        print(f"ERROR: No .parquet files found in {parquet_dir}", file=sys.stderr)
        return 1

    print(f"Found {len(parquet_files)} Parquet file(s)")

    t0 = time.monotonic()
    dfs = [pd.read_parquet(f) for f in parquet_files]
    df = pd.concat(dfs, ignore_index=True)
    print(f"Loaded {len(df)} records in {time.monotonic() - t0:.1f}s")
    print(f"Columns: {list(df.columns)}")

    path_col = "path" if "path" in df.columns else "source_id"
    if path_col not in df.columns:
        print("ERROR: Neither 'path' nor 'source_id' found in columns", file=sys.stderr)
        return 1

    def _ndarray_to_list(record: dict) -> dict:
        """Pandas reads Parquet list columns as numpy arrays.
        to_markdown_by_page checks isinstance(items, list), so convert them."""
        for key in ("table", "chart", "infographic", "tables", "charts", "infographics"):
            val = record.get(key)
            if isinstance(val, np.ndarray):
                record[key] = val.tolist()
        return record

    docs_grouped = defaultdict(list)
    for _, row in df.iterrows():
        source = str(row.get(path_col, ""))
        if source:
            docs_grouped[source].append(_ndarray_to_list(row.to_dict()))

    print(f"Grouped into {len(docs_grouped)} documents")

    t1 = time.monotonic()
    index: dict[str, dict[str, str]] = {}
    total_pages = 0

    for source_id, records in docs_grouped.items():
        try:
            pages = to_markdown_by_page(records)
        except Exception as exc:
            print(f"  WARNING: Failed to render {source_id}: {exc}")
            continue

        page_map: dict[str, str] = {}
        for page_number, markdown in pages.items():
            page_map[str(page_number)] = markdown
            total_pages += 1

        index[source_id] = page_map

    elapsed_render = time.monotonic() - t1
    print(f"Rendered {total_pages} pages from {len(index)} documents in {elapsed_render:.1f}s")

    os.makedirs(os.path.dirname(output_file), exist_ok=True)
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(index, f, ensure_ascii=False)

    size_mb = os.path.getsize(output_file) / 1024 / 1024
    print(f"\nIndex written to {output_file} ({size_mb:.1f} MB)")
    print(f"  Documents: {len(index)}")
    print(f"  Pages: {total_pages}")

    return 0


if __name__ == "__main__":
    sys.exit(main())
```
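For reference, a minimal sketch of consuming the resulting index, assuming the script's default output path; the specific lookups are illustrative, not part of the PR:

```python
import json

# Load the index written by the script above; the structure is
# source_id -> page_number (stringified) -> page markdown.
with open("data/bo767_page_markdown.json", encoding="utf-8") as f:
    index = json.load(f)

source_id = next(iter(index))     # any document in the index
pages = index[source_id]          # dict of page_number -> markdown
print(f"{source_id}: {len(pages)} pages")
print(pages.get("1", "")[:300])   # preview page 1, if present
```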
Seems like this is also a tool: I send it a Parquet file (or a DataFrame) and it creates the page-level markdown. This is definitely useful outside of the harness.
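A minimal sketch of that refactor, reusing the script's own grouping and numpy conversion logic; the function name pages_from_dataframe and its exact contract are hypothetical, not something the PR defines:

```python
from collections import defaultdict

import numpy as np
import pandas as pd

from nemo_retriever.io.markdown import to_markdown_by_page

_LIST_KEYS = ("table", "chart", "infographic", "tables", "charts", "infographics")


def pages_from_dataframe(df: pd.DataFrame, path_col: str = "path") -> dict[str, dict[str, str]]:
    """Hypothetical reusable entry point: extraction DataFrame in,
    source_id -> page_number -> markdown out (same shape as the script's index)."""
    docs: defaultdict[str, list[dict]] = defaultdict(list)
    for _, row in df.iterrows():
        record = row.to_dict()
        # Same numpy-array-to-list conversion the script applies per record.
        for key in _LIST_KEYS:
            if isinstance(record.get(key), np.ndarray):
                record[key] = record[key].tolist()
        source = str(record.get(path_col, ""))
        if source:
            docs[source].append(record)
    return {
        source_id: {str(page): md for page, md in to_markdown_by_page(records).items()}
        for source_id, records in docs.items()
    }
```

A caller could then pass either the output of pd.read_parquet(path) or an in-memory DataFrame, with the CLI script reduced to a thin wrapper around it.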