26 changes: 26 additions & 0 deletions tools/harness/README.md
@@ -290,6 +290,7 @@ DATASET_DIR=/custom/path uv run nv-ingest-harness-run --case=e2e
| `e2e_with_llm_summary` | E2E with LLM summarization via UDF | `active` section only | ✅ Available (YAML config) |
| `recall` | Recall evaluation against existing collections | `active` + `recall` sections | ✅ Available (YAML config) |
| `e2e_recall` | Fresh ingestion + recall evaluation | `active` + `recall` sections | ✅ Available (YAML config) |
| `qa_eval` | LLM answer quality evaluation (RAG) | `active` + `qa_eval` sections | ✅ Available (YAML config) |
| `page_elements` | nemotron-page-elements-v3 model benchmarking (PyPI) | None | ✅ Available |
| `graphic_elements` | nemotron-graphic-elements-v1 model benchmarking (PyPI) | None | ✅ Available |
| `table_structure` | nemotron-table-structure-v1 model benchmarking (PyPI) | None | ✅ Available |
@@ -360,6 +361,31 @@ active:
enable_caption: true
```

## QA Evaluation

QA evaluation measures **LLM answer quality** over the full RAG pipeline: retrieve context from a VDB, generate answers with one or more LLMs, and score answers against ground truth using **multi-tier scoring** (retrieval signal, programmatic metrics, LLM-as-judge). The recommended **bo767** workflow uses **full-page markdown** context for fair comparison with research baselines.
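To make the "programmatic metrics" tier concrete, here is a minimal sketch of token-overlap F1, a standard programmatic QA metric. This is an illustrative stand-in only -- the metrics `run_qa_eval.py` actually computes are documented in the QA README and may differ.

```python
from collections import Counter


def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between a generated answer and ground truth.

    Illustrative example of a programmatic QA metric; not necessarily
    the exact metric implemented by the harness.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(round(token_f1("revenue was 4.2 billion", "total revenue was 4.2 billion"), 3))
```

A metric like this is cheap to run over every answer, which is why programmatic scoring typically sits between the retrieval-signal check and the more expensive LLM-as-judge tier.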

The eval harness is **pluggable**: your retrieval stack (vector, hybrid, agentic, or anything else) only needs to emit a JSON file that matches the **specification** consumed by `run_qa_eval.py` (via `FileRetriever`). If the JSON shape and query keys line up with the ground-truth dataset, you can compare methods without changing the evaluator. See [`src/nv_ingest_harness/utils/qa/README.md`](src/nv_ingest_harness/utils/qa/README.md) for the full contract.

**Full documentation** (reproduction commands, env vars, retrieval JSON specification, architecture, harness CLI notes): [`src/nv_ingest_harness/utils/qa/README.md`](src/nv_ingest_harness/utils/qa/README.md).

**At a glance**

- **Default ground truth (standalone scripts):** [`data/bo767_annotations.csv`](../../data/bo767_annotations.csv) at the repo root -- the **bo767 annotations subset** we maintain for this benchmark (multi-modality Q&A over the bo767 PDFs). `QA_CSV` / `QA_DATASET` default to this path (resolved relative to the repo root from `tools/harness`).
- **Standalone scripts** in `tools/harness/`: `ingest_bo767.py`, `extract_bo767_parquet.py`, `build_page_markdown_index.py`, `export_retrieval_nemo.py`, `run_qa_eval.py`; optional `retrieve_and_export.py` when using the harness VDB stack.
- **Eval requires** `RETRIEVAL_FILE` and `NVIDIA_API_KEY`; all other knobs are in the QA README.
- **Full bo767 repro (ingest / LanceDB / NeMo Retriever):** requires a **Python 3.12 venv** with `nemo_retriever`, LanceDB, and `litellm`. See [Python environment](src/nv_ingest_harness/utils/qa/README.md#python-environment) in the QA README.

**Harness CLI (alternative)** -- uses `test_configs.yaml`; dataset and retrieval file paths may differ from standalone defaults. See the QA README.

```bash
uv run nv-ingest-harness-run --case=e2e --dataset=bo767
uv run python retrieve_and_export.py
uv run nv-ingest-harness-run --case=qa_eval --dataset=bo767
```

**Retrieval JSON** -- minimal shape: a top-level `queries` object mapping each ground-truth question string to `{ "chunks": ["...", ...] }` (plus optional metadata). Follow the full specification in the QA README so `run_qa_eval.py` can load the file unchanged.
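A minimal sketch of producing such a file from a custom retrieval stack. The `queries`/`chunks` keys follow the shape described above; the question text, chunk text, and the `metadata` field name are made up for illustration -- consult the QA README for the authoritative specification.

```python
import json

# Hypothetical retrieval output: each ground-truth question maps to the
# chunks your stack retrieved for it. Only "queries" and "chunks" are
# required by the minimal shape; extra fields are illustrative.
retrieval = {
    "queries": {
        "What does Figure 3 show?": {
            "chunks": [
                "## Figure 3\nQuarterly throughput by pipeline stage ...",
            ],
            "metadata": {"retriever": "hybrid"},  # optional, name assumed
        }
    }
}

with open("retrieval_example.json", "w", encoding="utf-8") as f:
    json.dump(retrieval, f, ensure_ascii=False, indent=2)
```

As long as the question strings match the ground-truth dataset keys, any retrieval method (vector, hybrid, agentic) can be evaluated this way without touching the evaluator.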

## Recall Testing

Recall testing evaluates retrieval accuracy against ground truth query sets. Two test cases are available:
119 changes: 119 additions & 0 deletions tools/harness/build_page_markdown_index.py
@@ -0,0 +1,119 @@
"""Build a page-level markdown index from extracted Parquet files.

Loads extraction results saved by extract_bo767_parquet.py, groups records
by (source document, page number), renders each page via to_markdown_by_page,
and writes a JSON index mapping source_id -> page_number -> markdown.

Usage:
python build_page_markdown_index.py

Env vars:
PARQUET_DIR Directory containing Parquet files (default: data/bo767_extracted)
OUTPUT_FILE Where to write the JSON index (default: data/bo767_page_markdown.json)
"""

import json
import os
import sys
import time
from collections import defaultdict
from pathlib import Path

import numpy as np

_HERE = os.path.dirname(os.path.abspath(__file__))


def main() -> int:
> **Collaborator comment:** seems like this is also a tool, where I send it a Parquet file (could be a DataFrame) and then it creates the page-level markdown. This is definitely useful outside of the harness.

parquet_dir = os.environ.get(
"PARQUET_DIR",
os.path.join(_HERE, "data", "bo767_extracted"),
)
output_file = os.environ.get(
"OUTPUT_FILE",
os.path.join(_HERE, "data", "bo767_page_markdown.json"),
)

print("=" * 60)
print("Build Page Markdown Index")
print("=" * 60)
print(f"Parquet dir: {parquet_dir}")
print(f"Output file: {output_file}")

if not os.path.isdir(parquet_dir):
print(f"ERROR: Parquet directory not found: {parquet_dir}", file=sys.stderr)
return 1

import pandas as pd
from nemo_retriever.io.markdown import to_markdown_by_page

parquet_files = sorted(Path(parquet_dir).rglob("*.parquet"))
if not parquet_files:
print(f"ERROR: No .parquet files found in {parquet_dir}", file=sys.stderr)
return 1

print(f"Found {len(parquet_files)} Parquet file(s)")

t0 = time.monotonic()
dfs = [pd.read_parquet(f) for f in parquet_files]
df = pd.concat(dfs, ignore_index=True)
print(f"Loaded {len(df)} records in {time.monotonic() - t0:.1f}s")
print(f"Columns: {list(df.columns)}")

path_col = "path" if "path" in df.columns else "source_id"
if path_col not in df.columns:
print("ERROR: Neither 'path' nor 'source_id' found in columns", file=sys.stderr)
return 1

def _ndarray_to_list(record: dict) -> dict:
"""Pandas reads Parquet list columns as numpy arrays.
to_markdown_by_page checks isinstance(items, list), so convert them."""
for key in ("table", "chart", "infographic", "tables", "charts", "infographics"):
val = record.get(key)
if isinstance(val, np.ndarray):
record[key] = val.tolist()
return record

docs_grouped = defaultdict(list)
for _, row in df.iterrows():
source = str(row.get(path_col, ""))
if source:
docs_grouped[source].append(_ndarray_to_list(row.to_dict()))

print(f"Grouped into {len(docs_grouped)} documents")

t1 = time.monotonic()
index: dict[str, dict[str, str]] = {}
total_pages = 0

for source_id, records in docs_grouped.items():
try:
pages = to_markdown_by_page(records)
except Exception as exc:
print(f" WARNING: Failed to render {source_id}: {exc}")
continue

page_map: dict[str, str] = {}
for page_number, markdown in pages.items():
page_map[str(page_number)] = markdown
total_pages += 1

index[source_id] = page_map

elapsed_render = time.monotonic() - t1
print(f"Rendered {total_pages} pages from {len(index)} documents in {elapsed_render:.1f}s")

os.makedirs(os.path.dirname(output_file), exist_ok=True)
with open(output_file, "w", encoding="utf-8") as f:
json.dump(index, f, ensure_ascii=False)

size_mb = os.path.getsize(output_file) / 1024 / 1024
print(f"\nIndex written to {output_file} ({size_mb:.1f} MB)")
print(f" Documents: {len(index)}")
print(f" Pages: {total_pages}")

return 0


if __name__ == "__main__":
sys.exit(main())