This folder demonstrates two evaluation patterns for document extraction:
- `field_accuracy` (built-in) - per-evalcase scoring with pass/fail per field
- `code_grader` (custom) - TP/TN/FP/FN metrics for cross-document aggregation
| Pattern | Use Case | Output |
|---|---|---|
| `field_accuracy` | Simple pass/fail scoring per evalcase | Score (0-1) per evalcase |
| `code_grader` with `details.metrics` | Aggregate precision/recall across documents | TP/TN/FP/FN per field |
From repo root:
```sh
# Pattern 1: Field accuracy (per-evalcase scoring)
bun agentv eval examples/features/document-extraction/evals/field-accuracy.eval.yaml

# Pattern 2: Confusion metrics (cross-document aggregation)
bun agentv eval examples/features/document-extraction/evals/confusion-metrics.eval.yaml

# Aggregate TP/TN/FP/FN into a table (only works with confusion-metrics.eval.yaml)
bun run examples/features/document-extraction/scripts/aggregate_metrics.ts \
  .agentv/results/eval_<timestamp>.jsonl
```

Pattern 1 uses the built-in `field_accuracy` evaluator for per-evalcase scoring:
```yaml
evaluators:
  - name: invoice_field_accuracy
    type: field-accuracy
    fields:
      - path: invoice_number
        match: exact
        required: true
      - path: invoice_date
        match: date
      - path: net_total
        match: numeric_tolerance
        tolerance: 1.0
```

Output: a score (0-1) per evalcase based on weighted field matches.
Best for: Quick validation, CI/CD gates, simple pass/fail checks.
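Under the hood, this style of scoring is just a weighted fraction of matched fields. A minimal sketch in TypeScript, assuming per-field matchers and an equal-weight default (the built-in evaluator's actual matcher semantics and weighting may differ):

```typescript
// Hedged sketch of weighted field-accuracy scoring; the matcher set and the
// default weight of 1 per field are assumptions, not the built-in's code.
type FieldSpec = {
  path: string;
  match: "exact" | "date" | "numeric_tolerance";
  tolerance?: number;
  weight?: number; // hypothetical knob; assume 1 when omitted
};

function fieldMatches(expected: unknown, actual: unknown, spec: FieldSpec): boolean {
  switch (spec.match) {
    case "exact":
      return expected === actual;
    case "date":
      // Normalize both values to timestamps; a real implementation would
      // need format-aware date parsing.
      return new Date(String(expected)).getTime() === new Date(String(actual)).getTime();
    case "numeric_tolerance":
      return Math.abs(Number(expected) - Number(actual)) <= (spec.tolerance ?? 0);
  }
}

// Flat lookup for brevity; dotted paths like supplier.name would need
// nested resolution in practice.
function fieldAccuracyScore(
  expected: Record<string, unknown>,
  actual: Record<string, unknown>,
  fields: FieldSpec[],
): number {
  const total = fields.reduce((sum, f) => sum + (f.weight ?? 1), 0);
  const matched = fields.reduce(
    (sum, f) => sum + (fieldMatches(expected[f.path], actual[f.path], f) ? f.weight ?? 1 : 0),
    0,
  );
  return matched / total; // the 0-1 score reported per evalcase
}
```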
Pattern 2 uses a custom `code_grader` that emits `details.metrics` with TP/TN/FP/FN counts per field:
```yaml
evaluators:
  - name: header_confusion
    type: code-grader
    command: ["bun", "run", "../graders/header_confusion_metrics.ts"]
    fields:
      - path: invoice_number
      - path: currency
      - path: supplier.name
```

Key requirement: all cases must use the same evaluator with the same fields to enable cross-document aggregation.
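To illustrate the contract, here is a hedged sketch of such a grader. The stdin/stdout JSON shapes are assumptions (see `graders/header_confusion_metrics.ts` for the real interface), but the counting convention, where a wrong value scores as both an FP and an FN, is consistent with the sample output below:

```typescript
// Hedged sketch of a code_grader that emits details.metrics per field.
// The I/O contract (evalcase JSON on stdin, verdict JSON on stdout) is an
// assumption; graders/header_confusion_metrics.ts is the real reference.
type Counts = { tp: number; tn: number; fp: number; fn: number };

const evalcase = JSON.parse(await Bun.stdin.text()) as {
  expected: Record<string, unknown>;
  actual: Record<string, unknown>;
  fields: { path: string }[];
};

const metrics: Record<string, Counts> = {};
for (const { path } of evalcase.fields) {
  const exp = evalcase.expected[path];
  const act = evalcase.actual[path];
  const c: Counts = { tp: 0, tn: 0, fp: 0, fn: 0 };
  if (exp == null && act == null) c.tn = 1;      // correctly extracted nothing
  else if (exp != null && act == null) c.fn = 1; // missed a value
  else if (exp == null && act != null) c.fp = 1; // hallucinated a value
  else if (exp === act) c.tp = 1;                // correct value
  else { c.fp = 1; c.fn = 1; }                   // wrong value: both FP and FN
  metrics[path] = c;
}

// Emitting the same field set for every evalcase is what makes the counts
// safely summable across documents.
const correct = Object.values(metrics).reduce((s, c) => s + c.tp + c.tn, 0);
console.log(JSON.stringify({
  score: correct / evalcase.fields.length,
  details: { metrics },
}));
```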
Output: an aggregate metrics table with per-field and corpus-level precision/recall:
```text
Processed 5 evaluation results from .agentv/results/eval_<timestamp>.jsonl

Field          | TP | TN | FP | FN | Precision | Recall |  F1   | Count
---------------+----+----+----+----+-----------+--------+-------+------
currency       |  4 |  0 |  1 |  1 |     0.800 |  0.800 | 0.800 |     5
gross_total    |  3 |  0 |  1 |  2 |     0.750 |  0.600 | 0.667 |     5
importer.name  |  5 |  0 |  0 |  0 |     1.000 |  1.000 | 1.000 |     5
invoice_date   |  5 |  0 |  0 |  0 |     1.000 |  1.000 | 1.000 |     5
invoice_number |  4 |  0 |  0 |  1 |     1.000 |  0.800 | 0.889 |     5
supplier.name  |  1 |  0 |  4 |  4 |     0.200 |  0.200 | 0.200 |     5

Total: TP=22 TN=0 FP=6 FN=8
Micro-Precision: 0.786
Micro-Recall:    0.733
Micro-F1:        0.759
Macro-F1:        0.759
```
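The micro metrics pool raw counts across all fields before taking ratios: micro-precision = TP / (TP + FP) = 22 / 28 ≈ 0.786, micro-recall = TP / (TP + FN) = 22 / 30 ≈ 0.733, and micro-F1 is their harmonic mean (≈ 0.759). Macro-F1 instead averages the six per-field F1 values: (0.800 + 0.667 + 1.000 + 1.000 + 0.889 + 0.200) / 6 ≈ 0.759. The two coincide here by chance; in general macro-F1 weights every field equally, so a single bad field (like supplier.name above) drags it down harder than it drags micro-F1.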
Best for: Measuring extraction accuracy across a document corpus, comparing model versions.
The `aggregate_metrics.ts` script only works with evaluators that emit `details.metrics`:

```sh
bun run scripts/aggregate_metrics.ts results.jsonl [options]
```

Options:

- `--evaluator <name>` - filter to a specific evaluator
- `--format csv` - output as CSV instead of a table
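Conceptually, the script is a fold over the JSONL records: sum each field's counts, then derive the ratios. A minimal sketch, assuming each record carries a grader verdict with `details.metrics` shaped as in the grader sketch above (`aggregate_metrics.ts` is the authoritative implementation):

```typescript
// Hedged sketch of the aggregation pass; the JSONL record shape is an
// assumption based on the grader sketch above.
type Counts = { tp: number; tn: number; fp: number; fn: number };

const totals: Record<string, Counts> = {};
const text = await Bun.file(process.argv[2]).text();

for (const line of text.split("\n").filter(Boolean)) {
  const record = JSON.parse(line);
  const metrics = record?.details?.metrics as Record<string, Counts> | undefined;
  if (!metrics) continue; // evaluator didn't emit details.metrics
  for (const [field, c] of Object.entries(metrics)) {
    const t = (totals[field] ??= { tp: 0, tn: 0, fp: 0, fn: 0 });
    t.tp += c.tp; t.tn += c.tn; t.fp += c.fp; t.fn += c.fn;
  }
}

for (const [field, { tp, fp, fn }] of Object.entries(totals)) {
  const precision = tp / (tp + fp || 1);
  const recall = tp / (tp + fn || 1);
  const f1 = (2 * precision * recall) / (precision + recall || 1);
  console.log(field, { precision, recall, f1 });
}
```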
Files in this folder:

- Datasets: `evals/field-accuracy.eval.yaml` (field accuracy patterns), `evals/confusion-metrics.eval.yaml` (TP/TN/FP/FN aggregation)
- Target: `mock_extractor.ts`
- Fixtures: `fixtures/`
- Graders: `graders/header_confusion_metrics.ts` (emits TP/TN/FP/FN), `graders/fuzzy_match.ts` and `graders/multi_field_fuzzy.ts` (fuzzy matching), `graders/line_item_matching.ts` (array matching)
- Scripts: `scripts/aggregate_metrics.ts`