Add evidence spans to extraction output #53

@michaeldistel

Feature Request

Summary

Include evidence spans in extraction output to link each extracted field to its source location in the input text. This provides character offsets and text snippets showing where data was extracted from.

Motivation

  • Verification: Users can verify extracted data by reviewing the source text
  • Debugging: Easier to identify why extraction succeeded or failed for specific fields
  • Transparency: Clear provenance showing which text led to which extracted values
  • Auditing: Critical for applications requiring traceability (legal, medical, financial documents)

Proposed Output Format

{
  "data": {
    "invoice_number": "INV-2024-001",
    "total": 1250.00,
    "date": "2024-01-15"
  },
  "evidence": {
    "invoice_number": {
      "text": "Invoice Number: INV-2024-001",
      "start": 45,
      "end": 73
    },
    "total": {
      "text": "Total Amount: $1,250.00",
      "start": 234,
      "end": 257
    },
    "date": {
      "text": "Date: January 15, 2024",
      "start": 12,
      "end": 34
    }
  }
}
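
The format above does not state an offset convention. The `date` entry (22 characters, 12 to 34) is consistent with half-open character offsets, where `end - start` equals the length of `text`. A minimal Python sketch under that assumption follows; `EvidenceSpan` and `locate_evidence` are hypothetical names used only for illustration, not part of the project.

```python
from typing import Optional, TypedDict


class EvidenceSpan(TypedDict):
    text: str   # snippet copied verbatim from the input
    start: int  # offset of the first character (inclusive)
    end: int    # offset just past the last character (exclusive)


def locate_evidence(source: str, snippet: str) -> Optional[EvidenceSpan]:
    """Return the first exact occurrence of snippet in source, or None."""
    start = source.find(snippet)
    if start == -1:
        return None
    return {"text": snippet, "start": start, "end": start + len(snippet)}


# Made-up input text, only to show how the offsets relate to the snippet.
source = "Acme Corp\nDate: January 15, 2024\nInvoice Number: INV-2024-001\n"
span = locate_evidence(source, "Date: January 15, 2024")
```

With this convention, `source[span["start"]:span["end"]]` always reproduces `span["text"]`, which gives consumers a cheap consistency check.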

Implementation Considerations

  1. LLM prompt changes: Instruct the model to return both the extracted value and its source span
  2. Validation: Verify evidence spans are valid (within input bounds, non-overlapping)
  3. Schema extension: Add an optional includeEvidence flag to schema definitions
  4. Performance: Minimal overhead, since the LLM can return evidence in the same response
  5. Backward compatibility: Make evidence optional to avoid breaking existing users
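
The validation step in item 2 could look roughly like the sketch below. It assumes half-open `[start, end)` character offsets and adds one check not listed above: that `text` matches the input slice. `validate_evidence` is a hypothetical helper, not an existing function in the project.

```python
from typing import Dict, List


def validate_evidence(source: str, evidence: Dict[str, dict]) -> List[str]:
    """Return a list of problems found; an empty list means all spans passed."""
    problems: List[str] = []
    checked = []  # (start, end, field) for spans that passed the bounds check

    for field, span in evidence.items():
        start, end = span["start"], span["end"]
        if not (0 <= start < end <= len(source)):
            problems.append(f"{field}: span [{start}, {end}) is out of bounds")
            continue
        if source[start:end] != span["text"]:
            problems.append(f"{field}: text does not match input[{start}:{end}]")
        checked.append((start, end, field))

    # Sort by start offset and track the furthest end seen so far, so a span
    # nested inside or straddling an earlier span is reported as an overlap.
    checked.sort()
    furthest_end, furthest_field = 0, None
    for start, end, field in checked:
        if furthest_field is not None and start < furthest_end:
            problems.append(f"{field}: overlaps with {furthest_field}")
        if end > furthest_end:
            furthest_end, furthest_field = end, field
    return problems
```

Returning a list of problems rather than raising lets a caller decide whether to drop only the offending evidence entries or reject the whole extraction. The non-overlap rule may be too strict when two fields come from the same sentence, so it might need to be configurable.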

Example Use Cases

  • Legal contracts: Link extracted clauses to exact paragraph locations
  • Medical records: Trace diagnosis codes to supporting text
  • Financial documents: Verify amounts and dates from source
  • Quality assurance: Automated review of extraction accuracy

Technical Approach

Option 1: Extend LLM prompt to request evidence with each field
Option 2: Post-processing fuzzy match to find source spans (less reliable)
Option 3: Structured output format where each field includes value and evidence

Recommendation: Option 1 or 3 for deterministic results aligned with project principles.
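
To illustrate Option 3, here is a rough Python sketch that reshapes a per-field response into the `data` / `evidence` layout proposed earlier. The per-field shape (`value` plus `evidence`), the `split_structured_output` helper, and the `include_evidence` parameter (standing in for the `includeEvidence` flag) are all assumptions made for illustration.

```python
from typing import Any, Dict


def split_structured_output(fields: Dict[str, Dict[str, Any]],
                            include_evidence: bool = True) -> Dict[str, Any]:
    """Convert {field: {"value": ..., "evidence": {...}}} into the
    {"data": ..., "evidence": ...} layout from the proposed output format."""
    result: Dict[str, Any] = {
        "data": {name: entry["value"] for name, entry in fields.items()}
    }
    if include_evidence:
        # Fields with missing or null evidence are left out of the evidence map.
        result["evidence"] = {
            name: entry["evidence"]
            for name, entry in fields.items()
            if entry.get("evidence") is not None
        }
    return result


# Hypothetical per-field response shape for Option 3.
llm_fields = {
    "date": {
        "value": "2024-01-15",
        "evidence": {"text": "Date: January 15, 2024", "start": 12, "end": 34},
    },
    "total": {"value": 1250.00, "evidence": None},
}
output = split_structured_output(llm_fields)
```

Keeping `data` unchanged when evidence is disabled means existing consumers that only read `data` would not need to change.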

Related Work

  • spaCy's entity spans with character offsets
  • Information extraction systems with provenance tracking
  • PDF extraction tools with bounding boxes
