Feature Request
Summary
Include evidence spans in extraction output to link each extracted field to its source location in the input text. This provides character offsets and text snippets showing where data was extracted from.
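To make the offset idea concrete, here is a minimal sketch of what a single evidence span could look like in code. It assumes 0-based, half-open character offsets (so `input[start:end]` equals the snippet); the proposal does not fix a convention, and `EvidenceSpan` and `make_span` are hypothetical names used only for illustration.

```python
from typing import Optional, TypedDict


class EvidenceSpan(TypedDict):
    text: str   # snippet copied from the input
    start: int  # 0-based offset of the first character (assumed convention)
    end: int    # offset one past the last character (assumed half-open)


def make_span(source: str, snippet: str) -> Optional[EvidenceSpan]:
    """Locate the first occurrence of `snippet` in `source`, or return None."""
    start = source.find(snippet)
    if start == -1:
        return None
    return {"text": snippet, "start": start, "end": start + len(snippet)}


source = "Date: January 15, 2024\nInvoice Number: INV-2024-001"
span = make_span(source, "Invoice Number: INV-2024-001")

# Slicing the input with the offsets recovers the snippet exactly.
assert span is not None
assert source[span["start"]:span["end"]] == span["text"]
```

With this convention, `end - start` always equals the snippet length, which gives a cheap consistency check on any span.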
Motivation
- Verification: Users can verify extracted data by reviewing the source text
- Debugging: Easier to identify why extraction succeeded or failed for specific fields
- Transparency: Clear provenance showing which text led to which extracted values
- Auditing: Critical for applications requiring traceability (legal, medical, financial documents)
Proposed Output Format
{
"data": {
"invoice_number": "INV-2024-001",
"total": 1250.00,
"date": "2024-01-15"
},
"evidence": {
"invoice_number": {
"text": "Invoice Number: INV-2024-001",
"start": 45,
"end": 73
},
"total": {
"text": "Total Amount: $1,250.00",
"start": 234,
"end": 257
},
"date": {
"text": "Date: January 15, 2024",
"start": 12,
"end": 34
}
}
}
Implementation Considerations
- LLM prompt changes: Instruct the model to return both the extracted value and its source span
- Validation: Verify evidence spans are valid (within input bounds, non-overlapping)
- Schema extension: Add an optional includeEvidence flag to schema definitions
- Performance: Overhead should be small, since the LLM can return evidence in the same response
- Backward compatibility: Make evidence optional to avoid breaking existing users
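The validation step above could be sketched roughly as follows. This is an illustration under assumptions, not the project's implementation: `validate_evidence` is a hypothetical helper, offsets are assumed to be 0-based and half-open, and the check that `text` matches the sliced input is an extra safeguard that the list above does not explicitly require.

```python
from typing import List, Mapping


def validate_evidence(source: str, evidence: Mapping[str, Mapping[str, object]]) -> List[str]:
    """Return a list of problems found; an empty list means the evidence is valid."""
    problems: List[str] = []
    checked = []  # (start, end, field) for spans that passed the bounds check

    for field, span in evidence.items():
        start, end = span.get("start"), span.get("end")
        if not isinstance(start, int) or not isinstance(end, int):
            problems.append(f"{field}: start/end must be integers")
            continue
        if not (0 <= start < end <= len(source)):
            problems.append(f"{field}: span [{start}, {end}) is outside the input")
            continue
        # Assumed extra check: the snippet must equal the sliced input.
        if source[start:end] != span.get("text"):
            problems.append(f"{field}: text does not match input[{start}:{end}]")
        checked.append((start, end, field))

    # Walk spans in start order, tracking the furthest end seen so far.
    checked.sort()
    max_end, max_field = 0, None
    for start, end, field in checked:
        if max_field is not None and start < max_end:
            problems.append(f"{field}: overlaps {max_field}")
        if end > max_end:
            max_end, max_field = end, field
    return problems


source = "Date: January 15, 2024\nTotal Amount: $1,250.00"
good = {
    "date": {"text": "Date: January 15, 2024", "start": 0, "end": 22},
    "total": {"text": "Total Amount: $1,250.00", "start": 23, "end": 46},
}
print(validate_evidence(source, good))  # prints []
```

Returning a list of problems rather than raising on the first one keeps evidence optional in practice: a caller can drop or flag invalid spans while still returning the extracted data.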
Example Use Cases
- Legal contracts: Link extracted clauses to exact paragraph locations
- Medical records: Trace diagnosis codes to supporting text
- Financial documents: Verify amounts and dates from source
- Quality assurance: Automated review of extraction accuracy
Technical Approach
Option 1: Extend LLM prompt to request evidence with each field
Option 2: Post-processing fuzzy match to find source spans (less reliable)
Option 3: Structured output format where each field includes value and evidence
Recommendation: Option 1 or 3 for deterministic results aligned with project principles.
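As a rough sketch of how Option 3 could map onto the proposed output format, the snippet below takes a per-field response (each field carrying `value` and `evidence`) and splits it into `data` and `evidence` maps. The per-field shape, the `split_fields` helper, and the `include_evidence` parameter are assumptions made for illustration; only the final `data`/`evidence` layout comes from the proposal.

```python
from typing import Any, Dict, Mapping


def split_fields(fields: Mapping[str, Mapping[str, Any]],
                 include_evidence: bool = False) -> Dict[str, Any]:
    """Convert per-field {"value", "evidence"} objects into the proposed output."""
    result: Dict[str, Any] = {
        "data": {name: field["value"] for name, field in fields.items()}
    }
    if include_evidence:
        # Fields without evidence are omitted from the evidence map.
        result["evidence"] = {
            name: field["evidence"]
            for name, field in fields.items()
            if field.get("evidence") is not None
        }
    return result


# Hypothetical model response in the Option 3 shape.
model_fields = {
    "invoice_number": {
        "value": "INV-2024-001",
        "evidence": {"text": "Invoice Number: INV-2024-001", "start": 0, "end": 28},
    },
    "total": {"value": 1250.00, "evidence": None},
}

print(split_fields(model_fields))                         # data only
print(split_fields(model_fields, include_evidence=True))  # data plus evidence
```

Defaulting `include_evidence` to `False` keeps the output identical to today's for existing callers, which matches the backward-compatibility goal.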
Related Work
- spaCy's entity spans with character offsets
- Information extraction systems with provenance tracking
- PDF extraction tools with bounding boxes