Add multi-pass extraction for large inputs #41

@michaeldistel

Description

Problem

Some documents are too large to process in a single LLM call, even with truncation. Multi-pass extraction is needed to produce comprehensive results from large inputs.

Proposed Solution

Implement chunking and multi-pass extraction:

  1. Intelligent chunking

    • Split by semantic boundaries (paragraphs, sections)
    • Maintain overlap for context (e.g., 100 tokens)
    • Preserve document structure
  2. Multi-pass extraction

    • Extract from each chunk independently
    • Merge results intelligently:
      • Deduplicate extracted entities
      • Resolve conflicts (use confidence scores)
      • Combine arrays/lists
    • Track chunk provenance
  3. Configuration

```typescript
interface ChunkingConfig {
  enabled: boolean;
  chunkSize: number;      // tokens per chunk
  overlap: number;        // token overlap between chunks
  mergeStrategy: 'concat' | 'dedupe' | 'smart';
}
```
  4. Output metadata
    • Report chunks processed
    • Show merge conflicts resolved
    • Confidence scores per field
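
For illustration, step 1 (intelligent chunking) could be sketched as below. This is a minimal sketch, not the project's API: `chunkText` and `countTokens` are assumed names, and token counting is approximated by whitespace-separated words, where a real implementation would use the target model's tokenizer. It splits on paragraph boundaries and carries trailing paragraphs forward to maintain the configured overlap. The `ChunkingConfig` interface from above is re-declared to keep the sketch self-contained.

```typescript
interface ChunkingConfig {
  enabled: boolean;
  chunkSize: number;      // tokens per chunk
  overlap: number;        // token overlap between chunks
  mergeStrategy: 'concat' | 'dedupe' | 'smart';
}

function countTokens(s: string): number {
  // Rough proxy: whitespace-separated words. A real implementation
  // would use the target model's tokenizer.
  return s.split(/\s+/).filter(Boolean).length;
}

function chunkText(text: string, config: ChunkingConfig): string[] {
  if (!config.enabled) return [text];
  // Split on semantic boundaries (blank-line-separated paragraphs).
  const paragraphs = text.split(/\n{2,}/).filter(p => p.trim().length > 0);
  const chunks: string[] = [];
  let current: string[] = [];
  let currentTokens = 0;
  for (const para of paragraphs) {
    const t = countTokens(para);
    if (currentTokens + t > config.chunkSize && current.length > 0) {
      chunks.push(current.join('\n\n'));
      // Carry trailing paragraphs forward until ~`overlap` tokens of
      // context are preserved for the next chunk.
      const carried: string[] = [];
      let carriedTokens = 0;
      for (let i = current.length - 1; i >= 0 && carriedTokens < config.overlap; i--) {
        carried.unshift(current[i]);
        carriedTokens += countTokens(current[i]);
      }
      current = carried;
      currentTokens = carriedTokens;
    }
    current.push(para);
    currentTokens += t;
  }
  if (current.length > 0) chunks.push(current.join('\n\n'));
  return chunks;
}
```

Note that a single paragraph larger than `chunkSize` becomes its own oversized chunk here; a production version would fall back to sentence- or token-level splitting in that case.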

Acceptance Criteria

  • Chunking algorithm implementation
  • Multi-pass extraction pipeline
  • Result merging with deduplication
  • Conflict resolution strategy
  • CLI flag: --enable-chunking
  • CLI flag: --chunk-size N
  • Metadata in extraction results
  • Tests for various document sizes
  • Documentation with examples
  • Performance benchmarks
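
The "result merging with deduplication" and "conflict resolution" criteria could look like the following sketch. The `ExtractedEntity` shape and `mergeEntities` name are assumptions for illustration: entities are deduplicated by a naively normalized key, conflicts are resolved by keeping the higher-confidence occurrence, and `chunkIndex` tracks provenance as described above.

```typescript
interface ExtractedEntity {
  value: string;
  confidence: number;   // 0..1, as reported by the model
  chunkIndex: number;   // provenance: which chunk produced this entity
}

function mergeEntities(perChunk: ExtractedEntity[][]): ExtractedEntity[] {
  const byKey = new Map<string, ExtractedEntity>();
  for (const entities of perChunk) {
    for (const e of entities) {
      // Naive normalization; a smarter strategy might use fuzzy matching.
      const key = e.value.trim().toLowerCase();
      const existing = byKey.get(key);
      // Conflict resolution: keep the higher-confidence occurrence.
      if (!existing || e.confidence > existing.confidence) {
        byKey.set(key, e);
      }
    }
  }
  return [...byKey.values()];
}
```

The winning entity retains its `chunkIndex`, so merge conflicts and their resolutions can be reported in the output metadata.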
