Problem
Some documents are too large to process in a single LLM call, even with truncation. Multi-pass extraction is needed to produce comprehensive results.
Dependencies
- Requires: Add basic token counting and budget warnings #39 (token counting)
- Requires: Add smart input truncation and schema simplification #40 (truncation strategies)
Proposed Solution
Implement chunking and multi-pass extraction:
- Intelligent chunking
  - Split by semantic boundaries (paragraphs, sections)
  - Maintain overlap for context (e.g., 100 tokens)
  - Preserve document structure
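The chunking step could be sketched as below. `approxTokens` is a stand-in heuristic for the real token counter from #39, and `chunkByParagraphs` is a hypothetical name, not an existing API:

```typescript
// Rough stand-in for the tokenizer from #39 (~4 chars per token).
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Split on paragraph boundaries, packing paragraphs into chunks of at
// most chunkSize tokens, and carrying trailing paragraphs forward so
// consecutive chunks share roughly `overlap` tokens of context.
function chunkByParagraphs(text: string, chunkSize: number, overlap: number): string[] {
  const paragraphs = text.split(/\n{2,}/);
  const chunks: string[] = [];
  let current: string[] = [];
  let currentTokens = 0;

  for (const para of paragraphs) {
    const tokens = approxTokens(para);
    if (currentTokens + tokens > chunkSize && current.length > 0) {
      chunks.push(current.join("\n\n"));
      // Carry paragraphs from the end of this chunk until the overlap
      // budget is covered, so the next chunk keeps some context.
      const carried: string[] = [];
      let carriedTokens = 0;
      for (let i = current.length - 1; i >= 0 && carriedTokens < overlap; i--) {
        carried.unshift(current[i]);
        carriedTokens += approxTokens(current[i]);
      }
      current = carried;
      currentTokens = carriedTokens;
    }
    current.push(para);
    currentTokens += tokens;
  }
  if (current.length > 0) chunks.push(current.join("\n\n"));
  return chunks;
}
```

Splitting only at paragraph boundaries means a single paragraph longer than `chunkSize` would still overflow; a real implementation would fall back to sentence-level splits in that case.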
- Multi-pass extraction
  - Extract from each chunk independently
  - Merge results intelligently:
    - Deduplicate extracted entities
    - Resolve conflicts (use confidence scores)
    - Combine arrays/lists
  - Track chunk provenance
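A minimal sketch of the merge step, assuming each chunk yields named entities with confidence scores; the `ExtractedEntity` shape and field names are illustrative, not an existing API:

```typescript
// Illustrative per-chunk extraction result.
interface ExtractedEntity {
  name: string;       // field name, used as the dedup key
  value: string;
  confidence: number; // 0..1
  chunkIndex: number; // provenance: which chunk produced this entity
}

// Deduplicate by entity name; when two chunks disagree on a value,
// keep the higher-confidence extraction.
function mergeEntities(perChunk: ExtractedEntity[][]): ExtractedEntity[] {
  const byName = new Map<string, ExtractedEntity>();
  for (const entities of perChunk) {
    for (const entity of entities) {
      const existing = byName.get(entity.name);
      if (!existing || entity.confidence > existing.confidence) {
        byName.set(entity.name, entity);
      }
    }
  }
  return [...byName.values()];
}
```

This covers the `'dedupe'` strategy; `'concat'` would simply flatten the per-chunk arrays, and `'smart'` could additionally combine list-valued fields rather than picking one winner.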
- Configuration

  ```typescript
  interface ChunkingConfig {
    enabled: boolean;
    chunkSize: number; // tokens per chunk
    overlap: number;   // token overlap between chunks
    mergeStrategy: 'concat' | 'dedupe' | 'smart';
  }
  ```

- Output metadata
  - Report chunks processed
  - Show merge conflicts resolved
  - Confidence scores per field
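One possible shape for that metadata; the `ChunkingMetadata` name and fields are hypothetical and only mirror the bullets above:

```typescript
// Illustrative metadata attached to a multi-pass extraction result.
interface ChunkingMetadata {
  chunksProcessed: number;                  // how many chunks were extracted
  conflictsResolved: number;                // disagreements settled during merge
  fieldConfidence: Record<string, number>;  // per-field confidence, 0..1
}

const meta: ChunkingMetadata = {
  chunksProcessed: 4,
  conflictsResolved: 1,
  fieldConfidence: { date: 0.9, author: 0.8 },
};
```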
Acceptance Criteria
- Chunking algorithm implementation
- Multi-pass extraction pipeline
- Result merging with deduplication
- Conflict resolution strategy
- CLI flag: `--enable-chunking`
- CLI flag: `--chunk-size N`
- Metadata in extraction results
- Tests for various document sizes
- Documentation with examples
- Performance benchmarks
Related
- Parent: Improve LLM prompt engineering and clarity #30 (closed - split into focused issues)
- Prerequisites: Add basic token counting and budget warnings #39, Add smart input truncation and schema simplification #40
- Note: This is the most complex feature; consider gating it behind an optional/experimental flag