Improve Chunking Strategy #48
Open
Description
Overview
The current chunking implementation uses a basic RecursiveCharacterTextSplitter with fixed parameters (chunkSize: 500, chunkOverlap: 10). This approach has known limitations that reduce retrieval quality, particularly for documents with complex structure, anaphoric references, or mixed content types.
Current Implementation
// hybrid_fast_seed.ts, lines 2289-2292
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500,
  chunkOverlap: 10,
});

Limitations of Current Approach:
- Fixed chunk size regardless of document structure or content type
- No awareness of semantic boundaries (paragraphs, sections, concepts)
- Poor handling of anaphoric references ("it", "the author", "this approach")
- Arbitrary splits can separate related information
- No differentiation between document types (books, papers, code)
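
To make the "semantic boundaries" limitation concrete, here is a minimal sketch of a splitter that packs whole sentences into chunks instead of cutting at a fixed character count. The function names (`splitIntoSentences`, `sentenceAwareChunks`) and the naive punctuation-based sentence detection are assumptions for illustration only; they are not part of the current code or of LangChain.

```typescript
// Hypothetical helper: naive sentence detection based on ., !, or ? terminators.
// Real text (abbreviations, decimals, code) would need a proper sentence tokenizer.
function splitIntoSentences(text: string): string[] {
  const matches = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) || [];
  return matches.map((s) => s.trim()).filter((s) => s.length > 0);
}

// Hypothetical helper: packs whole sentences into chunks of at most maxChars.
// A single sentence longer than maxChars is kept whole rather than cut.
function sentenceAwareChunks(text: string, maxChars: number): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const sentence of splitIntoSentences(text)) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (current === "" || candidate.length <= maxChars) {
      current = candidate;
    } else {
      chunks.push(current);
      current = sentence;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

const demoChunks = sentenceAwareChunks(
  "The model is small. It runs fast. This approach scales.",
  35,
);
console.log(demoChunks);
```

With a 35-character limit, "It runs fast." stays in the same chunk as the sentence that introduces "the model", which is the kind of grouping a fixed-size cut cannot guarantee.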
Research Foundation
Reference: Choosing the Right Chunking Strategy: A Comprehensive Guide to RAG Optimization
Chunking Strategy Taxonomy
| Strategy | Description | Use Case |
|---|---|---|
| Early Chunking (current) | Split first, embed separately | Simple, fast, but loses context |
| Late Chunking | Embed full doc first, then apply chunk boundaries | Preserves full context, 10-12% accuracy improvement |
| Contextual Chunking | Add LLM-generated context prefix to each chunk | Works with any embedding API, 2-18% improvement |
| Adaptive Chunking | Respects semantic boundaries (sentences, paragraphs) | Better coherence for prose |
| Topic-Based Chunking | Groups content by semantic topic | Multi-topic documents |
| Entity-Based Chunking | Groups by entity mentions | Entity-focused queries |
| Code-Specific Chunking | Respects function/class boundaries | Source code documents |
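
As a rough illustration of the Contextual Chunking row, the sketch below prepends a short context string to each chunk before it is embedded. The `ContextGenerator` type and `addContextPrefixes` function are hypothetical names; the generator stands in for whatever LLM call the project would use, and the prefix format (context, blank line, chunk) is an assumption.

```typescript
// Hypothetical callback: given the full document and one chunk, returns a
// short description that situates the chunk (in practice, an LLM call).
type ContextGenerator = (document: string, chunk: string) => Promise<string>;

// Prepends the generated context to each chunk. Chunks for which the
// generator returns an empty string are passed through unchanged.
async function addContextPrefixes(
  document: string,
  chunks: string[],
  generateContext: ContextGenerator,
): Promise<string[]> {
  const contextualized: string[] = [];
  for (const chunk of chunks) {
    const context = (await generateContext(document, chunk)).trim();
    contextualized.push(context ? `${context}\n\n${chunk}` : chunk);
  }
  return contextualized;
}
```

Because the context is plain text added before embedding, this approach does not depend on a particular embedding model, which matches the "works with any embedding API" note in the table. The cost is one generator call per chunk.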
Strategy Selection by Document Type
| Document Type | Recommended Strategy | Rationale |
|---|---|---|
| Technical books | Contextual + Adaptive | Cross-references, technical terms |
| Academic papers | Contextual Chunking | Citations, anaphoric references |
| Source code | Code-Specific Chunking | Function/class boundaries |
| Mixed content | Hybrid Chunking | Combines multiple approaches |
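
The selection table could be encoded as a simple lookup so the seeding script picks a strategy per document. This is only a sketch: the `DocumentType` and `Strategy` identifiers and the `selectStrategies` function are hypothetical, and how a document's type gets detected is left open here.

```typescript
// Hypothetical identifiers mirroring the rows of the table above.
type DocumentType = "technical_book" | "academic_paper" | "source_code" | "mixed";
type Strategy = "contextual" | "adaptive" | "code_specific" | "hybrid";

// Direct encoding of the "Recommended Strategy" column.
const STRATEGIES_BY_TYPE: Record<DocumentType, Strategy[]> = {
  technical_book: ["contextual", "adaptive"],
  academic_paper: ["contextual"],
  source_code: ["code_specific"],
  mixed: ["hybrid"],
};

// Returns a copy so callers cannot mutate the shared table.
function selectStrategies(type: DocumentType): Strategy[] {
  return [...STRATEGIES_BY_TYPE[type]];
}

console.log(selectStrategies("technical_book"));
```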
Success Metrics
| Metric | Current | Target |
|---|---|---|
| Retrieval accuracy (top-5) | Baseline | +10-15% |
| Anaphoric reference resolution | Poor | Good |
| Cross-section queries | Often fails | Improved |
| Processing speed | Fast | ≤20% slower acceptable |
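
To track these metrics, the top-5 figure needs a concrete definition. The sketch below assumes it means hit rate at k: the fraction of test queries for which at least one relevant chunk appears in the first k results. That definition, the `EvalCase` shape, and both function names are assumptions, and the issue does not say whether "+10-15%" is relative or in percentage points.

```typescript
// Hypothetical evaluation record: one query with its known-relevant chunk IDs
// and the ranked IDs the retriever returned.
interface EvalCase {
  relevantIds: string[];
  retrievedIds: string[];
}

// Fraction of cases with at least one relevant ID among the top k results.
function hitRateAtK(cases: EvalCase[], k: number = 5): number {
  if (cases.length === 0) return 0;
  let hits = 0;
  for (const c of cases) {
    const topK = c.retrievedIds.slice(0, k);
    if (topK.some((id) => c.relevantIds.includes(id))) hits += 1;
  }
  return hits / cases.length;
}

// Checks the processing-speed target: at most maxSlowdown (20% by default)
// slower than the baseline.
function withinSlowdownBudget(
  baselineMs: number,
  candidateMs: number,
  maxSlowdown: number = 0.2,
): boolean {
  return candidateMs <= baselineMs * (1 + maxSlowdown);
}
```

Running `hitRateAtK` on the same query set before and after a chunking change gives a like-for-like comparison against the baseline.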
References
- Choosing the Right Chunking Strategy (Dev.to)
- Late Chunking Research Paper
- Jina AI Late Chunking
- LangChain Text Splitters
- Current implementation: hybrid_fast_seed.ts, lines 2289-2296