Improve Chunking Strategy #48

@m2ux

Overview

The current chunking implementation uses a basic RecursiveCharacterTextSplitter with fixed parameters (chunkSize: 500, chunkOverlap: 10). This approach has known limitations that reduce retrieval quality, particularly for documents with complex structure, anaphoric references, or mixed content types.

Current Implementation

// hybrid_fast_seed.ts, lines 2289-2292
const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 500,
    chunkOverlap: 10,
});

Limitations of Current Approach:

  • Fixed chunk size regardless of document structure or content type
  • No awareness of semantic boundaries (paragraphs, sections, concepts)
  • Poor handling of anaphoric references ("it", "the author", "this approach")
  • Arbitrary splits can separate related information
  • No differentiation between document types (books, papers, code)
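
To make the anaphora problem concrete, here is a minimal sketch of a fixed-window splitter. It is a simplified stand-in, not the RecursiveCharacterTextSplitter used in the project (which tries separators such as paragraph and line breaks before falling back to raw characters); `fixedSizeSplit` and the sample text are hypothetical and only illustrate how a size-driven cut can separate a pronoun from its antecedent.

```typescript
// Simplified, hypothetical stand-in for a fixed-size splitter.
// It cuts purely by character count and ignores sentence or paragraph boundaries.
function fixedSizeSplit(text: string, chunkSize: number, chunkOverlap: number): string[] {
  if (chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be smaller than chunkSize");
  }
  const chunks: string[] = [];
  const step = chunkSize - chunkOverlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// Illustrative input: "It" in the second sentence refers to "Late chunking".
const sample =
  "Late chunking embeds the full document first. " +
  "It then applies chunk boundaries to the token embeddings.";

// With a small window, a later chunk can contain "It then applies..."
// while the antecedent lives only in an earlier chunk.
const demoChunks = fixedSizeSplit(sample, 50, 10);
console.log(demoChunks);
```

A 10-character overlap carries over only a word or two from the previous chunk, so it rarely restores the antecedent on its own.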

Research Foundation

Reference: Choosing the Right Chunking Strategy: A Comprehensive Guide to RAG Optimization

Chunking Strategy Taxonomy

| Strategy | Description | Use Case |
|---|---|---|
| Early Chunking (current) | Split first, embed separately | Simple, fast, but loses context |
| Late Chunking | Embed full doc first, then apply chunk boundaries | Preserves full context, 10-12% accuracy improvement |
| Contextual Chunking | Add LLM-generated context prefix to each chunk | Works with any embedding API, 2-18% improvement |
| Adaptive Chunking | Respects semantic boundaries (sentences, paragraphs) | Better coherence for prose |
| Topic-Based Chunking | Groups content by semantic topic | Multi-topic documents |
| Entity-Based Chunking | Groups by entity mentions | Entity-focused queries |
| Code-Specific Chunking | Respects function/class boundaries | Source code documents |
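
As one possible reading of the Adaptive Chunking row, the sketch below packs whole sentences into chunks up to a character budget instead of cutting at a fixed offset. The function name `adaptiveChunk`, the naive punctuation-based sentence detection, and the 500-character default are assumptions for illustration; a real implementation would likely use a proper sentence tokenizer and also respect paragraph and heading boundaries.

```typescript
// Hypothetical sketch: sentence-aware chunking with a soft size limit.
function adaptiveChunk(text: string, maxChars: number = 500): string[] {
  // Naive sentence detection: a run of non-terminator characters followed by
  // one or more of . ! ? (or a trailing fragment without a terminator).
  const matches = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) || [];
  const sentences = matches.map((s) => s.trim()).filter((s) => s.length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (candidate.length > maxChars && current) {
      // Adding this sentence would exceed the budget: close the chunk
      // at the sentence boundary and start a new one.
      chunks.push(current);
      current = sentence;
    } else {
      // Either it fits, or the chunk is empty and the sentence alone is
      // oversized; in that case it is kept whole rather than cut mid-sentence.
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Chunks produced this way always start and end on a sentence boundary, at the cost of uneven chunk lengths.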

Strategy Selection by Document Type

| Document Type | Recommended Strategy | Rationale |
|---|---|---|
| Technical books | Contextual + Adaptive | Cross-references, technical terms |
| Academic papers | Contextual Chunking | Citations, anaphoric references |
| Source code | Code-Specific Chunking | Function/class boundaries |
| Mixed content | Hybrid Chunking | Combines multiple approaches |
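
The mapping above could be encoded as a small lookup, sketched below. The type names, string identifiers, and the `selectStrategies` function are hypothetical and do not exist in the codebase; how a document's type gets detected is left open here.

```typescript
// Hypothetical identifiers mirroring the rows of the selection table.
type DocumentType = "technical_book" | "academic_paper" | "source_code" | "mixed";
type ChunkingStrategy = "contextual" | "adaptive" | "code_specific" | "hybrid";

// One entry per table row; a document type may combine several strategies.
const STRATEGY_BY_TYPE: Record<DocumentType, ChunkingStrategy[]> = {
  technical_book: ["contextual", "adaptive"],
  academic_paper: ["contextual"],
  source_code: ["code_specific"],
  mixed: ["hybrid"],
};

// Returns a copy so callers cannot mutate the shared table.
function selectStrategies(docType: DocumentType): ChunkingStrategy[] {
  return [...STRATEGY_BY_TYPE[docType]];
}
```

Using a `Record` keyed by `DocumentType` means the compiler flags any document type that is added later without a matching strategy entry.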

Success Metrics

| Metric | Current | Target |
|---|---|---|
| Retrieval accuracy (top-5) | Baseline | +10-15% |
| Anaphoric reference resolution | Poor | Good |
| Cross-section queries | Often fails | Improved |
| Processing speed | Fast | ≤20% slower acceptable |
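
To track the first and last rows, something like the sketch below might be used. It assumes "retrieval accuracy (top-5)" means the fraction of queries with at least one relevant chunk among the first five results, and that the speed budget compares wall-clock processing times; the issue does not define either, and it also does not say whether +10-15% is relative or absolute. `EvalCase`, `hitRateAtK`, and `withinSpeedBudget` are hypothetical names.

```typescript
// Hypothetical evaluation record: one labeled query with the retriever's ranked output.
interface EvalCase {
  query: string;
  relevantChunkIds: string[];
  retrievedChunkIds: string[]; // ranked, best first
}

// Fraction of queries for which at least one relevant chunk appears in the top k.
function hitRateAtK(cases: EvalCase[], k: number = 5): number {
  if (cases.length === 0) return 0;
  let hits = 0;
  for (const c of cases) {
    const topK = c.retrievedChunkIds.slice(0, k);
    if (topK.some((id) => c.relevantChunkIds.includes(id))) hits += 1;
  }
  return hits / cases.length;
}

// True when the candidate is at most maxSlowdown (20% by default) slower than the baseline.
function withinSpeedBudget(baselineMs: number, candidateMs: number, maxSlowdown: number = 0.2): boolean {
  return candidateMs <= baselineMs * (1 + maxSlowdown);
}
```

Running `hitRateAtK` on the same labeled query set before and after a chunking change gives the two numbers needed to compare against the target.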
