Improve Chunking Strategy #48

## Overview

The current chunking implementation uses a basic `RecursiveCharacterTextSplitter` with fixed parameters (`chunkSize: 500`, `chunkOverlap: 10`, both measured in characters by default). This approach has known limitations that reduce retrieval quality, particularly for documents with complex structure, anaphoric references, or mixed content types.

## Current Implementation

```typescript
// hybrid_fast_seed.ts, lines 2289-2292
const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 500,
    chunkOverlap: 10,
});
```

**Limitations of Current Approach:**
- Fixed chunk size regardless of document structure or content type
- No awareness of semantic boundaries (paragraphs, sections, concepts)
- Poor handling of anaphoric references ("it", "the author", "this approach")
- Arbitrary splits can separate related information
- No differentiation between document types (books, papers, code)
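The anaphora problem can be illustrated with a simplified fixed-size splitter. This is a minimal sketch, not the LangChain implementation: `fixedSizeChunks` is a hypothetical helper that cuts purely by character count, and the sample text and the small sizes (50 and 5) are chosen only to make the effect visible.

```typescript
// Simplified stand-in for fixed-size splitting (hypothetical helper,
// NOT how RecursiveCharacterTextSplitter is implemented).
function fixedSizeChunks(text: string, chunkSize: number, chunkOverlap: number): string[] {
    if (chunkOverlap >= chunkSize) {
        throw new Error("chunkOverlap must be smaller than chunkSize");
    }
    const chunks: string[] = [];
    const step = chunkSize - chunkOverlap;
    for (let start = 0; start < text.length; start += step) {
        chunks.push(text.slice(start, start + chunkSize));
        // Stop once the current window reaches the end of the text.
        if (start + chunkSize >= text.length) break;
    }
    return chunks;
}

const sample =
    "Late chunking embeds the whole document first. " +
    "It then pools token vectors per chunk.";

// Small sizes so the split is easy to see in a short string.
const chunks = fixedSizeChunks(sample, 50, 5);
console.log(chunks);
```

The second chunk contains the pronoun "It" but not its referent "Late chunking", so an embedding of that chunk alone has no way to know what "It" refers to.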

## Research Foundation

Reference: [Choosing the Right Chunking Strategy: A Comprehensive Guide to RAG Optimization](https://dev.to/vishalmysore/choosing-the-right-chunking-strategy-a-comprehensive-guide-to-rag-optimization-4nan)

### Chunking Strategy Taxonomy

| Strategy | Description | Use Case / Trade-off |
|----------|-------------|----------------------|
| **Early Chunking** (current) | Split first, then embed each chunk separately | Simple and fast, but each chunk loses surrounding context |
| **Late Chunking** | Embed the full document first, then apply chunk boundaries | Preserves full-document context; reported 10-12% accuracy improvement |
| **Contextual Chunking** | Add an LLM-generated context prefix to each chunk | Works with any embedding API; reported 2-18% improvement |
| **Adaptive Chunking** | Respects semantic boundaries (sentences, paragraphs) | Better coherence for prose |
| **Topic-Based Chunking** | Groups content by semantic topic | Multi-topic documents |
| **Entity-Based Chunking** | Groups content by entity mentions | Entity-focused queries |
| **Code-Specific Chunking** | Respects function/class boundaries | Source code documents |
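Two of the strategies in the table can be sketched in a few lines. The code below is an illustration under stated assumptions, not a proposed implementation: `adaptiveChunks` and `contextualize` are hypothetical names, paragraphs are assumed to be separated by blank lines, and the `describe` callback stands in for whatever LLM call would generate the context prefix.

```typescript
// Adaptive chunking sketch: pack whole paragraphs into chunks of at most
// maxChars, never cutting inside a paragraph.
function adaptiveChunks(text: string, maxChars: number): string[] {
    const paragraphs = text
        .split(/\n\s*\n/)
        .map((p) => p.trim())
        .filter((p) => p.length > 0);
    const chunks: string[] = [];
    let current = "";
    for (const para of paragraphs) {
        const candidate = current ? `${current}\n\n${para}` : para;
        if (candidate.length <= maxChars || current === "") {
            // Either it fits, or it is an oversized paragraph that starts
            // a new chunk and is kept whole.
            current = candidate;
        } else {
            chunks.push(current);
            current = para;
        }
    }
    if (current) chunks.push(current);
    return chunks;
}

// Contextual chunking sketch: prefix each chunk with a short description of
// where it sits in the document. The callback is an assumption; in practice
// it would wrap an LLM request.
type ContextFn = (documentText: string, chunk: string) => Promise<string>;

async function contextualize(
    documentText: string,
    chunks: string[],
    describe: ContextFn,
): Promise<string[]> {
    return Promise.all(
        chunks.map(async (chunk) => {
            const context = await describe(documentText, chunk);
            return `${context}\n\n${chunk}`;
        }),
    );
}
```

The two steps compose: split with `adaptiveChunks`, then pass the result through `contextualize` before embedding. Oversized paragraphs are kept whole here for simplicity; a real implementation would likely fall back to sentence-level splitting.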

### Strategy Selection by Document Type

| Document Type | Recommended Strategy | Rationale |
|---------------|---------------------|-----------|
| Technical books | Contextual + Adaptive | Cross-references, technical terms |
| Academic papers | Contextual Chunking | Citations, anaphoric references |
| Source code | Code-Specific Chunking | Function/class boundaries |
| Mixed content | Hybrid Chunking | Combines multiple approaches |
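The selection table maps naturally onto a lookup. The sketch below is one possible encoding; the type names, string keys, and the `selectStrategies` function are hypothetical, and how a document's type gets detected in the first place is left open.

```typescript
// Hypothetical identifiers for the document types and strategies in the table.
type DocumentType = "technical_book" | "academic_paper" | "source_code" | "mixed";
type Strategy = "contextual" | "adaptive" | "code_specific" | "hybrid";

// Direct transcription of the selection table.
const STRATEGY_BY_TYPE: Record<DocumentType, Strategy[]> = {
    technical_book: ["contextual", "adaptive"],
    academic_paper: ["contextual"],
    source_code: ["code_specific"],
    mixed: ["hybrid"],
};

// Unknown document types fall back to the "mixed" row (an assumption, the
// table itself does not define a default).
function selectStrategies(docType?: DocumentType): Strategy[] {
    return STRATEGY_BY_TYPE[docType ?? "mixed"];
}

console.log(selectStrategies("technical_book"));
```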

## Success Metrics

| Metric | Current | Target |
|--------|---------|--------|
| Retrieval accuracy (top-5) | Baseline | +10-15% |
| Anaphoric reference resolution | Poor | Good |
| Cross-section queries | Often fails | Improved |
| Processing speed | Fast | ≤20% slower acceptable |
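To make the accuracy and speed targets checkable, something like the following could be run against a labeled query set. This is a sketch under assumptions: "retrieval accuracy (top-5)" is interpreted here as hit rate at 5 (at least one relevant chunk in the top five results), the targets are read as relative changes against the baseline, and `EvalQuery`, `hitRateAtK`, and `relativeChange` are hypothetical names.

```typescript
// One evaluation query: the chunk IDs judged relevant, and the IDs the
// retriever returned, best first.
interface EvalQuery {
    relevantIds: string[];
    rankedIds: string[];
}

// Fraction of queries with at least one relevant ID among the top k results.
function hitRateAtK(queries: EvalQuery[], k: number): number {
    if (queries.length === 0) return 0;
    let hits = 0;
    for (const q of queries) {
        const topK = q.rankedIds.slice(0, k);
        if (q.relevantIds.some((id) => topK.includes(id))) hits += 1;
    }
    return hits / queries.length;
}

// Relative change of a candidate measurement against a non-zero baseline,
// e.g. 0.12 means 12% higher than baseline.
function relativeChange(baseline: number, candidate: number): number {
    return (candidate - baseline) / baseline;
}
```

With these helpers, the accuracy target corresponds to `relativeChange(baselineHitRate, newHitRate) >= 0.10`, and the speed budget to `relativeChange(baselineMs, newMs) <= 0.20`.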

## References

- [Choosing the Right Chunking Strategy (Dev.to)](https://dev.to/vishalmysore/choosing-the-right-chunking-strategy-a-comprehensive-guide-to-rag-optimization-4nan)
- [Late Chunking Research Paper](https://arxiv.org/abs/2409.04701)
- [Reconstructing Context: Evaluating Advanced Chunking Strategies for RAG](https://arxiv.org/pdf/2504.19754)
- [LangChain Text Splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers/)
- Current implementation: `hybrid_fast_seed.ts` lines 2289-2296
