A semantic Markdown chunker that preserves document structure for RAG and LLM pipelines. Never breaks code blocks, tables, or headers—every chunk stays semantically complete.
```bash
pip install chunkana
```

````python
from chunkana import chunk_markdown

text = """
# My Document
## Section One
Some content here.
## Section Two
More content with code:
```python
def hello():
    print("Hello!")
```
"""

chunks = chunk_markdown(text)
for chunk in chunks:
    print(f"Lines {chunk.start_line}-{chunk.end_line}: {chunk.metadata['header_path']}")
    print(f"Content: {chunk.content[:100]}...")
````

- Structure-safe chunks: never split code blocks, tables, lists, or LaTeX blocks
- Useful metadata: `header_path`, `content_type`, line ranges, and strategy used
- Multiple strategies: automatic selection or manual override
- Hierarchy support: navigate a chunk tree or flatten it for indexing
- Streaming options: process large files without loading them all into memory
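To make the `header_path` metadata concrete, here is a minimal standard-library sketch of how such a path could be derived. It assumes a path is the chain of enclosing ATX headers joined by `/` (as in the `/Introduction/Overview` example this README gives). The function name `header_paths` is hypothetical, and this is not Chunkana's actual implementation.

```python
import re

FENCE = "`" * 3  # fenced code block marker
HEADER_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def header_paths(markdown: str) -> list[tuple[int, str]]:
    """Return (line_number, header_path) for each ATX header outside fenced code."""
    stack: list[tuple[int, str]] = []  # (level, title) of the currently open headers
    paths: list[tuple[int, str]] = []
    in_fence = False
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue  # a "# comment" inside code is not a header
        match = HEADER_RE.match(line)
        if match is None:
            continue
        level, title = len(match.group(1)), match.group(2)
        # Close every open header at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        paths.append((lineno, "/" + "/".join(t for _, t in stack)))
    return paths


doc = "\n".join([
    "# Introduction",
    "## Overview",
    "Some text.",
    FENCE + "python",
    "# not a header",
    FENCE,
    "## Details",
    "# Usage",
])
paths = header_paths(doc)
```

Tracking fence state matters here: without it, a Python comment inside a code block would be mistaken for a top-level header and corrupt every path after it.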
Problem: Traditional splitters break Markdown structure, fragmenting code blocks, tables, and lists.
Solution: Chunkana preserves semantic boundaries while providing rich metadata for retrieval:
- ✅ Never breaks code blocks, tables, or LaTeX formulas
- ✅ Preserves hierarchy with header paths like `/Introduction/Overview`
- ✅ Rich metadata for filtering, ranking, and context
- ✅ Streaming support for large documents
- ✅ Multiple output formats (JSON, Dify-compatible, etc.)
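The problem and solution above can be illustrated with a small standard-library sketch of fence-aware splitting. It is a simplified stand-in that only handles fenced code blocks, not tables, lists, or LaTeX, and `split_fence_aware` is a hypothetical name rather than part of Chunkana's API.

```python
FENCE = "`" * 3  # fenced code block marker


def split_fence_aware(markdown: str, max_chars: int) -> list[str]:
    """Greedily pack lines into chunks of about max_chars without cutting a fenced block.

    A fenced block longer than max_chars is kept whole, so a chunk may exceed the limit.
    """
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    in_fence = False
    for line in markdown.splitlines(keepends=True):
        # Only start a new chunk when we are outside any fenced block.
        if current and not in_fence and size + len(line) > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
    if current:
        chunks.append("".join(current))
    return chunks


doc = (
    "intro line\n"
    + FENCE + "python\n"
    + "x = 1\ny = 2\nz = 3\n"
    + FENCE + "\n"
    + "outro line\n"
)
chunks = split_fence_aware(doc, max_chars=20)
```

With `max_chars=20`, a fixed-width splitter would cut this document in the middle of the opening fence line. The sketch instead emits the code block as a single oversized chunk, trading a strict size limit for structural integrity.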
- Semantic preservation: Headers, lists, tables, code blocks, and LaTeX stay intact
- Smart strategies: Auto-selects optimal chunking approach per document
- Hierarchical navigation: Build chunk trees for section-aware retrieval
- Overlap metadata: Context continuity without content duplication
- Memory efficient: Stream large files without loading everything into RAM
- Code-context binding: Keep code with the explanation around it
- Adaptive sizing: Optional size tuning based on document complexity
- Table grouping: Keep related tables together for better retrieval
- Obsidian cleanup: Strip `^block-id` references when desired
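One way to read "overlap metadata" in the list above is that neighbouring context travels alongside a chunk instead of being copied into its text. The sketch below shows that idea under stated assumptions: the metadata keys `previous_content` and `next_content` and the helper `attach_overlap` are hypothetical names chosen for illustration, not confirmed Chunkana fields.

```python
def attach_overlap(chunks: list[str], overlap_size: int) -> list[dict]:
    """Wrap each chunk in a record whose metadata carries neighbouring context.

    The chunk content itself is left untouched, so no text is duplicated in it.
    """
    records = []
    for i, content in enumerate(chunks):
        metadata: dict = {"chunk_index": i}
        if overlap_size > 0:
            if i > 0:
                # Tail of the previous chunk (hypothetical key name).
                metadata["previous_content"] = chunks[i - 1][-overlap_size:]
            if i < len(chunks) - 1:
                # Head of the next chunk (hypothetical key name).
                metadata["next_content"] = chunks[i + 1][:overlap_size]
        records.append({"content": content, "metadata": metadata})
    return records


records = attach_overlap(["alpha beta", "gamma delta", "epsilon"], overlap_size=4)
```

Keeping the overlap out of `content` means an index built on the chunk text never stores the same sentence twice, while a retriever can still stitch context back in at query time.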
Custom configuration:

```python
from chunkana import chunk_markdown, ChunkConfig

config = ChunkConfig(
    max_chunk_size=2048,
    min_chunk_size=256,
    overlap_size=100,
)
chunks = chunk_markdown(text, config)
```

Analysis and metrics:

```python
from chunkana import analyze_markdown, chunk_with_metrics

analysis = analyze_markdown(text)
print(f"Code ratio: {analysis.code_ratio}")

chunks, metrics = chunk_with_metrics(text)
print(f"Average chunk size: {metrics.avg_chunk_size}")
```

Hierarchical chunking:

```python
from chunkana import MarkdownChunker, ChunkConfig

chunker = MarkdownChunker(ChunkConfig(validate_invariants=True))
result = chunker.chunk_hierarchical(text)

# Get leaf chunks for indexing
flat_chunks = result.get_flat_chunks()

# Navigate the hierarchy
root = result.get_chunk(result.root_id)
children = result.get_children(result.root_id)
```

Streaming large files:

```python
from chunkana import MarkdownChunker

chunker = MarkdownChunker()
for chunk in chunker.chunk_file_streaming("large_document.md"):
    print(f"Chunk {chunk.metadata['chunk_index']}: {chunk.size} chars")
```

Advanced configuration:

```python
from chunkana import ChunkConfig
from chunkana.adaptive_sizing import AdaptiveSizeConfig
from chunkana.table_grouping import TableGroupingConfig

config = ChunkConfig(
    max_chunk_size=4096,
    overlap_size=200,
    enable_code_context_binding=True,
    preserve_latex_blocks=True,
    strip_obsidian_block_ids=True,
    use_adaptive_sizing=True,
    adaptive_config=AdaptiveSizeConfig(base_size=1500, code_weight=0.4),
    group_related_tables=True,
    table_grouping_config=TableGroupingConfig(max_distance_lines=10),
)
```

Output formats:

```python
from chunkana import chunk_markdown
from chunkana.renderers import render_json, render_dify_style

chunks = chunk_markdown(text)

# JSON format
json_output = render_json(chunks)

# Dify-compatible format
dify_output = render_dify_style(chunks)
```

Primary convenience functions:

- `chunk_markdown(text, config=None)` → `List[Chunk]`
- `chunk_hierarchical(text, config=None)` → `HierarchicalChunkingResult`
- `chunk_file(path, config=None)` / `chunk_file_streaming(path, config=None)`
- `analyze_markdown(text, config=None)` → `ContentAnalysis`
- `chunk_with_metrics(text, config=None)` → `(List[Chunk], ChunkingMetrics)`
- `iter_chunks(text, config=None)` → `Iterator[Chunk]`
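The iterator-returning entry points listed above (`iter_chunks`, `chunk_file_streaming`) are typically consumed in fixed-size batches when feeding an index, so that only one batch is held in memory at a time. The helper below is a generic standard-library sketch. `in_batches` is a hypothetical name and not part of Chunkana, and it works with any iterable.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of up to batch_size items without materialising the input."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice pulls at most batch_size items; an empty list means the input is exhausted.
    while batch := list(islice(iterator, batch_size)):
        yield batch


# Stand-in for a chunk iterator: any lazy iterable works.
batches = list(in_batches(iter(range(5)), batch_size=2))
```

In a pipeline, the `range` stand-in would be replaced by the chunk iterator, for example `in_batches(chunker.chunk_file_streaming("large_document.md"), 64)`, with each batch handed to whatever indexing call the application uses.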
Each chunk includes rich metadata for retrieval:
```json
{
  "content": "# Section\nContent here...",
  "start_line": 1,
  "end_line": 10,
  "size": 156,
  "metadata": {
    "chunk_index": 0,
    "content_type": "section",
    "header_path": "/Introduction/Overview",
    "header_level": 2,
    "strategy": "structural",
    "has_code": false,
    "overlap_size": 100
  }
}
```

- Python 3.12+
- No external dependencies for core functionality
- Optional: `pip install "chunkana[docs]"` for documentation tools
- Dify: Direct compatibility with Dify workflows
- n8n: Automation pipeline integration
- Windmill: Batch processing workflows
- Quick Start Guide - Get started in minutes
- Configuration - All configuration options
- Strategies - How chunking strategies work
- Renderers - Output format options
- Metadata Reference - Chunk metadata definitions
- Performance Guide - Tuning for speed and memory
- API Reference - Complete API documentation
We welcome contributions! See CONTRIBUTING.md for:
- Development setup
- Code style guidelines
- Testing procedures
- Pull request process
MIT License - see LICENSE for details.
Need help? Check the documentation or open an issue.