Chunkana

A semantic Markdown chunker that preserves document structure for RAG and LLM pipelines. It never splits code blocks, tables, or headers, so every chunk stays semantically complete.

Quick Start

pip install chunkana

from chunkana import chunk_markdown

text = """
# My Document

## Section One
Some content here.

## Section Two
More content with code:

```python
def hello():
    print("Hello!")
```
"""

chunks = chunk_markdown(text)
for chunk in chunks:
    print(f"Lines {chunk.start_line}-{chunk.end_line}: {chunk.metadata['header_path']}")
    print(f"Content: {chunk.content[:100]}...")

What You Get

Structure-safe chunks: never split code blocks, tables, lists, or LaTeX blocks
Useful metadata: header_path, content_type, line ranges, and strategy used
Multiple strategies: automatic selection or manual override
Hierarchy support: navigate a chunk tree or flatten it for indexing
Streaming options: process large files without loading them all into memory
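
To make the "structure-safe" idea concrete, here is a minimal sketch in plain Python. It is independent of Chunkana and is not its actual implementation; the helper names `split_blocks` and `pack_blocks` are hypothetical. The sketch treats everything between a pair of code fences as one atomic block, then packs blocks into chunks without ever cutting inside a block.

```python
FENCE = "`" * 3  # the Markdown code-fence marker


def split_blocks(text: str) -> list[str]:
    """Split Markdown into blocks; a fenced code block is one atomic block."""
    blocks: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            if not in_fence and current:
                # Flush any prose collected before the fence opens.
                blocks.append("\n".join(current))
                current = []
            current.append(line)
            if in_fence:
                # A closing fence ends the atomic code block.
                blocks.append("\n".join(current))
                current = []
            in_fence = not in_fence
        elif not in_fence and not line.strip():
            # A blank line outside a fence ends a prose block.
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            # Inside a fence, blank lines are kept as part of the code.
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def pack_blocks(blocks: list[str], max_size: int) -> list[str]:
    """Greedily pack blocks into chunks; an oversized block stays whole."""
    chunks: list[str] = []
    current = ""
    for block in blocks:
        candidate = f"{current}\n\n{block}" if current else block
        if current and len(candidate) > max_size:
            chunks.append(current)
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

The design choice worth noting is that a block larger than `max_size` is emitted whole instead of being cut, which trades a strict size limit for structural integrity.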

Why Chunkana?

Problem: Traditional splitters break Markdown structure, fragmenting code blocks, tables, and lists.

Solution: Chunkana preserves semantic boundaries while providing rich metadata for retrieval:

✅ Never breaks code blocks, tables, or LaTeX formulas
✅ Preserves hierarchy with header paths like /Introduction/Overview
✅ Rich metadata for filtering, ranking, and context
✅ Streaming support for large documents
✅ Multiple output formats (JSON, Dify-compatible, etc.)
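
As an illustration of how header paths such as /Introduction/Overview can be derived, the sketch below tracks a stack of open headings while scanning the document. It is a stand-alone approximation under stated assumptions, not Chunkana's code: the function name `header_paths` is hypothetical, only `#`-style headings are handled, and lines inside code fences are ignored.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def header_paths(text: str) -> list[tuple[int, str]]:
    """Return (line_number, header_path) for each heading outside code fences."""
    fence = "`" * 3
    stack: list[tuple[int, str]] = []  # (level, title) of the open headings
    paths: list[tuple[int, str]] = []
    in_fence = False
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.lstrip().startswith(fence):
            in_fence = not in_fence
            continue
        if in_fence:
            continue  # a '#' inside code is a comment, not a heading
        match = HEADING.match(line)
        if not match:
            continue
        level, title = len(match.group(1)), match.group(2)
        # Close headings at the same or a deeper level before opening this one.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        paths.append((lineno, "/" + "/".join(t for _, t in stack)))
    return paths
```

A chunk's path is then the path of the nearest heading at or above its first line.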

Key Features

Semantic preservation: Headers, lists, tables, code blocks, and LaTeX stay intact
Smart strategies: Auto-selects optimal chunking approach per document
Hierarchical navigation: Build chunk trees for section-aware retrieval
Overlap metadata: Context continuity without content duplication
Memory efficient: Stream large files without loading everything into RAM
Code-context binding: Keep code with the explanation around it
Adaptive sizing: Optional size tuning based on document complexity
Table grouping: Keep related tables together for better retrieval
Obsidian cleanup: Strip ^block-id references when desired
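
For the Obsidian cleanup feature, the following is a rough sketch of what stripping `^block-id` references can look like. It assumes block IDs consist of letters, digits, and hyphens and appear at the end of a line or on a line of their own; the name `strip_block_ids` is hypothetical, and Chunkana's real rules may differ (for example, this sketch does not skip code blocks).

```python
import re

# A block ID is "^" plus letters, digits, or hyphens, placed either at the
# end of a line (after whitespace) or alone on its own line.
BLOCK_ID = re.compile(r"(?:^|[ \t]+)\^[A-Za-z0-9-]+[ \t]*$", re.MULTILINE)


def strip_block_ids(text: str) -> str:
    """Remove trailing Obsidian-style ^block-id markers from each line."""
    return BLOCK_ID.sub("", text)
```

Because the pattern requires the caret to follow whitespace or start the line, and the ID to end the line, inline text such as `x^2 inside a sentence` is left untouched.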

Usage Examples

Basic Configuration

from chunkana import chunk_markdown, ChunkConfig

config = ChunkConfig(
    max_chunk_size=2048,
    min_chunk_size=256,
    overlap_size=100,
)

chunks = chunk_markdown(text, config)

Content Analysis and Metrics

from chunkana import analyze_markdown, chunk_with_metrics

analysis = analyze_markdown(text)
print(f"Code ratio: {analysis.code_ratio}")

chunks, metrics = chunk_with_metrics(text)
print(f"Average chunk size: {metrics.avg_chunk_size}")

Hierarchical Chunking

from chunkana import MarkdownChunker, ChunkConfig

chunker = MarkdownChunker(ChunkConfig(validate_invariants=True))
result = chunker.chunk_hierarchical(text)

# Get leaf chunks for indexing
flat_chunks = result.get_flat_chunks()

# Navigate the hierarchy
root = result.get_chunk(result.root_id)
children = result.get_children(result.root_id)

Streaming Large Documents

from chunkana import MarkdownChunker

chunker = MarkdownChunker()
for chunk in chunker.chunk_file_streaming("large_document.md"):
    print(f"Chunk {chunk.metadata['chunk_index']}: {chunk.size} chars")
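
The memory benefit of streaming comes from consuming the input lazily. The sketch below shows the general pattern with a plain generator that holds only the current section in memory; it is a simplified illustration, not how `chunk_file_streaming` is implemented, and the name `iter_sections` is hypothetical.

```python
import io
import re
from typing import Iterable, Iterator

HEADING = re.compile(r"#{1,6}\s")


def iter_sections(lines: Iterable[str]) -> Iterator[str]:
    """Yield one section at a time, starting a new section at each heading."""
    fence = "`" * 3
    buffer: list[str] = []
    in_fence = False
    for line in lines:
        if line.lstrip().startswith(fence):
            in_fence = not in_fence
        elif not in_fence and HEADING.match(line) and buffer:
            # A heading outside a code fence closes the previous section.
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    if buffer:
        yield "".join(buffer)


# Any iterable of lines works, including an open file handle.
demo = io.StringIO("# A\ntext\n## B\nmore\n")
for section in iter_sections(demo):
    print(repr(section))
```

Passing an open file object instead of `io.StringIO` gives the same behaviour without reading the whole file first.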

Advanced Configuration Highlights

from chunkana import ChunkConfig
from chunkana.adaptive_sizing import AdaptiveSizeConfig
from chunkana.table_grouping import TableGroupingConfig

config = ChunkConfig(
    max_chunk_size=4096,
    overlap_size=200,
    enable_code_context_binding=True,
    preserve_latex_blocks=True,
    strip_obsidian_block_ids=True,
    use_adaptive_sizing=True,
    adaptive_config=AdaptiveSizeConfig(base_size=1500, code_weight=0.4),
    group_related_tables=True,
    table_grouping_config=TableGroupingConfig(max_distance_lines=10),
)

Output Formats

from chunkana import chunk_markdown
from chunkana.renderers import render_json, render_dify_style

chunks = chunk_markdown(text)

# JSON format
json_output = render_json(chunks)

# Dify-compatible format
dify_output = render_dify_style(chunks)

Core API Surface

Primary convenience functions:

chunk_markdown(text, config=None) → List[Chunk]
chunk_hierarchical(text, config=None) → HierarchicalChunkingResult
chunk_file(path, config=None) / chunk_file_streaming(path, config=None)
analyze_markdown(text, config=None) → ContentAnalysis
chunk_with_metrics(text, config=None) → (List[Chunk], ChunkingMetrics)
iter_chunks(text, config=None) → Iterator[Chunk]

Metadata Schema

Each chunk includes rich metadata for retrieval:

{
    "content": "# Section\nContent here...",
    "start_line": 1,
    "end_line": 10,
    "size": 156,
    "metadata": {
        "chunk_index": 0,
        "content_type": "section",
        "header_path": "/Introduction/Overview",
        "header_level": 2,
        "strategy": "structural",
        "has_code": false,
        "overlap_size": 100
    }
}
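
One way to use this metadata at retrieval time is to filter chunks before ranking them. The sketch below works on plain dictionaries shaped like the schema above; the helper `filter_chunks` and the sample values are illustrative assumptions, not part of Chunkana.

```python
def filter_chunks(chunks, *, path_prefix="", has_code=None):
    """Keep chunks whose header_path starts with path_prefix.

    If has_code is True or False, also require a matching has_code flag.
    """
    selected = []
    for chunk in chunks:
        meta = chunk["metadata"]
        if not meta["header_path"].startswith(path_prefix):
            continue
        if has_code is not None and meta["has_code"] != has_code:
            continue
        selected.append(chunk)
    return selected


# Made-up sample data following the schema shown above.
sample = [
    {"content": "# Introduction\n...",
     "metadata": {"header_path": "/Introduction", "has_code": False}},
    {"content": "## Overview\n...",
     "metadata": {"header_path": "/Introduction/Overview", "has_code": True}},
    {"content": "# Usage\n...",
     "metadata": {"header_path": "/Usage", "has_code": True}},
]

hits = filter_chunks(sample, path_prefix="/Introduction", has_code=True)
print([c["metadata"]["header_path"] for c in hits])  # ['/Introduction/Overview']
```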

Requirements

Python 3.12+
No external dependencies for core functionality
Optional: pip install "chunkana[docs]" for documentation tools

Integrations

Dify: Direct compatibility with Dify workflows
n8n: Automation pipeline integration
Windmill: Batch processing workflows

Documentation

Quick Start Guide - Get started in minutes
Configuration - All configuration options
Strategies - How chunking strategies work
Renderers - Output format options
Metadata Reference - Chunk metadata definitions
Performance Guide - Tuning for speed and memory
API Reference - Complete API documentation

Contributing

We welcome contributions! See CONTRIBUTING.md for:

Development setup
Code style guidelines
Testing procedures
Pull request process

License

MIT License - see LICENSE for details.

Need help? Check the documentation or open an issue.
