Learn how to split documents into smart chunks for Retrieval-Augmented Generation (RAG) systems
RAG (Retrieval-Augmented Generation) is a technique that helps AI models answer questions by first finding relevant information from a database. To make this work, we need to break large documents into smaller, manageable pieces called chunks.
This repository teaches you how to:
- 📄 Extract text from PDFs with multiple articles
- 🔍 Identify individual articles using AI
- ✂️ Split text into optimal chunks for searching
- 🔢 Convert text to vectors (numbers) for semantic search
- 💾 Store everything in a vector database with metadata
When you have a large PDF with multiple articles:
- Without chunking: AI tries to process everything at once (slow and expensive)
- With smart chunking: AI only processes relevant pieces (fast and accurate)
PDF → Text Extraction → Article Identification → Chunking → Vectorization → Storage
- Chunk Size: 2048 characters (optimal for most LLMs)
- Overlap: 300 characters (maintains context between chunks)
- Small Chunk Handling: Merge chunks under 400 characters
- Large Chunk Splitting: Force new chunks after 1600 characters
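The strategy above can be sketched as a simple sliding-window chunker. This is a minimal illustration of the idea, not the repository's actual `chunker.py`; the function name and merge logic are assumptions:

```python
def sliding_window_chunks(text: str, size: int = 2048,
                          overlap: int = 300, min_size: int = 400) -> list[str]:
    """Split text into overlapping windows of `size` characters.

    Consecutive chunks share `overlap` characters; a final chunk shorter
    than `min_size` is merged into (or dropped into) the previous one.
    """
    step = size - overlap  # advance 1748 characters per chunk with defaults
    chunks = [text[i:i + size] for i in range(0, len(text), step)]
    if len(chunks) > 1 and len(chunks[-1]) < min_size:
        tail = chunks.pop()
        if len(tail) > overlap:
            # Append only the part the previous chunk hasn't seen yet
            chunks[-1] += tail[overlap:]
    return chunks

chunks = sliding_window_chunks("a" * 5000)
# → 3 chunks of 2048, 2048 and 1504 characters
```

Because each chunk starts 1748 characters after the previous one, the last 300 characters of one chunk are repeated at the start of the next, so a sentence that straddles a boundary is always intact in at least one chunk.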
Each chunk gets tagged with:
- Source article information
- Publication date
- Themes and topics
- Original position in document
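In practice each chunk can be stored as a plain record alongside its text. The field names and values below are illustrative, not the repository's exact schema:

```python
# One chunk plus its metadata, ready for upload to the vector database
chunk_record = {
    "text": "Inflation cooled to 3.2% in March...",  # the chunk itself
    "article_title": "Inflation Slows Again",        # source article information
    "source": "Bloomberg",
    "published": "2024-03-15",                       # publication date
    "themes": ["inflation", "monetary policy"],      # themes and topics
    "char_start": 10240,                             # original position
    "char_end": 12288,                               #   in the document
}
```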
```python
from pdf_processor import PDFArticleProcessor

# Process a PDF with multiple news articles
processor = PDFArticleProcessor("news_compilation.pdf")
chunks_table, metadata_table = processor.process()

# Now you can search semantically!
query = "articles about inflation"
results = chunks_table.search(query).limit(5).to_list()

for result in results:
    print(f"Found in: {result['article_title']}")
    print(f"Relevance: {result['_distance']}")
```

```python
# Opens PDF and extracts all text
text = extract_text_from_pdf("document.pdf")
```

```python
# AI identifies where each article starts and ends
articles = extract_articles_metadata(text)
# Returns: title, source, date, summary, themes for each article
```

```python
# Splits text into optimal chunks with overlap
chunks = create_chunks_with_metadata(text, articles)
# Each chunk knows which article it belongs to
```

```python
# Convert text to vectors (numbers) for similarity search
embeddings = generate_embeddings(chunk_texts)
# Each chunk gets 1536 numbers representing its meaning
```

```python
# Save to LanceDB for fast retrieval
upload_to_lancedb(chunks, embeddings, metadata)
```

Find content by meaning, not exact keywords:

```python
# Finds related content even without exact word matches
search("inflation")
# Also finds: "rising prices", "cost increases", "economic pressure"
```

Combine semantic search with metadata filters:

```python
# Only search in Bloomberg articles from March
search("inflation", filters={"source": "Bloomberg", "month": "March"})
```

- Text → Numbers that represent meaning
- Similar meanings = Similar numbers
- Computers can compare numbers faster than text
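"Similar meanings = similar numbers" is usually measured with cosine similarity between embedding vectors. Here is a minimal version using made-up 3-dimensional vectors (the real embeddings in this project have 1536 dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

inflation     = [0.9, 0.1, 0.3]  # toy embedding for "inflation"
rising_prices = [0.8, 0.2, 0.3]  # toy embedding for "rising prices"
football      = [0.1, 0.9, 0.1]  # toy embedding for "football"

cosine_similarity(inflation, rising_prices)  # high, ~0.99
cosine_similarity(inflation, football)       # low,  ~0.24
```

A vector database does essentially this comparison, but across thousands of chunks at once using optimized indexes.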
- Extra information about each chunk
- Like labels on a file folder
- Helps filter and organize search results
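Filtering on metadata boils down to keeping only the chunks whose labels match before (or after) vector ranking. A toy version over plain dicts, with hypothetical field names (a real vector database such as LanceDB applies equivalent filters inside the query itself):

```python
def filter_chunks(chunks, **filters):
    """Keep only chunks whose metadata matches every filter exactly."""
    return [c for c in chunks
            if all(c.get(k) == v for k, v in filters.items())]

chunks = [
    {"text": "CPI rose 0.4% in March...", "source": "Bloomberg", "month": "March"},
    {"text": "Rate-cut bets fade...", "source": "Reuters", "month": "March"},
    {"text": "Energy prices climb...", "source": "Bloomberg", "month": "April"},
]

filter_chunks(chunks, source="Bloomberg", month="March")
# → only the first chunk survives
```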
- Chunks share 300 characters with neighbors
- Prevents losing context at boundaries
- Like overlapping photos in a panorama
- Clone this repository:
```bash
git clone https://github.com/yourusername/rag-chunking-for-beginners.git
cd rag-chunking-for-beginners
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Set your OpenAI API key:

```bash
export OPENAI_API_KEY="your-api-key"
```

- Run the example:

```bash
python examples/basic_usage.py
```

For a typical PDF with news articles:
- Input: 1 PDF with 5-10 articles
- Output: 50-100 searchable chunks
- Processing Time: 30-60 seconds
- Storage: ~5MB in vector database
- Search Speed: <100ms per query
```
rag-chunking-for-beginners/
├── README.md                  # This file
├── README_PT.md               # Portuguese explanation
├── requirements.txt           # Python dependencies
├── src/                       # Source code
│   ├── pdf_processor.py       # Main processing pipeline
│   ├── chunker.py             # Chunking logic
│   └── vectorizer.py          # Embedding generation
├── examples/                  # Example code
│   ├── basic_usage.py         # Simple example
│   ├── semantic_search.py     # Search examples
│   └── sample_pdf/            # Sample PDFs to test
└── docs/                      # Additional documentation
    ├── chunking_strategies.md
    └── metadata_best_practices.md
```
- What is RAG? - Introduction to Retrieval-Augmented Generation
- Chunking Strategies - Different ways to split documents
- Vector Databases 101 - Understanding vector storage
- Metadata Best Practices - How to enrich your chunks
- News Aggregation: Process multiple news sources
- Research Papers: Extract and search academic content
- Legal Documents: Find specific clauses or terms
- Documentation: Create searchable knowledge bases
- Customer Support: Build Q&A systems from FAQs
This is a learning repository! Contributions that help beginners are especially welcome:
- Simplify explanations
- Add more examples
- Translate to other languages
- Fix bugs or improve code clarity
MIT License - Use freely for learning and projects!
Q: What's the optimal chunk size? A: 2048 characters works well for most cases, balancing context and processing cost.
Q: Why not just use the whole document? A: Large documents exceed LLM token limits and are expensive to process.
Q: How is this different from simple text search? A: Semantic search understands meaning, not just keywords.
Q: Can I use this for languages other than English? A: Yes! Just adjust the prompts and language settings.
⭐ Star this repo if it helped you understand RAG chunking!
📚 Part of the "AI Concepts for Beginners" series