
# RAG Chunking with Metadata for Beginners 📚

> Learn how to split documents into smart chunks for Retrieval-Augmented Generation (RAG) systems

Built with Python, OpenAI, and LanceDB.

## 🎯 What is RAG Chunking?

RAG (Retrieval-Augmented Generation) is a technique that helps AI models answer questions by first finding relevant information from a database. To make this work, we need to break large documents into smaller, manageable pieces called chunks.

This repository teaches you how to:

- 📄 Extract text from PDFs with multiple articles
- 🔍 Identify individual articles using AI
- ✂️ Split text into optimal chunks for searching
- 🔢 Convert text to vectors (numbers) for semantic search
- 💾 Store everything in a vector database with metadata

## 🚀 Why Chunking Matters in RAG

When you have a large PDF with multiple articles:

- **Without chunking:** the AI tries to process everything at once (slow and expensive)
- **With smart chunking:** the AI only processes relevant pieces (fast and accurate)

## 📚 What You'll Learn

### 1. Document Processing Pipeline

PDF → Text Extraction → Article Identification → Chunking → Vectorization → Storage

### 2. Chunking Strategy

- **Chunk size:** 2048 characters (a good balance for most LLMs)
- **Overlap:** 300 characters (maintains context between chunks)
- **Small-chunk handling:** merge chunks under 400 characters into a neighbor
- **Large-chunk splitting:** force a new chunk after 1600 characters
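As a concrete illustration, here is a minimal character-based chunker implementing the overlap and small-chunk-merge rules above. This is a hypothetical sketch, not the repository's actual `chunker.py`:

```python
def chunk_text(text, chunk_size=2048, overlap=300, min_chunk=400):
    """Split text into overlapping character chunks.

    Illustrative sketch of the strategy above: fixed-size chunks that
    share `overlap` characters with their neighbor, merging a too-small
    trailing chunk back into the previous one.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so neighbors share context

    # Merge a tiny final chunk into the previous one
    if len(chunks) > 1 and len(chunks[-1]) < min_chunk:
        chunks[-2] = chunks[-2] + chunks[-1][overlap:]
        chunks.pop()
    return chunks

chunks = chunk_text("a" * 5000)
print([len(c) for c in chunks])  # → [2048, 2048, 1504]
```

Note how each chunk after the first begins 300 characters before the previous one ended, so a sentence cut at a boundary still appears whole in one of the two chunks.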

### 3. Metadata Enrichment

Each chunk gets tagged with:

- Source article information
- Publication date
- Themes and topics
- Original position in the document
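For instance, one enriched chunk might look like the dictionary below. The field names are illustrative, not the repository's exact schema:

```python
# Hypothetical shape of one metadata-enriched chunk
chunk = {
    "text": "Consumer prices rose 3.2% in March, slightly below forecasts.",
    "article_title": "Inflation Cools Slightly",   # source article info
    "source": "Bloomberg",
    "published": "2024-03-12",                      # publication date
    "themes": ["economy", "inflation"],             # themes and topics
    "chunk_index": 4,                               # position within the article
    "char_start": 8192,                             # offset in the original document
}

print(chunk["article_title"])
```

Because every chunk carries these fields, search results can always be traced back to the article, date, and position they came from.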

## 🎮 Simple Example

```python
from pdf_processor import PDFArticleProcessor

# Process a PDF with multiple news articles
processor = PDFArticleProcessor("news_compilation.pdf")
chunks_table, metadata_table = processor.process()

# Now you can search semantically!
query = "articles about inflation"
results = chunks_table.search(query).limit(5).to_list()

for result in results:
    print(f"Found in: {result['article_title']}")
    print(f"Relevance: {result['_distance']}")
```

๐Ÿ“ How It Works (Step by Step)

### Step 1: Extract Text from PDF

```python
# Open the PDF and extract all of its text
text = extract_text_from_pdf("document.pdf")
```

### Step 2: Identify Individual Articles

```python
# The AI identifies where each article starts and ends
articles = extract_articles_metadata(text)
# Returns: title, source, date, summary, and themes for each article
```

### Step 3: Create Smart Chunks

```python
# Split the text into optimal chunks with overlap
chunks = create_chunks_with_metadata(text, articles)
# Each chunk knows which article it belongs to
```

### Step 4: Generate Embeddings

```python
# Convert text to vectors (numbers) for similarity search
embeddings = generate_embeddings(chunk_texts)
# Each chunk gets 1536 numbers representing its meaning
```

### Step 5: Store in Vector Database

```python
# Save to LanceDB for fast retrieval
upload_to_lancedb(chunks, embeddings, metadata)
```

๐Ÿ” Types of Search You Can Do

### Semantic Search

Find content by meaning, not exact keywords:

```python
# Finds related content even without exact word matches
search("inflation")
# Also finds: "rising prices", "cost increases", "economic pressure"
```

### Filtered Search

Combine semantic search with metadata filters:

```python
# Only search in Bloomberg articles from March
search("inflation", filters={"source": "Bloomberg", "month": "March"})
```
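Conceptually, a filtered search applies the metadata predicates first and ranks only the survivors by relevance. Here is a toy version over plain dicts; the function and field names are illustrative, and a real vector database such as LanceDB would apply such filters internally:

```python
def filtered_search(chunks, score_fn, filters, limit=5):
    """Toy filtered search: keep chunks matching every metadata filter,
    then rank the survivors by a relevance score (higher is better)."""
    matching = [
        c for c in chunks
        if all(c.get(key) == value for key, value in filters.items())
    ]
    return sorted(matching, key=score_fn, reverse=True)[:limit]

chunks = [
    {"text": "Prices rose sharply", "source": "Bloomberg", "month": "March"},
    {"text": "Tech layoffs continue", "source": "Reuters", "month": "March"},
    {"text": "Inflation eases", "source": "Bloomberg", "month": "April"},
]

# Stand-in scorer; a real system would use embedding similarity here
hits = filtered_search(
    chunks,
    score_fn=lambda c: "prices" in c["text"].lower(),
    filters={"source": "Bloomberg", "month": "March"},
)
print([h["text"] for h in hits])  # → ['Prices rose sharply']
```

Filtering before ranking is also why metadata matters for speed: the expensive similarity comparison only runs on chunks that pass the cheap metadata checks.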

## 💡 Key Concepts for Beginners

### What are Vectors?

- Text → numbers that represent its meaning
- Similar meanings = similar numbers
- Computers can compare numbers much faster than raw text
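"Similar numbers" is usually measured with cosine similarity. A tiny sketch with made-up 3-number "embeddings" (real embeddings have ~1536 numbers, as noted above):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Made-up toy vectors; only their relative directions matter
inflation     = [0.9, 0.1, 0.0]
rising_prices = [0.8, 0.2, 0.1]
soccer        = [0.0, 0.1, 0.9]

print(cosine_similarity(inflation, rising_prices))  # close to 1.0 (similar)
print(cosine_similarity(inflation, soccer))         # close to 0.0 (unrelated)
```

This single number is what the vector database computes between your query's vector and every stored chunk's vector when ranking results.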

### What is Metadata?

- Extra information about each chunk
- Like labels on a file folder
- Helps filter and organize search results

### Why Use Overlap?

- Chunks share 300 characters with their neighbors
- Prevents losing context at chunk boundaries
- Like overlapping photos in a panorama

๐Ÿ› ๏ธ Installation

  1. Clone this repository:
git clone https://github.com/yourusername/rag-chunking-for-beginners.git
cd rag-chunking-for-beginners
  1. Install dependencies:
pip install -r requirements.txt
  1. Set OpenAI API key:
export OPENAI_API_KEY="your-api-key"
  1. Run example:
python examples/basic_usage.py

## 📊 What to Expect

For a typical PDF with news articles:

- **Input:** 1 PDF with 5-10 articles
- **Output:** 50-100 searchable chunks
- **Processing time:** 30-60 seconds
- **Storage:** ~5 MB in the vector database
- **Search speed:** <100 ms per query

๐Ÿ“ Repository Structure

rag-chunking-for-beginners/
โ”œโ”€โ”€ README.md              # This file
โ”œโ”€โ”€ README_PT.md           # Portuguese explanation
โ”œโ”€โ”€ requirements.txt       # Python dependencies
โ”œโ”€โ”€ src/                   # Source code
โ”‚   โ”œโ”€โ”€ pdf_processor.py  # Main processing pipeline
โ”‚   โ”œโ”€โ”€ chunker.py        # Chunking logic
โ”‚   โ””โ”€โ”€ vectorizer.py     # Embedding generation
โ”œโ”€โ”€ examples/              # Example code
โ”‚   โ”œโ”€โ”€ basic_usage.py    # Simple example
โ”‚   โ”œโ”€โ”€ semantic_search.py # Search examples
โ”‚   โ””โ”€โ”€ sample_pdf/       # Sample PDFs to test
โ””โ”€โ”€ docs/                  # Additional documentation
    โ”œโ”€โ”€ chunking_strategies.md
    โ””โ”€โ”€ metadata_best_practices.md

## 🎓 Learning Resources

## 🚦 Common Use Cases

- **News aggregation:** process multiple news sources
- **Research papers:** extract and search academic content
- **Legal documents:** find specific clauses or terms
- **Documentation:** create searchable knowledge bases
- **Customer support:** build Q&A systems from FAQs

๐Ÿค Contributing

This is a learning repository! Contributions that help beginners are especially welcome:

  • Simplify explanations
  • Add more examples
  • Translate to other languages
  • Fix bugs or improve code clarity

## 📄 License

MIT License - use freely for learning and projects!

โ“ FAQ

Q: What's the optimal chunk size? A: 2048 characters works well for most cases, balancing context and processing cost.

Q: Why not just use the whole document? A: Large documents exceed LLM token limits and are expensive to process.

Q: How is this different from simple text search? A: Semantic search understands meaning, not just keywords.

Q: Can I use this for languages other than English? A: Yes! Just adjust the prompts and language settings.


โญ Star this repo if it helped you understand RAG chunking!

๐Ÿ“š Part of the "AI Concepts for Beginners" series

No packages published