
Advanced RAG Strategies - Complete Guide

A comprehensive resource for understanding and implementing advanced Retrieval-Augmented Generation strategies.

This repository demonstrates 16 RAG strategies with pseudocode examples, educational code implementations, and detailed strategy guides.

Perfect for: AI engineers, ML practitioners, data science students, and anyone building RAG systems.


πŸ“š Table of Contents

  1. Strategy Overview
  2. Quick Start
  3. Prerequisites ⭐
  4. For Students & Learners ⭐ NEW
  5. Pseudocode Examples
  6. Code Examples
  7. Detailed Strategy Guide
  8. Repository Structure
  9. Testing ⭐ NEW
  10. Troubleshooting ⭐ NEW

🎯 Strategy Overview

| # | Strategy | Status | Use Case | Pros | Cons |
|---|----------|--------|----------|------|------|
| 1 | Re-ranking | βœ… Code Example | Precision-critical | Highly accurate results | Slower, more compute |
| 2 | Agentic RAG | βœ… Code Example | Flexible retrieval needs | Autonomous tool selection | More complex logic |
| 3 | Knowledge Graphs | πŸ“ Pseudocode Only | Relationship-heavy | Captures connections | Infrastructure overhead |
| 4 | Contextual Retrieval | βœ… Code Example | Critical documents | 35-49% better accuracy | High ingestion cost |
| 5 | Query Expansion | βœ… Code Example | Ambiguous queries | Better recall, multiple perspectives | Extra LLM call, higher cost |
| 6 | Multi-Query RAG | βœ… Code Example | Broad searches | Comprehensive coverage | Multiple API calls |
| 7 | Context-Aware Chunking | βœ… Code Example | All documents | Semantic coherence | Slightly slower ingestion |
| 8 | Late Chunking | πŸ“ Pseudocode Only | Context preservation | Full document context | Requires long-context models |
| 9 | Hierarchical RAG | πŸ“ Pseudocode Only | Complex documents | Precision + context | Complex setup |
| 10 | Self-Reflective RAG | βœ… Code Example | Research queries | Self-correcting | Highest latency |
| 11 | Fine-tuned Embeddings | πŸ“ Pseudocode Only | Domain-specific | Best accuracy | Training required |
| 12 | Hybrid Retrieval | βœ… Code Example | Keyword-sensitive | Balanced recall | More complex infra |
| 13 | Fact Verification | βœ… Code Example | High-stakes domains | Traceability | Higher latency |
| 14 | Multi-hop Reasoning | βœ… Code Example | Complex questions | Solves compositional queries | Expensive |
| 15 | Uncertainty Estimation | βœ… Code Example | Risk-sensitive apps | Trustworthy outputs | More compute |
| 16 | Adaptive Chunking | βœ… Code Example | Heterogeneous docs | Better precision | Complex ingestion |

Legend

  • βœ… Code Example: Full code in implementation/ (educational, not production-ready)
  • πŸ“ Pseudocode Only: Short concept-only example in examples/ (no full implementation)

πŸ“‹ Prerequisites

System Requirements

  • Python 3.9+
  • PostgreSQL with pgvector extension
    • Cloud options: Neon, Supabase
    • Self-hosted: PostgreSQL 12+ with pgvector

System Dependencies

Ubuntu/Debian:

sudo apt-get update && sudo apt-get install -y \
    ffmpeg \
    build-essential \
    gcc \
    postgresql-client \
    libpq-dev

macOS:

brew install ffmpeg postgresql
# Xcode Command Line Tools (includes gcc, build tools)
xcode-select --install

Windows:

Install the equivalents of the packages below (ffmpeg, PostgreSQL client tools, and a C compiler) with your preferred package manager, such as Chocolatey or winget.

What each package does:

  • ffmpeg - Audio/video processing for Whisper transcription
  • build-essential & gcc - Compilers for building Python packages (psycopg2, etc.)
  • postgresql-client - PostgreSQL command-line tools (psql)
  • libpq-dev - PostgreSQL development headers for psycopg2

API Keys

  • OPENAI_API_KEY - used for embeddings (text-embedding-3-small) and the LLM (GPT-4o-mini); set it in .env (see Quick Start below)

πŸš€ Quick Start

1. View Pseudocode Examples (No Setup Required)

cd examples
# Browse simple, < 50 line examples for each strategy
cat 01_reranking.py

2. Interactive Strategy Lab (Recommended)

Try strategies side-by-side with a web UI!

cd implementation

# Install dependencies
pip install -r requirements-advanced.txt

# Setup environment
cp .env.example .env
# Edit .env: Add DATABASE_URL and OPENAI_API_KEY

# Initialize database
psql $DATABASE_URL < sql/schema.sql

# Ingest sample documents
python -m ingestion.ingest --documents ./documents

# Launch Streamlit app
streamlit run app.py

The app opens at http://localhost:8501

Features:

  • πŸ§ͺ Strategy Lab: Compare up to 3 strategies side-by-side
  • πŸ“Š Visual Metrics: See latency, tokens, and costs for each approach
  • πŸ“ File Upload: Test with your own documents
  • πŸ’‘ Educational: Tooltips explain each strategy's trade-offs

Quick guide: docs/implementation/QUICK_START.md

3. CLI Agent (Command Line)

Note: These are educational examples that show how the strategies work in real code; they are not guaranteed to be fully functional or production-ready.

cd implementation

# Install dependencies (if not done above)
pip install -r requirements-advanced.txt

# Setup environment
cp .env.example .env
# Edit .env: Add DATABASE_URL and OPENAI_API_KEY

# Ingest documents
python -m ingestion.ingest --documents ./documents --chunker adaptive

# Run the advanced agent
python rag_agent_advanced.py

πŸŽ“ For Students & Learners

New to RAG? Start here!

This repository includes comprehensive learning resources:

πŸ“– Student Guide

docs/guides/STUDENT_GUIDE.md - Your complete learning path:

  • Structured 9-week curriculum from beginner to advanced
  • Core concepts explained with examples
  • Practical exercises and project ideas
  • Common pitfalls and how to avoid them
  • Production deployment considerations

Quick learning path:

Week 1-2:  Basics (chunking, embeddings, vector search)
Week 3-4:  Query enhancement (expansion, multi-query)
Week 5-6:  Advanced retrieval (hybrid, reranking, self-reflective)
Week 7-8:  Generation enhancement (fact verification, multi-hop)
Week 9+:   Specialized topics (knowledge graphs, fine-tuning)

πŸ”§ Troubleshooting Guide

docs/guides/TROUBLESHOOTING.md - Solutions to common issues:

  • Setup problems (dependencies, database, API keys)
  • Ingestion errors (file processing, embeddings, memory)
  • Retrieval issues (no results, low relevance, slow queries)
  • Agent problems (hallucinations, tool calling)
  • Testing and debugging

πŸ“ Test Paper

LoRA-SHIFT Research Paper - A comprehensive test document:

  • 19,000+ characters of technical content
  • Structured research paper format
  • Perfect for testing RAG strategies
  • Includes abstract, methodology, results, and appendices

πŸ’» Pseudocode Examples

All strategies have simple, working pseudocode examples in examples/.

Each file is < 50 lines and demonstrates:

  • Core concept
  • How to implement with Pydantic AI
  • Integration with PG Vector

Example (05_query_expansion.py):

from pydantic_ai import Agent
import psycopg2
from pgvector.psycopg2 import register_vector

agent = Agent('openai:gpt-4o', system_prompt='RAG assistant with query expansion')

@agent.tool
def expand_query(query: str) -> list[str]:
    """Expand single query into multiple variations"""
    expansion_prompt = f"Generate 3 variations of: '{query}'"
    variations = llm_generate(expansion_prompt)
    return [query] + variations

@agent.tool
def search_knowledge_base(queries: list[str]) -> str:
    """Search vector DB with multiple queries"""
    all_results = []
    for query in queries:
        query_embedding = get_embedding(query)
        results = db.query('SELECT * FROM chunks ORDER BY embedding <=> %s', query_embedding)
        all_results.extend(results)
    return deduplicate(all_results)

Browse all pseudocode: examples/README.md


πŸ—οΈ Code Examples

⚠️ Important Note: The implementation/ folder contains educational code examples based on a real implementation; it is not production-ready. The strategies are included to demonstrate concepts and show how they work in real code. They are not guaranteed to be fully working, and bundling every strategy into one codebase is not ideal for production (which is why this has not been refined for production use). Treat these as learning references and starting points, think of the repo as an "off-the-shelf RAG implementation" with strategies added for demonstration, and use it as inspiration for your own production systems.

Architecture

implementation/
├── rag_agent_advanced.py          # Agent with all strategy examples
├── ingestion/
│   ├── ingest.py                  # Document ingestion pipeline
│   ├── chunker.py                 # Context-aware chunking (Docling)
│   ├── embedder.py                # OpenAI embeddings
│   └── contextual_enrichment.py   # Anthropic's contextual retrieval
├── utils/
│   ├── db_utils.py                # Database utilities
│   └── models.py                  # Pydantic models
└── IMPLEMENTATION_GUIDE.md        # Detailed implementation reference

Tech Stack:

  • Pydantic AI - Agent framework
  • PostgreSQL + pgvector - Vector search
  • Docling - Hybrid chunking
  • OpenAI - Embeddings and LLM

πŸ“– Detailed Strategy Guide

βœ… Code Examples (Educational)


1. Re-ranking

Status: βœ… Code Example

File: rag_agent_advanced.py (Lines 194-256)

What It Is

Two-stage retrieval: a fast vector search first retrieves a broad candidate set (20-50+ chunks), then a reranking model filters them down to the top 5.

Pros & Cons

βœ… Significantly better precision, more knowledge considered without overwhelming LLM

❌ Slightly slower than pure vector search, uses more compute

Code Example

# Lines 194-256 in rag_agent_advanced.py
async def search_with_reranking(ctx: RunContext[None], query: str, limit: int = 5) -> str:
    """Two-stage retrieval with cross-encoder re-ranking."""
    initialize_reranker()  # Loads cross-encoder/ms-marco-MiniLM-L-6-v2

    # Stage 1: Fast vector retrieval (retrieve 20 candidates)
    candidate_limit = min(limit * 4, 20)
    results = await vector_search(query, candidate_limit)

    # Stage 2: Re-rank with cross-encoder
    pairs = [[query, row['content']] for row in results]
    scores = reranker.predict(pairs)

    # Sort by new scores and return top N
    reranked = sorted(zip(results, scores), key=lambda x: x[1], reverse=True)[:limit]
    return format_results(reranked)

Model: cross-encoder/ms-marco-MiniLM-L-6-v2



2. Agentic RAG

Status: βœ… Code Example

File: rag_agent_advanced.py (Lines 263-354)

What It Is

The agent autonomously chooses between multiple retrieval tools, for example:

  1. search_knowledge_base() - Semantic search over chunks (can include hybrid search: dense vector + sparse keyword/BM25)
  2. retrieve_full_document() - Pull entire documents when chunks aren't enough

Note: Hybrid search (combining dense vector embeddings with sparse keyword search like BM25) is typically implemented as part of the agentic retrieval strategy, giving the agent access to both semantic similarity and keyword matching.
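
Hybrid retrieval isn't shown as its own snippet in this README, so here is a minimal sketch of the idea under assumed names (a chunks table with id, content, and a pgvector embedding column; Postgres full-text search standing in for BM25), with the two result lists merged by Reciprocal Rank Fusion. It is illustrative only, not the repository's implementation:

import numpy as np
from pgvector.psycopg2 import register_vector

def hybrid_search(conn, query: str, query_embedding: np.ndarray, limit: int = 5) -> list[str]:
    """Dense (pgvector) + sparse (Postgres full-text) retrieval fused with RRF."""
    register_vector(conn)
    with conn.cursor() as cur:
        # Dense leg: cosine distance over chunk embeddings
        cur.execute(
            "SELECT id, content FROM chunks ORDER BY embedding <=> %s LIMIT 20",
            (query_embedding,),
        )
        dense = cur.fetchall()

        # Sparse leg: keyword match via full-text search (a stand-in for BM25)
        cur.execute(
            """SELECT id, content FROM chunks
               WHERE to_tsvector('english', content) @@ plainto_tsquery('english', %s)
               ORDER BY ts_rank(to_tsvector('english', content),
                                plainto_tsquery('english', %s)) DESC
               LIMIT 20""",
            (query, query),
        )
        sparse = cur.fetchall()

    # Reciprocal Rank Fusion: reward chunks ranked highly by either retriever
    fused = {}
    for results in (dense, sparse):
        for rank, (chunk_id, content) in enumerate(results):
            entry = fused.setdefault(chunk_id, {"content": content, "score": 0.0})
            entry["score"] += 1.0 / (60 + rank)  # k=60 is a common RRF constant

    ranked = sorted(fused.values(), key=lambda e: e["score"], reverse=True)
    return [e["content"] for e in ranked[:limit]]

A production system would typically replace the full-text leg with a dedicated BM25 index, but the fusion step stays the same.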

Pros & Cons

βœ… Flexible, adapts to query needs automatically

❌ More complex, less predictable behavior

Code Example

# Tool 1: Semantic search (Lines 263-305)
@agent.tool
async def search_knowledge_base(query: str, limit: int = 5) -> str:
    """Standard semantic search over document chunks."""
    query_embedding = await embedder.embed_query(query)
    results = await db.match_chunks(query_embedding, limit)
    return format_results(results)

# Tool 2: Full document retrieval (Lines 308-354)
@agent.tool
async def retrieve_full_document(document_title: str) -> str:
    """Retrieve complete document when chunks lack context."""
    result = await db.query(
        "SELECT title, content FROM documents WHERE title ILIKE %s",
        f"%{document_title}%"
    )
    return f"**{result['title']}**\n\n{result['content']}"

Example Flow:

User: "What's the full refund policy?"
Agent:
  1. Calls search_knowledge_base("refund policy")
  2. Finds chunks mentioning "refund_policy.pdf"
  3. Calls retrieve_full_document("refund policy")
  4. Returns complete document



3. Knowledge Graphs

Status: πŸ“ Pseudocode Only (Graphiti)

Why not in code examples: Requires Neo4j infrastructure, entity extraction

What It Is

Combines vector search with graph databases (such as Neo4j/FalkorDB) to capture entity relationships.

Pros & Cons

βœ… Captures relationships vectors miss, great for interconnected data

❌ Requires Neo4j setup, entity extraction, graph maintenance, slower and more expensive

Pseudocode Concept (Graphiti)

# From 03_knowledge_graphs.py (with Graphiti)
from graphiti_core import Graphiti
from graphiti_core.nodes import EpisodeType

# Initialize Graphiti (connects to Neo4j)
graphiti = Graphiti("neo4j://localhost:7687", "neo4j", "password")

async def ingest_document(text: str, source: str):
    """Ingest document into Graphiti knowledge graph."""
    # Graphiti automatically extracts entities and relationships
    await graphiti.add_episode(
        name=source,
        episode_body=text,
        source=EpisodeType.text,
        source_description=f"Document: {source}"
    )

@agent.tool
async def search_knowledge_graph(query: str) -> str:
    """Hybrid search: semantic + keyword + graph traversal."""
    # Graphiti combines:
    # - Semantic similarity (embeddings)
    # - BM25 keyword search
    # - Graph structure traversal
    # - Temporal context (when was this true?)

    results = await graphiti.search(query=query, num_results=5)

    return format_graph_results(results)

Framework: Graphiti from Zep - Temporal knowledge graphs for agents



4. Contextual Retrieval

Status: βœ… Code Example (Optional)

File: ingestion/contextual_enrichment.py (Lines 41-89)

What It Is

Anthropic's method: Adds document-level context to each chunk before embedding. LLM generates 1-2 sentences explaining what the chunk discusses in relation to the whole document.

Pros & Cons

βœ… 35-49% reduction in retrieval failures, chunks are self-contained

❌ Expensive (1 LLM call per chunk), slower ingestion

Before/After Example

BEFORE:
"Clean data is essential. Remove duplicates, handle missing values..."

AFTER:
"This chunk from 'ML Best Practices' discusses data preparation techniques
for machine learning workflows.

Clean data is essential. Remove duplicates, handle missing values..."

Code Example

# Lines 41-89 in contextual_enrichment.py
async def enrich_chunk(chunk: str, document: str, title: str) -> str:
    """Add contextual prefix to a chunk."""
    prompt = f"""<document>
Title: {title}
{document[:4000]}
</document>

<chunk>
{chunk}
</chunk>

Provide brief context explaining what this chunk discusses.
Format: "This chunk from [title] discusses [explanation]." """

    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=150
    )

    context = response.choices[0].message.content.strip()
    return f"{context}\n\n{chunk}"

Enable with: python -m ingestion.ingest --documents ./docs --contextual



5. Query Expansion

Status: βœ… Code Example

File: rag_agent_advanced.py (Lines 72-107)

What It Is

Expands a single brief query into a more detailed, comprehensive version by adding context, related terms, and clarifying intent. Uses an LLM with a system prompt that describes how to enrich the query while maintaining the original intent.

Example:

  • Input: "What is RAG?"
  • Output: "What is Retrieval-Augmented Generation (RAG), how does it combine information retrieval with language generation, what are its key components and architecture, and what advantages does it provide for question-answering systems?"

Pros & Cons

βœ… Improved retrieval precision by adding relevant context and specificity

❌ Extra LLM call adds latency, may over-specify simple queries

Code Example

# Query expansion using system prompt to guide enrichment
async def expand_query(ctx: RunContext[None], query: str) -> str:
    """Expand a brief query into a more detailed, comprehensive version."""
    system_prompt = """You are a query expansion assistant. Take brief user queries and expand them into more detailed, comprehensive versions that:
1. Add relevant context and clarifications
2. Include related terminology and concepts
3. Specify what aspects should be covered
4. Maintain the original intent
5. Keep it as a single, coherent question

Expand the query to be 2-3x more detailed while staying focused."""

    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Expand this query: {query}"}
        ],
        temperature=0.3
    )

    expanded_query = response.choices[0].message.content.strip()
    return expanded_query  # Returns ONE enhanced query

Note: This strategy returns ONE enriched query. For generating multiple query variations, see Multi-Query RAG (Strategy 6).



6. Multi-Query RAG

Status: βœ… Code Example

File: rag_agent_advanced.py (Lines 114-187)

What It Is

Generates multiple different query variations/perspectives with an LLM (e.g., 3-4 variations), runs all searches concurrently, and deduplicates results. Unlike Query Expansion which enriches ONE query, this creates MULTIPLE distinct phrasings to capture different angles.

Pros & Cons

βœ… Comprehensive coverage, better recall on ambiguous queries

❌ 4x database queries (though parallelized), higher cost

Code Example

# Lines 114-187 in rag_agent_advanced.py
async def search_with_multi_query(query: str, limit: int = 5) -> str:
    """Search using multiple query variations in parallel."""
    # Generate variations
    queries = await expand_query_variations(query)  # Returns 4 queries

    # Execute all searches in parallel
    search_tasks = []
    for q in queries:
        query_embedding = await embedder.embed_query(q)
        task = db.fetch("SELECT * FROM match_chunks($1::vector, $2)", query_embedding, limit)
        search_tasks.append(task)

    results_lists = await asyncio.gather(*search_tasks)

    # Deduplicate by chunk ID, keep highest similarity
    seen = {}
    for results in results_lists:
        for row in results:
            if row['chunk_id'] not in seen or row['similarity'] > seen[row['chunk_id']]['similarity']:
                seen[row['chunk_id']] = row

    return format_results(sorted(seen.values(), key=lambda x: x['similarity'], reverse=True)[:limit])

Key Features:

  • Parallel execution with asyncio.gather()
  • Smart deduplication (keeps best score per chunk)



7. Context-Aware Chunking

Status: βœ… Code Example (Default)

File: ingestion/chunker.py (Lines 70-102)

What It Is

Intelligent document splitting that uses semantic similarity and document structure analysis to find natural chunk boundaries, rather than naive fixed-size splitting. This approach:

  • Analyzes document structure (headings, sections, paragraphs, tables)
  • Uses semantic analysis to identify topic boundaries
  • Respects linguistic coherence within chunks
  • Preserves hierarchical context (e.g., heading information)

Implementation Example: Docling's HybridChunker demonstrates this strategy through:

  • Token-aware chunking (uses actual tokenizer, not estimates)
  • Document structure preservation
  • Semantic coherence
  • Heading context inclusion

Pros & Cons

βœ… Free, fast, maintains document structure

❌ Slightly more complex than naive chunking

Code Example

# Lines 70-102 in chunker.py
from docling.chunking import HybridChunker
from transformers import AutoTokenizer

class DoclingHybridChunker:
    def __init__(self, config: ChunkingConfig):
        # Initialize tokenizer for token-aware chunking
        self.tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

        # Create HybridChunker
        self.chunker = HybridChunker(
            tokenizer=self.tokenizer,
            max_tokens=config.max_tokens,
            merge_peers=True  # Merge small adjacent chunks
        )

    async def chunk_document(self, docling_doc: DoclingDocument) -> List[DocumentChunk]:
        # Use HybridChunker to chunk the DoclingDocument
        chunks = list(self.chunker.chunk(dl_doc=docling_doc))

        # Contextualize each chunk (includes heading hierarchy)
        for chunk in chunks:
            contextualized_text = self.chunker.contextualize(chunk=chunk)
            # Store contextualized text as chunk content

Enabled by default during ingestion



8. Late Chunking

Status: πŸ“ Pseudocode Only

Why not in code examples: Docling HybridChunker provides similar benefits

What It Is

Embed the full document through transformer first, then chunk the token embeddings (not the text). Preserves full document context in each chunk's embedding.

Pros & Cons

βœ… Maintains full document context, leverages long-context models

❌ More complex than standard chunking

Pseudocode Concept

# From 08_late_chunking.py
def late_chunk(text: str, chunk_size=512) -> list:
    """Process full document through transformer BEFORE chunking."""
    # Step 1: Embed entire document (up to 8192 tokens)
    full_doc_token_embeddings = transformer_embed(text)  # Token-level embeddings

    # Step 2: Define chunk boundaries
    tokens = text.split()
    chunk_boundaries = range(0, len(tokens), chunk_size)

    # Step 3: Pool token embeddings for each chunk
    chunks_with_embeddings = []
    for start in chunk_boundaries:
        end = start + chunk_size
        chunk_text = ' '.join(tokens[start:end])

        # Mean pool the token embeddings (preserves full doc context!)
        chunk_embedding = mean_pool(full_doc_token_embeddings[start:end])
        chunks_with_embeddings.append((chunk_text, chunk_embedding))

    return chunks_with_embeddings

Alternative: Use Context-Aware Chunking (Docling) + Contextual Retrieval for similar benefits



9. Hierarchical RAG

Status: πŸ“ Pseudocode Only

Why not in code examples: Agentic RAG achieves similar goals for this demo

What It Is

Parent-child chunk relationships: Search small chunks for precision, return large parent chunks for context.

Metadata Enhancement: Can store metadata like section_type ("summary", "table", "detail") and heading_path to intelligently decide when to return just the child vs. the parent, or to include heading context.

Pros & Cons

βœ… Balances precision (search small) with context (return big)

❌ Requires parent-child database schema

Pseudocode Concept

# From 09_hierarchical_rag.py
def ingest_hierarchical(document: str, doc_title: str):
    """Create parent-child chunk structure with simple metadata."""
    parent_chunks = [document[i:i+2000] for i in range(0, len(document), 2000)]

    for parent_id, parent in enumerate(parent_chunks):
        # Store parent with metadata (section type, heading)
        metadata = {"heading": f"{doc_title} - Section {parent_id}", "type": "detail"}
        db.execute("INSERT INTO parent_chunks (id, content, metadata) VALUES (%s, %s, %s)",
                   (parent_id, parent, metadata))

        # Children: Small chunks with parent_id
        child_chunks = [parent[j:j+500] for j in range(0, len(parent), 500)]
        for child in child_chunks:
            embedding = get_embedding(child)
            db.execute(
                "INSERT INTO child_chunks (content, embedding, parent_id) VALUES (%s, %s, %s)",
                (child, embedding, parent_id)
            )

@agent.tool
def hierarchical_search(query: str) -> str:
    """Search children, return parents with heading context."""
    query_emb = get_embedding(query)

    # Find matching children and their parent metadata
    results = db.query(
        """SELECT p.content, p.metadata
           FROM child_chunks c
           JOIN parent_chunks p ON c.parent_id = p.id
           ORDER BY c.embedding <=> %s LIMIT 3""",
        query_emb
    )

    # Return parents with heading context
    return "\n\n".join([f"[{r['metadata']['heading']}]\n{r['content']}" for r in results])

Alternative: Use Agentic RAG (semantic search + full document retrieval) for similar flexibility



10. Self-Reflective RAG

Status: βœ… Code Example

File: rag_agent_advanced.py (Lines 361-482)

What It Is

Self-correcting search loop:

  1. Perform initial search
  2. LLM grades relevance (1-5 scale)
  3. If score < 3, refine query and search again

Pros & Cons

βœ… Self-correcting, improves over time

❌ Highest latency (2-3 LLM calls), most expensive

Code Example

# Lines 361-482 in rag_agent_advanced.py
async def search_with_self_reflection(query: str, limit: int = 5) -> str:
    """Self-reflective search: evaluate and refine if needed."""
    # Initial search
    results = await vector_search(query, limit)

    # Grade relevance
    grade_prompt = f"""Query: {query}
Retrieved: {results[:200]}...

Grade relevance 1-5. Respond with number only."""

    grade_response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": grade_prompt}],
        temperature=0
    )
    grade_score = int(grade_response.choices[0].message.content.split()[0])

    # If low relevance, refine and re-search
    note = ""
    if grade_score < 3:
        refine_prompt = f"""Query "{query}" returned low-relevance results.
Suggest improved query. Respond with query only."""

        refine_response = await client.chat.completions.create(...)  # same call pattern as the grading step
        refined_query = refine_response.choices[0].message.content.strip()
        results = await vector_search(refined_query, limit)
        note = f"[Refined from '{query}' to '{refined_query}']"

    return format_results(results, note)



11. Fine-tuned Embeddings

Status: πŸ“ Pseudocode Only

Why not in code examples: Requires domain-specific training data and infrastructure

What It Is

Train embedding models on domain-specific query-document pairs to improve retrieval accuracy for specialized domains (medical, legal, financial, etc.).

Pros & Cons

βœ… 5-10% accuracy gains, smaller models can outperform larger generic ones

❌ Requires training data, infrastructure, ongoing maintenance

Pseudocode Concept

# From 11_fine_tuned_embeddings.py
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

def prepare_training_data():
    """Create domain-specific query-document pairs."""
    return [
        ("What is EBITDA?", "financial_doc_about_ebitda.txt"),
        ("Explain capital expenditure", "capex_explanation.txt"),
        # ... thousands more domain-specific pairs
    ]

def fine_tune_model():
    """Fine-tune on domain data (one-time process)."""
    base_model = SentenceTransformer('all-MiniLM-L6-v2')
    train_examples = [InputExample(texts=[query, doc]) for query, doc in prepare_training_data()]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

    # Train with MultipleNegativesRankingLoss (in-batch negatives)
    train_loss = losses.MultipleNegativesRankingLoss(base_model)
    base_model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=3
    )

    base_model.save('./fine_tuned_model')

# Load fine-tuned model for embeddings
embedding_model = SentenceTransformer('./fine_tuned_model')

def get_embedding(text: str):
    """Use fine-tuned model for embeddings."""
    return embedding_model.encode(text)

Alternative: Use high-quality generic models (OpenAI text-embedding-3-small) and Contextual Retrieval



πŸ“Š Performance Comparison

Ingestion Strategies

| Strategy | Speed | Cost | Quality | Status |
|----------|-------|------|---------|--------|
| Simple Chunking | ⚑⚑⚑ | $ | ⭐⭐ | βœ… Available |
| Context-Aware (Docling) | ⚑⚑ | $ | ⭐⭐⭐⭐ | βœ… Default |
| Contextual Enrichment | ⚑ | $$$ | ⭐⭐⭐⭐⭐ | βœ… Optional |
| Late Chunking | ⚑⚑ | $ | ⭐⭐⭐⭐ | πŸ“ Pseudocode |
| Hierarchical | ⚑⚑ | $ | ⭐⭐⭐⭐ | πŸ“ Pseudocode |

Query Strategies

| Strategy | Latency | Cost | Precision | Recall | Status |
|----------|---------|------|-----------|--------|--------|
| Standard Search | ⚑⚑⚑ | $ | ⭐⭐⭐ | ⭐⭐⭐ | βœ… Default |
| Query Expansion | ⚑⚑ | $$ | ⭐⭐⭐ | ⭐⭐⭐⭐ | βœ… Code Example |
| Multi-Query | ⚑⚑ | $$ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | βœ… Code Example |
| Re-ranking | ⚑⚑ | $$ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | βœ… Code Example |
| Agentic | ⚑⚑ | $$ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | βœ… Code Example |
| Self-Reflective | ⚑ | $$$ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | βœ… Code Example |
| Knowledge Graphs | ⚑⚑ | $$$ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | πŸ“ Pseudocode |

πŸ“‚ Repository Structure

all-rag-strategies/
├── README.md                           # This file
├── docs/                               # Detailed research (theory + use cases)
│   ├── 01-reranking.md
│   ├── 02-agentic-rag.md
│   ├── ... (all 11 strategies)
│   └── 11-fine-tuned-embeddings.md
│
├── examples/                           # Simple < 50 line examples
│   ├── 01_reranking.py
│   ├── 02_agentic_rag.py
│   ├── ... (all 11 strategies)
│   ├── 11_fine_tuned_embeddings.py
│   └── README.md
│
└── implementation/                     # Educational code examples (NOT production)
    ├── rag_agent.py                    # Basic agent (single tool)
    ├── rag_agent_advanced.py           # Advanced agent (all strategies)
    ├── ingestion/
    │   ├── ingest.py                   # Main ingestion pipeline
    │   ├── chunker.py                  # Docling HybridChunker
    │   ├── embedder.py                 # OpenAI embeddings
    │   └── contextual_enrichment.py    # Anthropic's contextual retrieval
    ├── utils/
    │   ├── db_utils.py
    │   └── models.py
    ├── IMPLEMENTATION_GUIDE.md         # Exact line numbers + code
    ├── STRATEGIES.md                   # Detailed strategy documentation
    └── requirements-advanced.txt

πŸ› οΈ Tech Stack

| Component | Technology | Purpose |
|-----------|------------|---------|
| Agent Framework | Pydantic AI | Type-safe agents with tool calling |
| Vector Database | PostgreSQL + pgvector via Neon | Vector similarity search (Neon used for demonstrations) |
| Document Processing | Docling | Hybrid chunking + multi-format |
| Embeddings | OpenAI text-embedding-3-small | 1536-dim embeddings |
| Re-ranking | sentence-transformers | Cross-encoder for precision |
| LLM | OpenAI GPT-4o-mini | Query expansion, grading, refinement |
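
As a rough illustration of how these pieces connect (assumed table and column names, not the repository's actual helpers), generating a query embedding and running a pgvector similarity search might look like this:

import os

import numpy as np
import psycopg2
from openai import OpenAI
from pgvector.psycopg2 import register_vector

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_query(text: str) -> np.ndarray:
    """1536-dim embedding from text-embedding-3-small."""
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

def top_chunks(query: str, limit: int = 5) -> list[tuple[str, float]]:
    """Return (content, similarity) for the closest chunks, assuming a 'chunks' table."""
    embedding = embed_query(query)
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    register_vector(conn)
    with conn.cursor() as cur:
        # <=> is pgvector's cosine distance operator (smaller = more similar)
        cur.execute(
            "SELECT content, 1 - (embedding <=> %s) AS similarity "
            "FROM chunks ORDER BY embedding <=> %s LIMIT %s",
            (embedding, embedding, limit),
        )
        return cur.fetchall()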


πŸ§ͺ Testing

Automated Tests

The repository includes comprehensive test suites:

LoRA-SHIFT Paper Ingestion Tests:

cd implementation
pytest test_lora_shift_ingestion.py -v

Test Coverage:

  • βœ… Paper structure validation (17 tests)
  • βœ… Chunking logic verification
  • βœ… Embedding dimension checks
  • βœ… Retrieval query validation
  • βœ… Error handling scenarios
  • βœ… Metadata extraction
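
The chunking and dimension checks above are roughly of this flavor. This is a hypothetical sketch with made-up helpers, not the contents of test_lora_shift_ingestion.py:

def naive_chunk(text: str, max_tokens: int = 512) -> list[str]:
    """Toy word-based chunker standing in for the real pipeline."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

def test_chunks_respect_max_tokens():
    chunks = naive_chunk("lora shift " * 3000, max_tokens=512)
    assert chunks and all(len(c.split()) <= 512 for c in chunks)

def test_embedding_dimension_matches_schema():
    expected_dim = 1536  # text-embedding-3-small, as used by the ingestion pipeline
    fake_embedding = [0.0] * expected_dim  # stand-in for a real embedder call
    assert len(fake_embedding) == expected_dim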

Run all tests:

pytest -v --tb=short

Manual Testing

Test ingestion pipeline:

# Ingest test paper
cd implementation
python -m ingestion.ingest --documents documents/LoRA-SHIFT-Final-Research-Paper.md

# Expected output:
# βœ“ Processed 1 document
# βœ“ Created ~30-50 chunks
# βœ“ Generated embeddings
# βœ“ Stored in database

Test retrieval with sample queries:

python rag_agent_advanced.py

# Try these queries:
# - "What is LoRA-SHIFT?"
# - "How does LoRA-SHIFT improve over standard LoRA?"
# - "What datasets were used in LoRA-SHIFT experiments?"
# - "What is the computational overhead of LoRA-SHIFT?"

πŸ› Troubleshooting

Quick fixes for common issues:

Setup Issues

# No module found
pip install -r implementation/requirements-advanced.txt

# Database connection failed
# Check DATABASE_URL in .env

# pgvector extension not found
sudo apt-get install postgresql-16-pgvector
psql $DATABASE_URL -c "CREATE EXTENSION IF NOT EXISTS vector;"

Ingestion Issues

# Documents not processing
# - Check file format (PDF, DOCX, MD, TXT supported)
# - Verify files exist in documents/ folder
# - Check file permissions

# Out of memory
# - Process files one at a time
# - Use smaller chunk sizes
# - Reduce batch size

Retrieval Issues

# No results returned
# - Verify data exists: SELECT COUNT(*) FROM chunks;
# - Check embeddings: SELECT COUNT(*) FROM chunks WHERE embedding IS NOT NULL;
# - Rebuild index: python -m ingestion.ingest --documents ./documents

For detailed solutions, see docs/guides/TROUBLESHOOTING.md


🀝 Contributing

This is a demonstration/educational project. Feel free to:

  • Fork and adapt for your use case
  • Report issues or suggestions
  • Share your own RAG strategy implementations

πŸ™ Acknowledgments

  • Anthropic - Contextual Retrieval methodology
  • Docling Team - HybridChunker implementation
  • Jina AI - Late chunking concept
  • Pydantic Team - Pydantic AI framework
  • Zep - Graphiti knowledge graph framework
  • Sentence Transformers - Cross-encoder models
