
Advanced RAG Strategies - Complete Guide

A comprehensive resource for understanding and implementing advanced Retrieval-Augmented Generation strategies.

This repository demonstrates 16 RAG strategies with pseudocode examples, educational code implementations, and detailed strategy guides.

Perfect for: AI engineers, ML practitioners, data science students, and anyone building RAG systems.


πŸ“š Table of Contents

  1. Strategy Overview
  2. Quick Start
  3. Prerequisites ⭐
  4. For Students & Learners ⭐ NEW
  5. Pseudocode Examples
  6. Code Examples
  7. Detailed Strategy Guide
  8. Repository Structure
  9. Testing ⭐ NEW
  10. Troubleshooting ⭐ NEW

🎯 Strategy Overview

| # | Strategy | Status | Use Case | Pros | Cons |
|---|----------|--------|----------|------|------|
| 1 | Re-ranking | βœ… Code Example | Precision-critical | Highly accurate results | Slower, more compute |
| 2 | Agentic RAG | βœ… Code Example | Flexible retrieval needs | Autonomous tool selection | More complex logic |
| 3 | Knowledge Graphs | πŸ“ Pseudocode Only | Relationship-heavy | Captures connections | Infrastructure overhead |
| 4 | Contextual Retrieval | βœ… Code Example | Critical documents | 35-49% better accuracy | High ingestion cost |
| 5 | Query Expansion | βœ… Code Example | Ambiguous queries | Better recall, multiple perspectives | Extra LLM call, higher cost |
| 6 | Multi-Query RAG | βœ… Code Example | Broad searches | Comprehensive coverage | Multiple API calls |
| 7 | Context-Aware Chunking | βœ… Code Example | All documents | Semantic coherence | Slightly slower ingestion |
| 8 | Late Chunking | πŸ“ Pseudocode Only | Context preservation | Full document context | Requires long-context models |
| 9 | Hierarchical RAG | πŸ“ Pseudocode Only | Complex documents | Precision + context | Complex setup |
| 10 | Self-Reflective RAG | βœ… Code Example | Research queries | Self-correcting | Highest latency |
| 11 | Fine-tuned Embeddings | πŸ“ Pseudocode Only | Domain-specific | Best accuracy | Training required |
| 12 | Hybrid Retrieval | βœ… Code Example | Keyword-sensitive | Balanced recall | More complex infra |
| 13 | Fact Verification | βœ… Code Example | High-stakes domains | Traceability | Higher latency |
| 14 | Multi-hop Reasoning | βœ… Code Example | Complex questions | Solves compositional queries | Expensive |
| 15 | Uncertainty Estimation | βœ… Code Example | Risk-sensitive apps | Trustworthy outputs | More compute |
| 16 | Adaptive Chunking | βœ… Code Example | Heterogeneous docs | Better precision | Complex ingestion |

Legend

  • βœ… Code Example: Full code in implementation/ (educational, not production-ready)
  • πŸ“ Pseudocode Only: Short concept-only example in examples/ (no full implementation)

πŸ“‹ Prerequisites

System Requirements

  • Python 3.9+
  • PostgreSQL with pgvector extension
    • Cloud options: Neon, Supabase
    • Self-hosted: PostgreSQL 12+ with pgvector

System Dependencies

Ubuntu/Debian:

sudo apt-get update && sudo apt-get install -y \
    ffmpeg \
    build-essential \
    gcc \
    postgresql-client \
    libpq-dev

macOS:

brew install ffmpeg postgresql
# Xcode Command Line Tools (includes gcc, build tools)
xcode-select --install

Windows:

Install the equivalents of the packages below (ffmpeg, PostgreSQL client tools, and a C compiler) with your preferred package manager, such as Chocolatey or winget.

What each package does:

  • ffmpeg - Audio/video processing for Whisper transcription
  • build-essential & gcc - Compilers for building Python packages (psycopg2, etc.)
  • postgresql-client - PostgreSQL command-line tools (psql)
  • libpq-dev - PostgreSQL development headers for psycopg2

API Keys

  • OPENAI_API_KEY - used for embeddings (text-embedding-3-small) and the LLM (GPT-4o-mini); set it in .env (see Quick Start below)

πŸš€ Quick Start

1. View Pseudocode Examples (No Setup Required)

cd examples
# Browse simple, < 50 line examples for each strategy
cat 01_reranking.py

2. Interactive Strategy Lab (Recommended)

Try strategies side-by-side with a web UI!

cd implementation

# Install dependencies
pip install -r requirements-advanced.txt

# Setup environment
cp .env.example .env
# Edit .env: Add DATABASE_URL and OPENAI_API_KEY

# Initialize database
psql $DATABASE_URL < sql/schema.sql

# Ingest sample documents
python -m ingestion.ingest --documents ./documents

# Launch Streamlit app
streamlit run app.py

The app opens at http://localhost:8501

Features:

  • πŸ§ͺ Strategy Lab: Compare up to 3 strategies side-by-side
  • πŸ“Š Visual Metrics: See latency, tokens, and costs for each approach
  • πŸ“ File Upload: Test with your own documents
  • πŸ’‘ Educational: Tooltips explain each strategy's trade-offs

Quick guide: docs/implementation/QUICK_START.md

3. CLI Agent (Command Line)

Note: These are educational examples that show how the strategies work in real code; they are not guaranteed to be fully functional or production-ready.

cd implementation

# Install dependencies (if not done above)
pip install -r requirements-advanced.txt

# Setup environment
cp .env.example .env
# Edit .env: Add DATABASE_URL and OPENAI_API_KEY

# Ingest documents
python -m ingestion.ingest --documents ./documents --chunker adaptive

# Run the advanced agent
python rag_agent_advanced.py

πŸŽ“ For Students & Learners

New to RAG? Start here!

This repository includes comprehensive learning resources:

πŸ“– Student Guide

docs/guides/STUDENT_GUIDE.md - Your complete learning path:

  • Structured 9-week curriculum from beginner to advanced
  • Core concepts explained with examples
  • Practical exercises and project ideas
  • Common pitfalls and how to avoid them
  • Production deployment considerations

Quick learning path:

Week 1-2:  Basics (chunking, embeddings, vector search)
Week 3-4:  Query enhancement (expansion, multi-query)
Week 5-6:  Advanced retrieval (hybrid, reranking, self-reflective)
Week 7-8:  Generation enhancement (fact verification, multi-hop)
Week 9+:   Specialized topics (knowledge graphs, fine-tuning)

πŸ”§ Troubleshooting Guide

docs/guides/TROUBLESHOOTING.md - Solutions to common issues:

  • Setup problems (dependencies, database, API keys)
  • Ingestion errors (file processing, embeddings, memory)
  • Retrieval issues (no results, low relevance, slow queries)
  • Agent problems (hallucinations, tool calling)
  • Testing and debugging

πŸ“ Test Paper

LoRA-SHIFT Research Paper - A comprehensive test document:

  • 19,000+ characters of technical content
  • Structured research paper format
  • Perfect for testing RAG strategies
  • Includes abstract, methodology, results, and appendices

πŸ’» Pseudocode Examples

All strategies have simple, working pseudocode examples in examples/.

Each file is < 50 lines and demonstrates:

  • Core concept
  • How to implement with Pydantic AI
  • Integration with PG Vector

Example (05_query_expansion.py):

from pydantic_ai import Agent
import psycopg2
from pgvector.psycopg2 import register_vector

agent = Agent('openai:gpt-4o', system_prompt='RAG assistant with query expansion')

@agent.tool
def expand_query(query: str) -> list[str]:
    """Expand single query into multiple variations"""
    expansion_prompt = f"Generate 3 variations of: '{query}'"
    variations = llm_generate(expansion_prompt)
    return [query] + variations

@agent.tool
def search_knowledge_base(queries: list[str]) -> str:
    """Search vector DB with multiple queries"""
    all_results = []
    for query in queries:
        query_embedding = get_embedding(query)
        results = db.query('SELECT * FROM chunks ORDER BY embedding <=> %s', query_embedding)
        all_results.extend(results)
    return deduplicate(all_results)

Browse all pseudocode: examples/README.md


πŸ—οΈ Code Examples

⚠️ Important Note: The implementation/ folder contains educational code examples based on a real implementation; it is not production-ready. The strategies are included to demonstrate concepts and show how they work in real code. They are not guaranteed to be fully working, and bundling every strategy into one codebase is not ideal for production (which is why this has not been refined for production use). Treat these as learning references and starting points, think of the repo as an "off-the-shelf RAG implementation" with strategies added for demonstration, and use it as inspiration for your own production systems.

Architecture

implementation/
├── rag_agent_advanced.py          # Agent with all strategy examples
├── ingestion/
│   ├── ingest.py                  # Document ingestion pipeline
│   ├── chunker.py                 # Context-aware chunking (Docling)
│   ├── embedder.py                # OpenAI embeddings
│   └── contextual_enrichment.py   # Anthropic's contextual retrieval
├── utils/
│   ├── db_utils.py                # Database utilities
│   └── models.py                  # Pydantic models
└── IMPLEMENTATION_GUIDE.md        # Detailed implementation reference

Tech Stack:

  • Pydantic AI - Agent framework
  • PostgreSQL + pgvector - Vector search
  • Docling - Hybrid chunking
  • OpenAI - Embeddings and LLM

πŸ“– Detailed Strategy Guide

βœ… Code Examples (Educational)


1. Re-ranking

Status: βœ… Code Example

File: rag_agent_advanced.py (Lines 194-256)

What It Is

Two-stage retrieval: a fast vector search first retrieves a broad candidate set (20-50+ chunks), then a reranking model filters them down to the top 5.

Pros & Cons

βœ… Significantly better precision, more knowledge considered without overwhelming LLM

❌ Slightly slower than pure vector search, uses more compute

Code Example

# Lines 194-256 in rag_agent_advanced.py
async def search_with_reranking(ctx: RunContext[None], query: str, limit: int = 5) -> str:
    """Two-stage retrieval with cross-encoder re-ranking."""
    initialize_reranker()  # Loads cross-encoder/ms-marco-MiniLM-L-6-v2

    # Stage 1: Fast vector retrieval (retrieve 20 candidates)
    candidate_limit = min(limit * 4, 20)
    results = await vector_search(query, candidate_limit)

    # Stage 2: Re-rank with cross-encoder
    pairs = [[query, row['content']] for row in results]
    scores = reranker.predict(pairs)

    # Sort by new scores and return top N
    reranked = sorted(zip(results, scores), key=lambda x: x[1], reverse=True)[:limit]
    return format_results(reranked)

Model: cross-encoder/ms-marco-MiniLM-L-6-v2



2. Agentic RAG

Status: βœ… Code Example

File: rag_agent_advanced.py (Lines 263-354)

What It Is

The agent autonomously chooses between multiple retrieval tools, for example:

  1. search_knowledge_base() - Semantic search over chunks (can include hybrid search: dense vector + sparse keyword/BM25)
  2. retrieve_full_document() - Pull entire documents when chunks aren't enough

Note: Hybrid search (combining dense vector embeddings with sparse keyword search like BM25) is typically implemented as part of the agentic retrieval strategy, giving the agent access to both semantic similarity and keyword matching.
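
Hybrid retrieval isn't shown as its own snippet in this README, so here is a minimal sketch of the idea under assumed names (a chunks table with id, content, and a pgvector embedding column; Postgres full-text search standing in for BM25), with the two result lists merged by Reciprocal Rank Fusion. It is illustrative only, not the repository's implementation:

import numpy as np
from pgvector.psycopg2 import register_vector

def hybrid_search(conn, query: str, query_embedding: np.ndarray, limit: int = 5) -> list[str]:
    """Dense (pgvector) + sparse (Postgres full-text) retrieval fused with RRF."""
    register_vector(conn)
    with conn.cursor() as cur:
        # Dense leg: cosine distance over chunk embeddings
        cur.execute(
            "SELECT id, content FROM chunks ORDER BY embedding <=> %s LIMIT 20",
            (query_embedding,),
        )
        dense = cur.fetchall()

        # Sparse leg: keyword match via full-text search (a stand-in for BM25)
        cur.execute(
            """SELECT id, content FROM chunks
               WHERE to_tsvector('english', content) @@ plainto_tsquery('english', %s)
               ORDER BY ts_rank(to_tsvector('english', content),
                                plainto_tsquery('english', %s)) DESC
               LIMIT 20""",
            (query, query),
        )
        sparse = cur.fetchall()

    # Reciprocal Rank Fusion: reward chunks ranked highly by either retriever
    fused = {}
    for results in (dense, sparse):
        for rank, (chunk_id, content) in enumerate(results):
            entry = fused.setdefault(chunk_id, {"content": content, "score": 0.0})
            entry["score"] += 1.0 / (60 + rank)  # k=60 is a common RRF constant

    ranked = sorted(fused.values(), key=lambda e: e["score"], reverse=True)
    return [e["content"] for e in ranked[:limit]]

A production system would typically replace the full-text leg with a dedicated BM25 index, but the fusion step stays the same.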

Pros & Cons

βœ… Flexible, adapts to query needs automatically

❌ More complex, less predictable behavior

Code Example

# Tool 1: Semantic search (Lines 263-305)
@agent.tool
async def search_knowledge_base(query: str, limit: int = 5) -> str:
    """Standard semantic search over document chunks."""
    query_embedding = await embedder.embed_query(query)
    results = await db.match_chunks(query_embedding, limit)
    return format_results(results)

# Tool 2: Full document retrieval (Lines 308-354)
@agent.tool
async def retrieve_full_document(document_title: str) -> str:
    """Retrieve complete document when chunks lack context."""
    result = await db.query(
        "SELECT title, content FROM documents WHERE title ILIKE %s",
        f"%{document_title}%"
    )
    return f"**{result['title']}**\n\n{result['content']}"

Example Flow:

User: "What's the full refund policy?"
Agent:
  1. Calls search_knowledge_base("refund policy")
  2. Finds chunks mentioning "refund_policy.pdf"
  3. Calls retrieve_full_document("refund policy")
  4. Returns complete document



3. Knowledge Graphs

Status: πŸ“ Pseudocode Only (Graphiti)

Why not in code examples: Requires Neo4j infrastructure, entity extraction

What It Is

Combines vector search with graph databases (such as Neo4j/FalkorDB) to capture entity relationships.

Pros & Cons

βœ… Captures relationships vectors miss, great for interconnected data

❌ Requires Neo4j setup, entity extraction, graph maintenance, slower and more expensive

Pseudocode Concept (Graphiti)

# From 03_knowledge_graphs.py (with Graphiti)
from graphiti_core import Graphiti
from graphiti_core.nodes import EpisodeType

# Initialize Graphiti (connects to Neo4j)
graphiti = Graphiti("neo4j://localhost:7687", "neo4j", "password")

async def ingest_document(text: str, source: str):
    """Ingest document into Graphiti knowledge graph."""
    # Graphiti automatically extracts entities and relationships
    await graphiti.add_episode(
        name=source,
        episode_body=text,
        source=EpisodeType.text,
        source_description=f"Document: {source}"
    )

@agent.tool
async def search_knowledge_graph(query: str) -> str:
    """Hybrid search: semantic + keyword + graph traversal."""
    # Graphiti combines:
    # - Semantic similarity (embeddings)
    # - BM25 keyword search
    # - Graph structure traversal
    # - Temporal context (when was this true?)

    results = await graphiti.search(query=query, num_results=5)

    return format_graph_results(results)

Framework: Graphiti from Zep - Temporal knowledge graphs for agents



4. Contextual Retrieval

Status: βœ… Code Example (Optional)

File: ingestion/contextual_enrichment.py (Lines 41-89)

What It Is

Anthropic's method: Adds document-level context to each chunk before embedding. LLM generates 1-2 sentences explaining what the chunk discusses in relation to the whole document.

Pros & Cons

βœ… 35-49% reduction in retrieval failures, chunks are self-contained

❌ Expensive (1 LLM call per chunk), slower ingestion

Before/After Example

BEFORE:
"Clean data is essential. Remove duplicates, handle missing values..."

AFTER:
"This chunk from 'ML Best Practices' discusses data preparation techniques
for machine learning workflows.

Clean data is essential. Remove duplicates, handle missing values..."

Code Example

# Lines 41-89 in contextual_enrichment.py
async def enrich_chunk(chunk: str, document: str, title: str) -> str:
    """Add contextual prefix to a chunk."""
    prompt = f"""<document>
Title: {title}
{document[:4000]}
</document>

<chunk>
{chunk}
</chunk>

Provide brief context explaining what this chunk discusses.
Format: "This chunk from [title] discusses [explanation]." """

    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=150
    )

    context = response.choices[0].message.content.strip()
    return f"{context}\n\n{chunk}"

Enable with: python -m ingestion.ingest --documents ./docs --contextual



5. Query Expansion

Status: βœ… Code Example

File: rag_agent_advanced.py (Lines 72-107)

What It Is

Expands a single brief query into a more detailed, comprehensive version by adding context, related terms, and clarifying intent. Uses an LLM with a system prompt that describes how to enrich the query while maintaining the original intent.

Example:

  • Input: "What is RAG?"
  • Output: "What is Retrieval-Augmented Generation (RAG), how does it combine information retrieval with language generation, what are its key components and architecture, and what advantages does it provide for question-answering systems?"

Pros & Cons

βœ… Improved retrieval precision by adding relevant context and specificity

❌ Extra LLM call adds latency, may over-specify simple queries

Code Example

# Query expansion using system prompt to guide enrichment
async def expand_query(ctx: RunContext[None], query: str) -> str:
    """Expand a brief query into a more detailed, comprehensive version."""
    system_prompt = """You are a query expansion assistant. Take brief user queries and expand them into more detailed, comprehensive versions that:
1. Add relevant context and clarifications
2. Include related terminology and concepts
3. Specify what aspects should be covered
4. Maintain the original intent
5. Keep it as a single, coherent question

Expand the query to be 2-3x more detailed while staying focused."""

    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Expand this query: {query}"}
        ],
        temperature=0.3
    )

    expanded_query = response.choices[0].message.content.strip()
    return expanded_query  # Returns ONE enhanced query

Note: This strategy returns ONE enriched query. For generating multiple query variations, see Multi-Query RAG (Strategy 6).



6. Multi-Query RAG

Status: βœ… Code Example

File: rag_agent_advanced.py (Lines 114-187)

What It Is

Generates multiple different query variations/perspectives with an LLM (e.g., 3-4 variations), runs all searches concurrently, and deduplicates results. Unlike Query Expansion which enriches ONE query, this creates MULTIPLE distinct phrasings to capture different angles.

Pros & Cons

βœ… Comprehensive coverage, better recall on ambiguous queries

❌ 4x database queries (though parallelized), higher cost

Code Example

# Lines 114-187 in rag_agent_advanced.py
async def search_with_multi_query(query: str, limit: int = 5) -> str:
    """Search using multiple query variations in parallel."""
    # Generate variations
    queries = await expand_query_variations(query)  # Returns 4 queries

    # Execute all searches in parallel
    search_tasks = []
    for q in queries:
        query_embedding = await embedder.embed_query(q)
        task = db.fetch("SELECT * FROM match_chunks($1::vector, $2)", query_embedding, limit)
        search_tasks.append(task)

    results_lists = await asyncio.gather(*search_tasks)

    # Deduplicate by chunk ID, keep highest similarity
    seen = {}
    for results in results_lists:
        for row in results:
            if row['chunk_id'] not in seen or row['similarity'] > seen[row['chunk_id']]['similarity']:
                seen[row['chunk_id']] = row

    return format_results(sorted(seen.values(), key=lambda x: x['similarity'], reverse=True)[:limit])

Key Features:

  • Parallel execution with asyncio.gather()
  • Smart deduplication (keeps best score per chunk)



7. Context-Aware Chunking

Status: βœ… Code Example (Default)

File: ingestion/chunker.py (Lines 70-102)

What It Is

Intelligent document splitting that uses semantic similarity and document structure analysis to find natural chunk boundaries, rather than naive fixed-size splitting. This approach:

  • Analyzes document structure (headings, sections, paragraphs, tables)
  • Uses semantic analysis to identify topic boundaries
  • Respects linguistic coherence within chunks
  • Preserves hierarchical context (e.g., heading information)

Implementation Example: Docling's HybridChunker demonstrates this strategy through:

  • Token-aware chunking (uses actual tokenizer, not estimates)
  • Document structure preservation
  • Semantic coherence
  • Heading context inclusion

Pros & Cons

βœ… Free, fast, maintains document structure

❌ Slightly more complex than naive chunking

Code Example

# Lines 70-102 in chunker.py
from docling.chunking import HybridChunker
from transformers import AutoTokenizer

class DoclingHybridChunker:
    def __init__(self, config: ChunkingConfig):
        # Initialize tokenizer for token-aware chunking
        self.tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

        # Create HybridChunker
        self.chunker = HybridChunker(
            tokenizer=self.tokenizer,
            max_tokens=config.max_tokens,
            merge_peers=True  # Merge small adjacent chunks
        )

    async def chunk_document(self, docling_doc: DoclingDocument) -> List[DocumentChunk]:
        # Use HybridChunker to chunk the DoclingDocument
        chunks = list(self.chunker.chunk(dl_doc=docling_doc))

        # Contextualize each chunk (includes heading hierarchy)
        for chunk in chunks:
            contextualized_text = self.chunker.contextualize(chunk=chunk)
            # Store contextualized text as chunk content

Enabled by default during ingestion



8. Late Chunking

Status: πŸ“ Pseudocode Only

Why not in code examples: Docling HybridChunker provides similar benefits

What It Is

Embed the full document through transformer first, then chunk the token embeddings (not the text). Preserves full document context in each chunk's embedding.

Pros & Cons

βœ… Maintains full document context, leverages long-context models

❌ More complex than standard chunking

Pseudocode Concept

# From 08_late_chunking.py
def late_chunk(text: str, chunk_size=512) -> list:
    """Process full document through transformer BEFORE chunking."""
    # Step 1: Embed entire document (up to 8192 tokens)
    full_doc_token_embeddings = transformer_embed(text)  # Token-level embeddings

    # Step 2: Define chunk boundaries
    tokens = text.split()
    chunk_boundaries = range(0, len(tokens), chunk_size)

    # Step 3: Pool token embeddings for each chunk
    chunks_with_embeddings = []
    for start in chunk_boundaries:
        end = start + chunk_size
        chunk_text = ' '.join(tokens[start:end])

        # Mean pool the token embeddings (preserves full doc context!)
        chunk_embedding = mean_pool(full_doc_token_embeddings[start:end])
        chunks_with_embeddings.append((chunk_text, chunk_embedding))

    return chunks_with_embeddings

Alternative: Use Context-Aware Chunking (Docling) + Contextual Retrieval for similar benefits



9. Hierarchical RAG

Status: πŸ“ Pseudocode Only

Why not in code examples: Agentic RAG achieves similar goals for this demo

What It Is

Parent-child chunk relationships: Search small chunks for precision, return large parent chunks for context.

Metadata Enhancement: Can store metadata like section_type ("summary", "table", "detail") and heading_path to intelligently decide when to return just the child vs. the parent, or to include heading context.

Pros & Cons

βœ… Balances precision (search small) with context (return big)

❌ Requires parent-child database schema

Pseudocode Concept

# From 09_hierarchical_rag.py
def ingest_hierarchical(document: str, doc_title: str):
    """Create parent-child chunk structure with simple metadata."""
    parent_chunks = [document[i:i+2000] for i in range(0, len(document), 2000)]

    for parent_id, parent in enumerate(parent_chunks):
        # Store parent with metadata (section type, heading)
        metadata = {"heading": f"{doc_title} - Section {parent_id}", "type": "detail"}
        db.execute("INSERT INTO parent_chunks (id, content, metadata) VALUES (%s, %s, %s)",
                   (parent_id, parent, metadata))

        # Children: Small chunks with parent_id
        child_chunks = [parent[j:j+500] for j in range(0, len(parent), 500)]
        for child in child_chunks:
            embedding = get_embedding(child)
            db.execute(
                "INSERT INTO child_chunks (content, embedding, parent_id) VALUES (%s, %s, %s)",
                (child, embedding, parent_id)
            )

@agent.tool
def hierarchical_search(query: str) -> str:
    """Search children, return parents with heading context."""
    query_emb = get_embedding(query)

    # Find matching children and their parent metadata
    results = db.query(
        """SELECT p.content, p.metadata
           FROM child_chunks c
           JOIN parent_chunks p ON c.parent_id = p.id
           ORDER BY c.embedding <=> %s LIMIT 3""",
        query_emb
    )

    # Return parents with heading context
    return "\n\n".join([f"[{r['metadata']['heading']}]\n{r['content']}" for r in results])

Alternative: Use Agentic RAG (semantic search + full document retrieval) for similar flexibility



10. Self-Reflective RAG

Status: βœ… Code Example

File: rag_agent_advanced.py (Lines 361-482)

What It Is

Self-correcting search loop:

  1. Perform initial search
  2. LLM grades relevance (1-5 scale)
  3. If score < 3, refine query and search again

Pros & Cons

βœ… Self-correcting, improves over time

❌ Highest latency (2-3 LLM calls), most expensive

Code Example

# Lines 361-482 in rag_agent_advanced.py
async def search_with_self_reflection(query: str, limit: int = 5) -> str:
    """Self-reflective search: evaluate and refine if needed."""
    # Initial search
    results = await vector_search(query, limit)

    # Grade relevance
    grade_prompt = f"""Query: {query}
Retrieved: {results[:200]}...

Grade relevance 1-5. Respond with number only."""

    grade_response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": grade_prompt}],
        temperature=0
    )
    grade_score = int(grade_response.choices[0].message.content.split()[0])

    # If low relevance, refine and re-search
    note = ""
    if grade_score < 3:
        refine_prompt = f"""Query "{query}" returned low-relevance results.
Suggest improved query. Respond with query only."""

        refine_response = await client.chat.completions.create(...)  # same call pattern as the grading step
        refined_query = refine_response.choices[0].message.content.strip()
        results = await vector_search(refined_query, limit)
        note = f"[Refined from '{query}' to '{refined_query}']"

    return format_results(results, note)



11. Fine-tuned Embeddings

Status: πŸ“ Pseudocode Only

Why not in code examples: Requires domain-specific training data and infrastructure

What It Is

Train embedding models on domain-specific query-document pairs to improve retrieval accuracy for specialized domains (medical, legal, financial, etc.).

Pros & Cons

βœ… 5-10% accuracy gains, smaller models can outperform larger generic ones

❌ Requires training data, infrastructure, ongoing maintenance

Pseudocode Concept

# From 11_fine_tuned_embeddings.py
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

def prepare_training_data():
    """Create domain-specific query-document pairs."""
    return [
        ("What is EBITDA?", "financial_doc_about_ebitda.txt"),
        ("Explain capital expenditure", "capex_explanation.txt"),
        # ... thousands more domain-specific pairs
    ]

def fine_tune_model():
    """Fine-tune on domain data (one-time process)."""
    base_model = SentenceTransformer('all-MiniLM-L6-v2')
    train_examples = [InputExample(texts=[query, doc]) for query, doc in prepare_training_data()]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

    # Train with MultipleNegativesRankingLoss (in-batch negatives)
    train_loss = losses.MultipleNegativesRankingLoss(base_model)
    base_model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=3
    )

    base_model.save('./fine_tuned_model')

# Load fine-tuned model for embeddings
embedding_model = SentenceTransformer('./fine_tuned_model')

def get_embedding(text: str):
    """Use fine-tuned model for embeddings."""
    return embedding_model.encode(text)

Alternative: Use high-quality generic models (OpenAI text-embedding-3-small) and Contextual Retrieval



πŸ“Š Performance Comparison

Ingestion Strategies

| Strategy | Speed | Cost | Quality | Status |
|----------|-------|------|---------|--------|
| Simple Chunking | ⚑⚑⚑ | $ | ⭐⭐ | βœ… Available |
| Context-Aware (Docling) | ⚑⚑ | $ | ⭐⭐⭐⭐ | βœ… Default |
| Contextual Enrichment | ⚑ | $$$ | ⭐⭐⭐⭐⭐ | βœ… Optional |
| Late Chunking | ⚑⚑ | $ | ⭐⭐⭐⭐ | πŸ“ Pseudocode |
| Hierarchical | ⚑⚑ | $ | ⭐⭐⭐⭐ | πŸ“ Pseudocode |

Query Strategies

| Strategy | Latency | Cost | Precision | Recall | Status |
|----------|---------|------|-----------|--------|--------|
| Standard Search | ⚑⚑⚑ | $ | ⭐⭐⭐ | ⭐⭐⭐ | βœ… Default |
| Query Expansion | ⚑⚑ | $$ | ⭐⭐⭐ | ⭐⭐⭐⭐ | βœ… Code Example |
| Multi-Query | ⚑⚑ | $$ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | βœ… Code Example |
| Re-ranking | ⚑⚑ | $$ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | βœ… Code Example |
| Agentic | ⚑⚑ | $$ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | βœ… Code Example |
| Self-Reflective | ⚑ | $$$ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | βœ… Code Example |
| Knowledge Graphs | ⚑⚑ | $$$ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | πŸ“ Pseudocode |

πŸ“‚ Repository Structure

all-rag-strategies/
├── README.md                           # This file
├── docs/                               # Detailed research (theory + use cases)
│   ├── 01-reranking.md
│   ├── 02-agentic-rag.md
│   ├── ... (all 11 strategies)
│   └── 11-fine-tuned-embeddings.md
│
├── examples/                           # Simple < 50 line examples
│   ├── 01_reranking.py
│   ├── 02_agentic_rag.py
│   ├── ... (all 11 strategies)
│   ├── 11_fine_tuned_embeddings.py
│   └── README.md
│
└── implementation/                     # Educational code examples (NOT production)
    ├── rag_agent.py                    # Basic agent (single tool)
    ├── rag_agent_advanced.py           # Advanced agent (all strategies)
    ├── ingestion/
    │   ├── ingest.py                   # Main ingestion pipeline
    │   ├── chunker.py                  # Docling HybridChunker
    │   ├── embedder.py                 # OpenAI embeddings
    │   └── contextual_enrichment.py    # Anthropic's contextual retrieval
    ├── utils/
    │   ├── db_utils.py
    │   └── models.py
    ├── IMPLEMENTATION_GUIDE.md         # Exact line numbers + code
    ├── STRATEGIES.md                   # Detailed strategy documentation
    └── requirements-advanced.txt

πŸ› οΈ Tech Stack

| Component | Technology | Purpose |
|-----------|------------|---------|
| Agent Framework | Pydantic AI | Type-safe agents with tool calling |
| Vector Database | PostgreSQL + pgvector via Neon | Vector similarity search (Neon used for demonstrations) |
| Document Processing | Docling | Hybrid chunking + multi-format |
| Embeddings | OpenAI text-embedding-3-small | 1536-dim embeddings |
| Re-ranking | sentence-transformers | Cross-encoder for precision |
| LLM | OpenAI GPT-4o-mini | Query expansion, grading, refinement |
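
As a rough illustration of how these pieces connect (assumed table and column names, not the repository's actual helpers), generating a query embedding and running a pgvector similarity search might look like this:

import os

import numpy as np
import psycopg2
from openai import OpenAI
from pgvector.psycopg2 import register_vector

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_query(text: str) -> np.ndarray:
    """1536-dim embedding from text-embedding-3-small."""
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

def top_chunks(query: str, limit: int = 5) -> list[tuple[str, float]]:
    """Return (content, similarity) for the closest chunks, assuming a 'chunks' table."""
    embedding = embed_query(query)
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    register_vector(conn)
    with conn.cursor() as cur:
        # <=> is pgvector's cosine distance operator (smaller = more similar)
        cur.execute(
            "SELECT content, 1 - (embedding <=> %s) AS similarity "
            "FROM chunks ORDER BY embedding <=> %s LIMIT %s",
            (embedding, embedding, limit),
        )
        return cur.fetchall()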


πŸ§ͺ Testing

Automated Tests

The repository includes comprehensive test suites:

LoRA-SHIFT Paper Ingestion Tests:

cd implementation
pytest test_lora_shift_ingestion.py -v

Test Coverage:

  • βœ… Paper structure validation (17 tests)
  • βœ… Chunking logic verification
  • βœ… Embedding dimension checks
  • βœ… Retrieval query validation
  • βœ… Error handling scenarios
  • βœ… Metadata extraction
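
The chunking and dimension checks above are roughly of this flavor. This is a hypothetical sketch with made-up helpers, not the contents of test_lora_shift_ingestion.py:

def naive_chunk(text: str, max_tokens: int = 512) -> list[str]:
    """Toy word-based chunker standing in for the real pipeline."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

def test_chunks_respect_max_tokens():
    chunks = naive_chunk("lora shift " * 3000, max_tokens=512)
    assert chunks and all(len(c.split()) <= 512 for c in chunks)

def test_embedding_dimension_matches_schema():
    expected_dim = 1536  # text-embedding-3-small, as used by the ingestion pipeline
    fake_embedding = [0.0] * expected_dim  # stand-in for a real embedder call
    assert len(fake_embedding) == expected_dim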

Run all tests:

pytest -v --tb=short

Manual Testing

Test ingestion pipeline:

# Ingest test paper
cd implementation
python -m ingestion.ingest --documents documents/LoRA-SHIFT-Final-Research-Paper.md

# Expected output:
# βœ“ Processed 1 document
# βœ“ Created ~30-50 chunks
# βœ“ Generated embeddings
# βœ“ Stored in database

Test retrieval with sample queries:

python rag_agent_advanced.py

# Try these queries:
# - "What is LoRA-SHIFT?"
# - "How does LoRA-SHIFT improve over standard LoRA?"
# - "What datasets were used in LoRA-SHIFT experiments?"
# - "What is the computational overhead of LoRA-SHIFT?"

πŸ› Troubleshooting

Quick fixes for common issues:

Setup Issues

# No module found
pip install -r implementation/requirements-advanced.txt

# Database connection failed
# Check DATABASE_URL in .env

# pgvector extension not found
sudo apt-get install postgresql-16-pgvector
psql $DATABASE_URL -c "CREATE EXTENSION IF NOT EXISTS vector;"

Ingestion Issues

# Documents not processing
# - Check file format (PDF, DOCX, MD, TXT supported)
# - Verify files exist in documents/ folder
# - Check file permissions

# Out of memory
# - Process files one at a time
# - Use smaller chunk sizes
# - Reduce batch size

Retrieval Issues

# No results returned
# - Verify data exists: SELECT COUNT(*) FROM chunks;
# - Check embeddings: SELECT COUNT(*) FROM chunks WHERE embedding IS NOT NULL;
# - Rebuild index: python -m ingestion.ingest --documents ./documents

For detailed solutions, see docs/guides/TROUBLESHOOTING.md


🀝 Contributing

This is a demonstration/educational project. Feel free to:

  • Fork and adapt for your use case
  • Report issues or suggestions
  • Share your own RAG strategy implementations

πŸ™ Acknowledgments

  • Anthropic - Contextual Retrieval methodology
  • Docling Team - HybridChunker implementation
  • Jina AI - Late chunking concept
  • Pydantic Team - Pydantic AI framework
  • Zep - Graphiti knowledge graph framework
  • Sentence Transformers - Cross-encoder models
