A comprehensive resource for understanding and implementing advanced Retrieval-Augmented Generation strategies.
This repository demonstrates 16 RAG strategies with:
- 📚 Detailed theory and research (docs/)
- 💻 Simple pseudocode examples (examples/)
- 🔧 Full code examples (implementation/)
- 📖 Student learning guide (docs/guides/STUDENT_GUIDE.md)
- 🛠️ Troubleshooting guide (docs/guides/TROUBLESHOOTING.md)
Perfect for: AI engineers, ML practitioners, data science students, and anyone building RAG systems.
- Strategy Overview
- Quick Start
- Prerequisites
- For Students & Learners ⭐ NEW
- Pseudocode Examples
- Code Examples
- Detailed Strategy Guide
- Repository Structure
- Testing ⭐ NEW
- Troubleshooting ⭐ NEW
| # | Strategy | Status | Use Case | Pros | Cons |
|---|---|---|---|---|---|
| 1 | Re-ranking | ✅ Code Example | Precision-critical | Highly accurate results | Slower, more compute |
| 2 | Agentic RAG | ✅ Code Example | Flexible retrieval needs | Autonomous tool selection | More complex logic |
| 3 | Knowledge Graphs | 📝 Pseudocode Only | Relationship-heavy | Captures connections | Infrastructure overhead |
| 4 | Contextual Retrieval | ✅ Code Example | Critical documents | 35-49% better accuracy | High ingestion cost |
| 5 | Query Expansion | ✅ Code Example | Ambiguous queries | Better recall, multiple perspectives | Extra LLM call, higher cost |
| 6 | Multi-Query RAG | ✅ Code Example | Broad searches | Comprehensive coverage | Multiple API calls |
| 7 | Context-Aware Chunking | ✅ Code Example | All documents | Semantic coherence | Slightly slower ingestion |
| 8 | Late Chunking | 📝 Pseudocode Only | Context preservation | Full document context | Requires long-context models |
| 9 | Hierarchical RAG | 📝 Pseudocode Only | Complex documents | Precision + context | Complex setup |
| 10 | Self-Reflective RAG | ✅ Code Example | Research queries | Self-correcting | Highest latency |
| 11 | Fine-tuned Embeddings | 📝 Pseudocode Only | Domain-specific | Best accuracy | Training required |
| 12 | Hybrid Retrieval | ✅ Code Example | Keyword-sensitive | Balanced recall | More complex infra |
| 13 | Fact Verification | ✅ Code Example | High-stakes domains | Traceability | Higher latency |
| 14 | Multi-hop Reasoning | ✅ Code Example | Complex questions | Solves compositional queries | Expensive |
| 15 | Uncertainty Estimation | ✅ Code Example | Risk-sensitive apps | Trustworthy outputs | More compute |
| 16 | Adaptive Chunking | ✅ Code Example | Heterogeneous docs | Better precision | Complex ingestion |
- ✅ Code Example: Full code in implementation/ (educational, not production-ready)
- Python 3.9+
- PostgreSQL with pgvector extension
Ubuntu/Debian:
sudo apt-get update && sudo apt-get install -y \
ffmpeg \
build-essential \
gcc \
postgresql-client \
libpq-dev

macOS:
brew install ffmpeg postgresql
# Xcode Command Line Tools (includes gcc, build tools)
xcode-select --install

Windows:
- ffmpeg: Download from ffmpeg.org
- PostgreSQL: Download from postgresql.org
- Build tools: Install Visual Studio Build Tools
What each package does:
- ffmpeg - Audio/video processing for Whisper transcription
- build-essential & gcc - Compilers for building Python packages (psycopg2, etc.)
- postgresql-client - PostgreSQL command-line tools (psql)
- libpq-dev - PostgreSQL development headers for psycopg2
- OpenAI API Key for embeddings and LLM
- Get from: platform.openai.com/api-keys
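Optional sanity check: before running the quick start, a short script like the one below can confirm that PostgreSQL (with pgvector) and the OpenAI API key are reachable. This is a sketch rather than a file in the repository; it assumes DATABASE_URL and OPENAI_API_KEY are set in your environment.

```python
# Hypothetical pre-flight check (not included in the repo)
import os
import psycopg2
from openai import OpenAI

# 1. Database: connect and confirm the pgvector extension is installed
conn = psycopg2.connect(os.environ["DATABASE_URL"])
with conn.cursor() as cur:
    cur.execute("SELECT extname FROM pg_extension WHERE extname = 'vector';")
    print("pgvector installed:", cur.fetchone() is not None)
conn.close()

# 2. OpenAI: request one embedding to verify the API key works
client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.embeddings.create(model="text-embedding-3-small", input="ping")
print("embedding dimensions:", len(resp.data[0].embedding))
```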
- 📝 Pseudocode Only: Conceptual examples in examples/
cd examples
# Browse simple, < 50 line examples for each strategy
cat 01_reranking.py

Try strategies side-by-side with a web UI!
cd implementation
# Install dependencies
pip install -r requirements-advanced.txt
# Setup environment
cp .env.example .env
# Edit .env: Add DATABASE_URL and OPENAI_API_KEY
# Initialize database
psql $DATABASE_URL < sql/schema.sql
# Ingest sample documents
python -m ingestion.ingest --documents ./documents
# Launch Streamlit app
streamlit run app.py

The app opens at http://localhost:8501
Features:
- 🧪 Strategy Lab: Compare up to 3 strategies side-by-side
- 📊 Visual Metrics: See latency, tokens, and costs for each approach
- 📁 File Upload: Test with your own documents
- 💡 Educational: Tooltips explain each strategy's trade-offs
Quick guide: docs/implementation/QUICK_START.md
Note: These are educational examples to show how strategies work in real code. Not guaranteed to be fully functional or production-ready.
cd implementation
# Install dependencies (if not done above)
pip install -r requirements-advanced.txt
# Setup environment
cp .env.example .env
# Edit .env: Add DATABASE_URL and OPENAI_API_KEY
# Ingest documents
python -m ingestion.ingest --documents ./documents --chunker adaptive
# Run the advanced agent
python rag_agent_advanced.py

New to RAG? Start here!
This repository includes comprehensive learning resources:
docs/guides/STUDENT_GUIDE.md - Your complete learning path:
- Structured 9-week curriculum from beginner to advanced
- Core concepts explained with examples
- Practical exercises and project ideas
- Common pitfalls and how to avoid them
- Production deployment considerations
Quick learning path:
Week 1-2: Basics (chunking, embeddings, vector search)
Week 3-4: Query enhancement (expansion, multi-query)
Week 5-6: Advanced retrieval (hybrid, reranking, self-reflective)
Week 7-8: Generation enhancement (fact verification, multi-hop)
Week 9+: Specialized topics (knowledge graphs, fine-tuning)
docs/guides/TROUBLESHOOTING.md - Solutions to common issues:
- Setup problems (dependencies, database, API keys)
- Ingestion errors (file processing, embeddings, memory)
- Retrieval issues (no results, low relevance, slow queries)
- Agent problems (hallucinations, tool calling)
- Testing and debugging
LoRA-SHIFT Research Paper - A comprehensive test document:
- 19,000+ characters of technical content
- Structured research paper format
- Perfect for testing RAG strategies
- Includes abstract, methodology, results, and appendices
All strategies have simple, working pseudocode examples in examples/.
Each file is < 50 lines and demonstrates:
- Core concept
- How to implement with Pydantic AI
- Integration with PG Vector
Example (05_query_expansion.py):
from pydantic_ai import Agent
import psycopg2
from pgvector.psycopg2 import register_vector
agent = Agent('openai:gpt-4o', system_prompt='RAG assistant with query expansion')
@agent.tool
def expand_query(query: str) -> list[str]:
"""Expand single query into multiple variations"""
expansion_prompt = f"Generate 3 variations of: '{query}'"
variations = llm_generate(expansion_prompt)
return [query] + variations
@agent.tool
def search_knowledge_base(queries: list[str]) -> str:
"""Search vector DB with multiple queries"""
all_results = []
for query in queries:
query_embedding = get_embedding(query)
results = db.query('SELECT * FROM chunks ORDER BY embedding <=> %s', query_embedding)
all_results.extend(results)
return deduplicate(all_results)

Browse all pseudocode: examples/README.md
⚠️ Important Note: The implementation/ folder contains educational code examples based on a real implementation; it is not production-ready. The strategies were added to demonstrate concepts and show how they work in real code. They are not guaranteed to be fully working, and keeping all strategies in one codebase is not ideal (which is why I haven't refined this for production use). Use these as learning references and starting points for your own implementations. Think of this as an "off-the-shelf RAG implementation" with strategies added for demonstration purposes, and use it as inspiration for your own production systems.
implementation/
├── rag_agent_advanced.py          # Agent with all strategy examples
├── ingestion/
│   ├── ingest.py                  # Document ingestion pipeline
│   ├── chunker.py                 # Context-aware chunking (Docling)
│   ├── embedder.py                # OpenAI embeddings
│   └── contextual_enrichment.py   # Anthropic's contextual retrieval
├── utils/
│   ├── db_utils.py                # Database utilities
│   └── models.py                  # Pydantic models
└── IMPLEMENTATION_GUIDE.md        # Detailed implementation reference
Tech Stack:
- Pydantic AI - Agent framework
- PostgreSQL + pgvector - Vector search
- Docling - Hybrid chunking
- OpenAI - Embeddings and LLM
Status: ✅ Code Example
File: rag_agent_advanced.py (Lines 194-256)
Two-stage retrieval: vector search (20-50+ candidates) → re-ranking model filters to the top 5.
✅ Significantly better precision, more knowledge considered without overwhelming the LLM
❌ Slightly slower than pure vector search, uses more compute
# Lines 194-256 in rag_agent_advanced.py
async def search_with_reranking(ctx: RunContext[None], query: str, limit: int = 5) -> str:
"""Two-stage retrieval with cross-encoder re-ranking."""
initialize_reranker() # Loads cross-encoder/ms-marco-MiniLM-L-6-v2
# Stage 1: Fast vector retrieval (retrieve 20 candidates)
candidate_limit = min(limit * 4, 20)
results = await vector_search(query, candidate_limit)
# Stage 2: Re-rank with cross-encoder
pairs = [[query, row['content']] for row in results]
scores = reranker.predict(pairs)
# Sort by new scores and return top N
reranked = sorted(zip(results, scores), key=lambda x: x[1], reverse=True)[:limit]
return format_results(reranked)

Model: cross-encoder/ms-marco-MiniLM-L-6-v2
See:
- Full guide: IMPLEMENTATION_GUIDE.md
- Pseudocode: 01_reranking.py
- Research: docs/01-reranking.md
Status: ✅ Code Example
Files: rag_agent_advanced.py (Lines 263-354)
Agent autonomously chooses between multiple retrieval tools, for example:
- search_knowledge_base() - Semantic search over chunks (can include hybrid search: dense vector + sparse keyword/BM25)
- retrieve_full_document() - Pull entire documents when chunks aren't enough

Note: Hybrid search (combining dense vector embeddings with sparse keyword search like BM25) is typically implemented as part of the agentic retrieval strategy, giving the agent access to both semantic similarity and keyword matching. A sketch of such a tool appears after the example flow below.
✅ Flexible, adapts to query needs automatically
❌ More complex, less predictable behavior
# Tool 1: Semantic search (Lines 263-305)
@agent.tool
async def search_knowledge_base(query: str, limit: int = 5) -> str:
"""Standard semantic search over document chunks."""
query_embedding = await embedder.embed_query(query)
results = await db.match_chunks(query_embedding, limit)
return format_results(results)
# Tool 2: Full document retrieval (Lines 308-354)
@agent.tool
async def retrieve_full_document(document_title: str) -> str:
"""Retrieve complete document when chunks lack context."""
result = await db.query(
"SELECT title, content FROM documents WHERE title ILIKE %s",
f"%{document_title}%"
)
return f"**{result['title']}**\n\n{result['content']}"Example Flow:
User: "What's the full refund policy?"
Agent:
1. Calls search_knowledge_base("refund policy")
2. Finds chunks mentioning "refund_policy.pdf"
3. Calls retrieve_full_document("refund policy")
4. Returns complete document
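As noted above, hybrid search usually lives inside the agentic toolset rather than as a separate pipeline. Below is a minimal sketch of one way to fuse dense vector similarity with Postgres full-text keyword ranking in a single tool. It reuses the pseudocode helpers from the surrounding examples (embedder, db.fetch, format_results); the SQL and the 0.7/0.3 weighting are illustrative assumptions, not the repository's implementation.

```python
# Hypothetical hybrid search tool: dense vector + keyword rank fusion (sketch)
@agent.tool
async def hybrid_search(query: str, limit: int = 5) -> str:
    """Combine semantic similarity (pgvector) with keyword matching (full-text)."""
    query_embedding = await embedder.embed_query(query)

    # Score each chunk with a weighted sum of cosine similarity and ts_rank
    rows = await db.fetch(
        """
        SELECT chunk_id, content
        FROM chunks
        ORDER BY 0.7 * (1 - (embedding <=> $1::vector))
               + 0.3 * ts_rank(to_tsvector('english', content),
                               plainto_tsquery('english', $2)) DESC
        LIMIT $3
        """,
        query_embedding, query, limit,
    )
    return format_results(rows)
```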
See:
- Full guide: IMPLEMENTATION_GUIDE.md
- Pseudocode: 02_agentic_rag.py
- Research: docs/02-agentic-rag.md
Status: 📝 Pseudocode Only (Graphiti)
Why not in code examples: Requires Neo4j infrastructure, entity extraction
Combines vector search with graph databases (such as Neo4j/FalkorDB) to capture entity relationships.
✅ Captures relationships vectors miss, great for interconnected data
❌ Requires Neo4j setup, entity extraction, graph maintenance, slower and more expensive
# From 03_knowledge_graphs.py (with Graphiti)
from graphiti_core import Graphiti
from graphiti_core.nodes import EpisodeType
# Initialize Graphiti (connects to Neo4j)
graphiti = Graphiti("neo4j://localhost:7687", "neo4j", "password")
async def ingest_document(text: str, source: str):
"""Ingest document into Graphiti knowledge graph."""
# Graphiti automatically extracts entities and relationships
await graphiti.add_episode(
name=source,
episode_body=text,
source=EpisodeType.text,
source_description=f"Document: {source}"
)
@agent.tool
async def search_knowledge_graph(query: str) -> str:
"""Hybrid search: semantic + keyword + graph traversal."""
# Graphiti combines:
# - Semantic similarity (embeddings)
# - BM25 keyword search
# - Graph structure traversal
# - Temporal context (when was this true?)
results = await graphiti.search(query=query, num_results=5)
return format_graph_results(results)

Framework: Graphiti from Zep - Temporal knowledge graphs for agents
See:
- Pseudocode: 03_knowledge_graphs.py
- Research: docs/03-knowledge-graphs.md
Status: ✅ Code Example (Optional)
File: ingestion/contextual_enrichment.py (Lines 41-89)
Anthropic's method: Adds document-level context to each chunk before embedding. LLM generates 1-2 sentences explaining what the chunk discusses in relation to the whole document.
✅ 35-49% reduction in retrieval failures, chunks are self-contained
❌ Expensive (1 LLM call per chunk), slower ingestion
BEFORE:
"Clean data is essential. Remove duplicates, handle missing values..."
AFTER:
"This chunk from 'ML Best Practices' discusses data preparation techniques
for machine learning workflows.
Clean data is essential. Remove duplicates, handle missing values..."
# Lines 41-89 in contextual_enrichment.py
async def enrich_chunk(chunk: str, document: str, title: str) -> str:
"""Add contextual prefix to a chunk."""
prompt = f"""<document>
Title: {title}
{document[:4000]}
</document>
<chunk>
{chunk}
</chunk>
Provide brief context explaining what this chunk discusses.
Format: "This chunk from [title] discusses [explanation]." """
response = await client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
temperature=0,
max_tokens=150
)
context = response.choices[0].message.content.strip()
return f"{context}\n\n{chunk}"Enable with: python -m ingestion.ingest --documents ./docs --contextual
See:
- Full guide: IMPLEMENTATION_GUIDE.md
- Pseudocode: 04_contextual_retrieval.py
- Research: docs/04-contextual-retrieval.md
Status: ✅ Code Example
File: rag_agent_advanced.py (Lines 72-107)
Expands a single brief query into a more detailed, comprehensive version by adding context, related terms, and clarifying intent. Uses an LLM with a system prompt that describes how to enrich the query while maintaining the original intent.
Example:
- Input: "What is RAG?"
- Output: "What is Retrieval-Augmented Generation (RAG), how does it combine information retrieval with language generation, what are its key components and architecture, and what advantages does it provide for question-answering systems?"
✅ Improved retrieval precision by adding relevant context and specificity
❌ Extra LLM call adds latency, may over-specify simple queries
# Query expansion using system prompt to guide enrichment
async def expand_query(ctx: RunContext[None], query: str) -> str:
"""Expand a brief query into a more detailed, comprehensive version."""
system_prompt = """You are a query expansion assistant. Take brief user queries and expand them into more detailed, comprehensive versions that:
1. Add relevant context and clarifications
2. Include related terminology and concepts
3. Specify what aspects should be covered
4. Maintain the original intent
5. Keep it as a single, coherent question
Expand the query to be 2-3x more detailed while staying focused."""
response = await client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": f"Expand this query: {query}"}
],
temperature=0.3
)
expanded_query = response.choices[0].message.content.strip()
return expanded_query  # Returns ONE enhanced query

Note: This strategy returns ONE enriched query. For generating multiple query variations, see Multi-Query RAG (Strategy 6).
See:
- Full guide: IMPLEMENTATION_GUIDE.md
- Pseudocode: 05_query_expansion.py
- Research: docs/05-query-expansion.md
Status: ✅ Code Example
File: rag_agent_advanced.py (Lines 114-187)
Generates multiple different query variations/perspectives with an LLM (e.g., 3-4 variations), runs all searches concurrently, and deduplicates results. Unlike Query Expansion which enriches ONE query, this creates MULTIPLE distinct phrasings to capture different angles.
✅ Comprehensive coverage, better recall on ambiguous queries
❌ 4x database queries (though parallelized), higher cost
# Lines 114-187 in rag_agent_advanced.py
async def search_with_multi_query(query: str, limit: int = 5) -> str:
"""Search using multiple query variations in parallel."""
# Generate variations
queries = await expand_query_variations(query) # Returns 4 queries
# Execute all searches in parallel
search_tasks = []
for q in queries:
query_embedding = await embedder.embed_query(q)
task = db.fetch("SELECT * FROM match_chunks($1::vector, $2)", query_embedding, limit)
search_tasks.append(task)
results_lists = await asyncio.gather(*search_tasks)
# Deduplicate by chunk ID, keep highest similarity
seen = {}
for results in results_lists:
for row in results:
if row['chunk_id'] not in seen or row['similarity'] > seen[row['chunk_id']]['similarity']:
seen[row['chunk_id']] = row
return format_results(sorted(seen.values(), key=lambda x: x['similarity'], reverse=True)[:limit])

Key Features:
- Parallel execution with asyncio.gather()
- Smart deduplication (keeps best score per chunk)
See:
- Full guide: IMPLEMENTATION_GUIDE.md
- Pseudocode: 06_multi_query_rag.py
- Research: docs/06-multi-query-rag.md
Status: ✅ Code Example (Default)
File: ingestion/chunker.py (Lines 70-102)
Intelligent document splitting that uses semantic similarity and document structure analysis to find natural chunk boundaries, rather than naive fixed-size splitting. This approach:
- Analyzes document structure (headings, sections, paragraphs, tables)
- Uses semantic analysis to identify topic boundaries
- Respects linguistic coherence within chunks
- Preserves hierarchical context (e.g., heading information)
Implementation Example: Docling's HybridChunker demonstrates this strategy through:
- Token-aware chunking (uses actual tokenizer, not estimates)
- Document structure preservation
- Semantic coherence
- Heading context inclusion
✅ Free, fast, maintains document structure
❌ Slightly more complex than naive chunking
# Lines 70-102 in chunker.py
from docling.chunking import HybridChunker
from transformers import AutoTokenizer
class DoclingHybridChunker:
def __init__(self, config: ChunkingConfig):
# Initialize tokenizer for token-aware chunking
self.tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
# Create HybridChunker
self.chunker = HybridChunker(
tokenizer=self.tokenizer,
max_tokens=config.max_tokens,
merge_peers=True # Merge small adjacent chunks
)
async def chunk_document(self, docling_doc: DoclingDocument) -> List[DocumentChunk]:
# Use HybridChunker to chunk the DoclingDocument
chunks = list(self.chunker.chunk(dl_doc=docling_doc))
# Contextualize each chunk (includes heading hierarchy)
for chunk in chunks:
contextualized_text = self.chunker.contextualize(chunk=chunk)
# Store contextualized text as chunk content

Enabled by default during ingestion
See:
- Full guide: IMPLEMENTATION_GUIDE.md
- Pseudocode: 07_context_aware_chunking.py
- Research: docs/07-context-aware-chunking.md
Status: 📝 Pseudocode Only
Why not in code examples: Docling HybridChunker provides similar benefits
Embed the full document through transformer first, then chunk the token embeddings (not the text). Preserves full document context in each chunk's embedding.
✅ Maintains full document context, leverages long-context models
❌ More complex than standard chunking
# From 08_late_chunking.py
def late_chunk(text: str, chunk_size=512) -> list:
"""Process full document through transformer BEFORE chunking."""
# Step 1: Embed entire document (up to 8192 tokens)
full_doc_token_embeddings = transformer_embed(text) # Token-level embeddings
# Step 2: Define chunk boundaries
tokens = text.split()
chunk_boundaries = range(0, len(tokens), chunk_size)
# Step 3: Pool token embeddings for each chunk
chunks_with_embeddings = []
for start in chunk_boundaries:
end = start + chunk_size
chunk_text = ' '.join(tokens[start:end])
# Mean pool the token embeddings (preserves full doc context!)
chunk_embedding = mean_pool(full_doc_token_embeddings[start:end])
chunks_with_embeddings.append((chunk_text, chunk_embedding))
return chunks_with_embeddings

Alternative: Use Context-Aware Chunking (Docling) + Contextual Retrieval for similar benefits
See:
- Pseudocode: 08_late_chunking.py
- Research: docs/08-late-chunking.md
Status: 📝 Pseudocode Only
Why not in code examples: Agentic RAG achieves similar goals for this demo
Parent-child chunk relationships: Search small chunks for precision, return large parent chunks for context.
Metadata Enhancement: Can store metadata like section_type ("summary", "table", "detail") and heading_path to intelligently decide when to return just the child vs. the parent, or to include heading context.
✅ Balances precision (search small) with context (return big)
❌ Requires parent-child database schema
# From 09_hierarchical_rag.py
def ingest_hierarchical(document: str, doc_title: str):
"""Create parent-child chunk structure with simple metadata."""
parent_chunks = [document[i:i+2000] for i in range(0, len(document), 2000)]
for parent_id, parent in enumerate(parent_chunks):
# Store parent with metadata (section type, heading)
metadata = {"heading": f"{doc_title} - Section {parent_id}", "type": "detail"}
db.execute("INSERT INTO parent_chunks (id, content, metadata) VALUES (%s, %s, %s)",
(parent_id, parent, metadata))
# Children: Small chunks with parent_id
child_chunks = [parent[j:j+500] for j in range(0, len(parent), 500)]
for child in child_chunks:
embedding = get_embedding(child)
db.execute(
"INSERT INTO child_chunks (content, embedding, parent_id) VALUES (%s, %s, %s)",
(child, embedding, parent_id)
)
@agent.tool
def hierarchical_search(query: str) -> str:
"""Search children, return parents with heading context."""
query_emb = get_embedding(query)
# Find matching children and their parent metadata
results = db.query(
"""SELECT p.content, p.metadata
FROM child_chunks c
JOIN parent_chunks p ON c.parent_id = p.id
ORDER BY c.embedding <=> %s LIMIT 3""",
query_emb
)
# Return parents with heading context
return "\n\n".join([f"[{r['metadata']['heading']}]\n{r['content']}" for r in results])Alternative: Use Agentic RAG (semantic search + full document retrieval) for similar flexibility
See:
- Pseudocode: 09_hierarchical_rag.py
- Research: docs/09-hierarchical-rag.md
Status: ✅ Code Example
File: rag_agent_advanced.py (Lines 361-482)
Self-correcting search loop:
- Perform initial search
- LLM grades relevance (1-5 scale)
- If score < 3, refine query and search again
✅ Self-correcting, improves over time
❌ Highest latency (2-3 LLM calls), most expensive
# Lines 361-482 in rag_agent_advanced.py
async def search_with_self_reflection(query: str, limit: int = 5) -> str:
"""Self-reflective search: evaluate and refine if needed."""
# Initial search
results = await vector_search(query, limit)
# Grade relevance
grade_prompt = f"""Query: {query}
Retrieved: {results[:200]}...
Grade relevance 1-5. Respond with number only."""
grade_response = await client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": grade_prompt}],
temperature=0
)
grade_score = int(grade_response.choices[0].message.content.split()[0])
# If low relevance, refine and re-search
if grade_score < 3:
refine_prompt = f"""Query "{query}" returned low-relevance results.
Suggest improved query. Respond with query only."""
refined_query = await client.chat.completions.create(...)
results = await vector_search(refined_query, limit)
note = f"[Refined from '{query}' to '{refined_query}']"
return format_results(results, note)

See:
- Full guide: IMPLEMENTATION_GUIDE.md
- Pseudocode: 10_self_reflective_rag.py
- Research: docs/10-self-reflective-rag.md
Status: 📝 Pseudocode Only
Why not in code examples: Requires domain-specific training data and infrastructure
Train embedding models on domain-specific query-document pairs to improve retrieval accuracy for specialized domains (medical, legal, financial, etc.).
✅ 5-10% accuracy gains, smaller models can outperform larger generic ones
❌ Requires training data, infrastructure, ongoing maintenance
# From 11_fine_tuned_embeddings.py
from sentence_transformers import SentenceTransformer
def prepare_training_data():
"""Create domain-specific query-document pairs."""
return [
("What is EBITDA?", "financial_doc_about_ebitda.txt"),
("Explain capital expenditure", "capex_explanation.txt"),
# ... thousands more domain-specific pairs
]
def fine_tune_model():
"""Fine-tune on domain data (one-time process)."""
base_model = SentenceTransformer('all-MiniLM-L6-v2')
training_data = prepare_training_data()
# Train with MultipleNegativesRankingLoss
fine_tuned_model = base_model.fit(
training_data,
epochs=3,
loss=MultipleNegativesRankingLoss()
)
fine_tuned_model.save('./fine_tuned_model')
# Load fine-tuned model for embeddings
embedding_model = SentenceTransformer('./fine_tuned_model')
def get_embedding(text: str):
"""Use fine-tuned model for embeddings."""
return embedding_model.encode(text)

Alternative: Use high-quality generic models (OpenAI text-embedding-3-small) and Contextual Retrieval
See:
- Pseudocode: 11_fine_tuned_embeddings.py
- Research: docs/11-fine-tuned-embeddings.md
| Strategy | Speed | Cost | Quality | Status |
|---|---|---|---|---|
| Simple Chunking | ⚡⚡⚡ | $ | ⭐⭐ | ✅ Available |
| Context-Aware (Docling) | ⚡⚡ | $ | ⭐⭐⭐⭐ | ✅ Default |
| Contextual Enrichment | ⚡ | $$$ | ⭐⭐⭐⭐⭐ | ✅ Optional |
| Late Chunking | ⚡⚡ | $ | ⭐⭐⭐⭐ | 📝 Pseudocode |
| Hierarchical | ⚡⚡ | $ | ⭐⭐⭐⭐ | 📝 Pseudocode |
| Strategy | Latency | Cost | Precision | Recall | Status |
|---|---|---|---|---|---|
| Standard Search | ⚡⚡⚡ | $ | ⭐⭐⭐ | ⭐⭐⭐ | ✅ Default |
| Query Expansion | ⚡⚡ | $$ | ⭐⭐⭐ | ⭐⭐⭐⭐ | ✅ Multi-Query |
| Multi-Query | ⚡⚡ | $$ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ✅ Code Example |
| Re-ranking | ⚡⚡ | $$ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ✅ Code Example |
| Agentic | ⚡⚡ | $$ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ✅ Code Example |
| Self-Reflective | ⚡ | $$$ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ✅ Code Example |
| Knowledge Graphs | ⚡⚡ | $$$ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 📝 Pseudocode |
all-rag-strategies/
├── README.md                          # This file
├── docs/                              # Detailed research (theory + use cases)
│   ├── 01-reranking.md
│   ├── 02-agentic-rag.md
│   ├── ... (all 11 strategies)
│   └── 11-fine-tuned-embeddings.md
│
├── examples/                          # Simple < 50 line examples
│   ├── 01_reranking.py
│   ├── 02_agentic_rag.py
│   ├── ... (all 11 strategies)
│   ├── 11_fine_tuned_embeddings.py
│   └── README.md
│
└── implementation/                    # Educational code examples (NOT production)
    ├── rag_agent.py                   # Basic agent (single tool)
    ├── rag_agent_advanced.py          # Advanced agent (all strategies)
    ├── ingestion/
    │   ├── ingest.py                  # Main ingestion pipeline
    │   ├── chunker.py                 # Docling HybridChunker
    │   ├── embedder.py                # OpenAI embeddings
    │   └── contextual_enrichment.py   # Anthropic's contextual retrieval
    ├── utils/
    │   ├── db_utils.py
    │   └── models.py
    ├── IMPLEMENTATION_GUIDE.md        # Exact line numbers + code
    ├── STRATEGIES.md                  # Detailed strategy documentation
    └── requirements-advanced.txt
| Component | Technology | Purpose |
|---|---|---|
| Agent Framework | Pydantic AI | Type-safe agents with tool calling |
| Vector Database | PostgreSQL + pgvector via Neon | Vector similarity search (Neon used for demonstrations) |
| Document Processing | Docling | Hybrid chunking + multi-format |
| Embeddings | OpenAI text-embedding-3-small | 1536-dim embeddings |
| Re-ranking | sentence-transformers | Cross-encoder for precision |
| LLM | OpenAI GPT-4o-mini | Query expansion, grading, refinement |
- Implementation Details: implementation/IMPLEMENTATION_GUIDE.md
- Strategy Theory: docs/ (11 detailed docs)
- Code Examples: examples/README.md
- Anthropic's Contextual Retrieval: https://www.anthropic.com/news/contextual-retrieval
- Graphiti (Knowledge Graphs): https://github.com/getzep/graphiti
- Pydantic AI Docs: https://ai.pydantic.dev/
The repository includes comprehensive test suites:
LoRA-SHIFT Paper Ingestion Tests:
cd implementation
pytest test_lora_shift_ingestion.py -v

Test Coverage:
- ✅ Paper structure validation (17 tests)
- ✅ Chunking logic verification
- ✅ Embedding dimension checks
- ✅ Retrieval query validation
- ✅ Error handling scenarios
- ✅ Metadata extraction
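For orientation, a paper-structure test could look roughly like the sketch below. The real assertions live in test_lora_shift_ingestion.py and may differ; the file path and section names come from this README, while the test names are made up.

```python
# Hypothetical sketch of a structure-validation test (not the repository's actual tests)
from pathlib import Path
import pytest

PAPER = Path("documents/LoRA-SHIFT-Final-Research-Paper.md")

@pytest.mark.parametrize("section", ["Abstract", "Methodology", "Results"])
def test_paper_contains_section(section):
    # The test paper follows a structured research-paper format
    assert section in PAPER.read_text(encoding="utf-8")

def test_paper_is_long_enough_to_chunk():
    # The README describes the paper as 19,000+ characters of technical content
    assert len(PAPER.read_text(encoding="utf-8")) > 19_000
```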
Run all tests:
pytest -v --tb=short

Test ingestion pipeline:
# Ingest test paper
cd implementation
python -m ingestion.ingest --documents documents/LoRA-SHIFT-Final-Research-Paper.md
# Expected output:
# ✅ Processed 1 document
# ✅ Created ~30-50 chunks
# ✅ Generated embeddings
# ✅ Stored in database

Test retrieval with sample queries:
python rag_agent_advanced.py
# Try these queries:
# - "What is LoRA-SHIFT?"
# - "How does LoRA-SHIFT improve over standard LoRA?"
# - "What datasets were used in LoRA-SHIFT experiments?"
# - "What is the computational overhead of LoRA-SHIFT?"Quick fixes for common issues:
# No module found
pip install -r implementation/requirements-advanced.txt
# Database connection failed
# Check DATABASE_URL in .env
# pgvector extension not found
sudo apt-get install postgresql-16-pgvector
psql $DATABASE_URL -c "CREATE EXTENSION IF NOT EXISTS vector;"

# Documents not processing
# - Check file format (PDF, DOCX, MD, TXT supported)
# - Verify files exist in documents/ folder
# - Check file permissions
# Out of memory
# - Process files one at a time
# - Use smaller chunk sizes
# - Reduce batch size

# No results returned
# - Verify data exists: SELECT COUNT(*) FROM chunks;
# - Check embeddings: SELECT COUNT(*) FROM chunks WHERE embedding IS NOT NULL;
# - Rebuild index: python -m ingestion.ingest --documents ./documents

For detailed solutions, see docs/guides/TROUBLESHOOTING.md
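If running SQL by hand is inconvenient, the same checks can be scripted in Python. A minimal sketch, assuming the chunks table from sql/schema.sql and a DATABASE_URL environment variable:

```python
# Hypothetical diagnostic for "no results returned" (sketch, not part of the repo)
import os
import psycopg2

conn = psycopg2.connect(os.environ["DATABASE_URL"])
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM chunks;")
    total = cur.fetchone()[0]
    cur.execute("SELECT COUNT(*) FROM chunks WHERE embedding IS NOT NULL;")
    embedded = cur.fetchone()[0]
conn.close()

print(f"chunks: {total}, with embeddings: {embedded}")
if total == 0:
    print("No chunks found - re-run: python -m ingestion.ingest --documents ./documents")
```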
This is a demonstration/education project. Feel free to:
- Fork and adapt for your use case
- Report issues or suggestions
- Share your own RAG strategy implementations
- Anthropic - Contextual Retrieval methodology
- Docling Team - HybridChunker implementation
- Jina AI - Late chunking concept
- Pydantic Team - Pydantic AI framework
- Zep - Graphiti knowledge graph framework
- Sentence Transformers - Cross-encoder models