An end-to-end, table-aware document ingestion pipeline for Retrieval-Augmented Generation (RAG) systems that separates narrative text from structured tables and stores each in a specialized database.
- Overview
- Architecture
- Features
- Requirements
- Installation
- Quick Start
- Querying Your RAG System
- Testing the RAG System
- Viewing Stored Data
- Viewing Parsed Content (Before Database Storage)
- Understanding the Output
- Parsing Strategies
- Project Structure
- Module Descriptions
- Configuration
- Pipeline Workflow
- Database Maintenance
- Docker Management
- Troubleshooting
- Contributing
- License
- Support
- Acknowledgments
This project provides a production-ready document processing system that:
- Extracts tables and narrative text separately from PDF documents
- Stores tables in MongoDB (NoSQL) for exact lookups and structured queries
- Generates embeddings and stores text in Qdrant (Vector DB) for semantic search
- Provides a complete orchestration pipeline with Docker containerization
- Supports migration to Microsoft Fabric (Real-Time Intelligence/Cosmos DB)
PDF Document
↓
Document Parser (ingest.py)
├──> 📊 Tables → MongoDB (localhost:27017)
│ └─> Web UI: Mongo Express (localhost:8081)
│
└──> 📝 Text → Embeddings (embedding.py)
└─> sentence-transformers (384-dim vectors)
└─> Qdrant Vector DB (localhost:6333)
└─> Web UI (localhost:6334)
Orchestration: run_pipeline.py
Databases: docker-compose.yml
Query Interface:
↓
User Question → QueryEngine (query.py)
├──> Vector Search → Qdrant (semantic similarity)
├──> Table Search → MongoDB (filtered by source file)
└──> Context + Question → Ollama (local LLM)
└─> Answer
CLI Tool: ask.py "Your question here"
graph TB
subgraph "Data Ingestion Pipeline"
PDF[📄 PDF Document]
PARSER[Document Parser<br/>ingest.py]
PDF --> PARSER
PARSER -->|Tables<br/>HTML Format| MONGO[(MongoDB<br/>localhost:27017)]
PARSER -->|Text Chunks| EMBED[Embedding Model<br/>embedding.py]
EMBED -->|384-dim vectors<br/>sentence-transformers| QDRANT[(Qdrant Vector DB<br/>localhost:6333)]
MONGO -.->|Web UI| MONGOUI[Mongo Express<br/>localhost:8081]
QDRANT -.->|Web UI| QDRANTUI[Qdrant UI<br/>localhost:6334]
end
subgraph "Query Pipeline"
USER[👤 User Question]
CLI[ask.py CLI Tool]
ENGINE[QueryEngine<br/>query.py]
USER --> CLI
CLI --> ENGINE
ENGINE -->|1. Vector Search<br/>Semantic Similarity| QDRANT
ENGINE -->|2. Table Search<br/>Filtered by Source| MONGO
ENGINE -->|3. Combined Context<br/>+ Question| OLLAMA[🤖 Ollama<br/>Local LLM<br/>llama3:8b]
OLLAMA -->|Generated Answer| ANSWER[💬 Answer]
end
subgraph "Orchestration & Setup"
PIPELINE[run_pipeline.py<br/>End-to-end orchestration]
DOCKER[docker-compose.yml<br/>Database containers]
DOCKER -.->|Manages| MONGO
DOCKER -.->|Manages| QDRANT
end
style PDF fill:#e1f5ff
style MONGO fill:#4caf50,color:#fff
style QDRANT fill:#2196f3,color:#fff
style OLLAMA fill:#ff9800,color:#fff
style ANSWER fill:#4caf50,color:#fff
style ENGINE fill:#9c27b0,color:#fff
flowchart LR
subgraph Input
PDF[📄 PDF Files<br/>in data/]
end
subgraph Processing["Document Processing (ingest.py)"]
UNSTRUCTURED[unstructured.io<br/>PDF Parser]
FILTER[Table Filter<br/>Remove duplicates]
PDF --> UNSTRUCTURED
UNSTRUCTURED -->|Elements| FILTER
end
subgraph "Table Pipeline"
TABLES[📊 Table Elements<br/>HTML Format]
FILTER -->|Tables| TABLES
TABLES --> MONGO[(MongoDB<br/>document_tables<br/>collection)]
end
subgraph "Text Pipeline"
TEXTS[📝 Text Chunks<br/>Filtered clean text]
EMBED[sentence-transformers<br/>all-MiniLM-L6-v2]
VECTORS[384-dim Vectors]
FILTER -->|Text| TEXTS
TEXTS --> EMBED
EMBED --> VECTORS
VECTORS --> QDRANT[(Qdrant<br/>document_chunks<br/>collection)]
end
style MONGO fill:#4caf50,color:#fff
style QDRANT fill:#2196f3,color:#fff
style FILTER fill:#ff9800,color:#fff
sequenceDiagram
participant User
participant CLI as ask.py
participant Engine as QueryEngine
participant Embedder as Embedding Model
participant Qdrant as Qdrant DB
participant Mongo as MongoDB
participant LLM as Ollama (llama3:8b)
User->>CLI: "What was Q4 revenue?"
CLI->>Engine: ask(question)
Note over Engine: Step 1: Vector Search
Engine->>Embedder: Embed question
Embedder-->>Engine: Question vector
Engine->>Qdrant: Search similar chunks (top 3)
Qdrant-->>Engine: Relevant text chunks + source file
Note over Engine: Step 2: Table Retrieval
Engine->>Mongo: Find tables (filtered by source)
Mongo-->>Engine: Relevant tables (up to 5)
Note over Engine: Step 3: Format Context
Engine->>Engine: Convert HTML tables to markdown
Engine->>Engine: Format text chunks
Engine->>Engine: Build LLM prompt
Note over Engine: Step 4: Generate Answer
Engine->>LLM: Prompt with context + question<br/>(temperature=0.0)
LLM-->>Engine: Generated answer
Engine-->>CLI: Answer text
CLI-->>User: Display answer
erDiagram
MONGODB ||--o{ TABLE : stores
QDRANT ||--o{ CHUNK : stores
TABLE {
string table_id
string content
string content_type
string source_filename
dict metadata
}
CHUNK {
uuid id
array vector_384
string text
string source_filename
dict metadata
}
- Smart Table Detection: Automatically identifies and extracts tables from PDFs
- Multiple Parsing Strategies: Choose between 'auto', 'fast', or 'hi_res' for different accuracy/speed tradeoffs
- Flexible Output Formats: Extract tables as HTML or plain text
- Batch Processing: Process entire directories of PDFs at once
- Vector Embeddings: Convert text to 384-dimensional vectors using sentence-transformers
- Dual Database Storage: MongoDB for structured tables, Qdrant for vector search
- Web UIs: Visual interfaces for MongoDB (Mongo Express) and Qdrant
- Docker Containerization: One-command database setup
- Full Pipeline Orchestration: End-to-end processing with run_pipeline.py
- Detailed Metadata: Capture page numbers, coordinates, and file information
- RAG Query Interface: Ask questions and get answers using local LLM (Ollama)
- Hybrid Retrieval: Combines vector search and table lookups for comprehensive answers
- 100% Private: Uses local Ollama models - no data sent to external APIs
- Comprehensive Testing: Sample PDFs and 20+ test cases to validate RAG performance
- Migration Ready: Designed for easy migration to Microsoft Fabric
- Deterministic Query Results
- Added configurable LLM temperature (default: 0.0 for consistency)
- Same question now always produces the same answer
- Critical for testing and production reliability
- Intelligent Table Filtering
- Automatically filters duplicate table data from text chunks
- Prevents duplicated table content from confusing the LLM
- Generic solution works for any PDF
- Temperature Presets
- TEMPERATURE_DETERMINISTIC (0.0) - Default for factual Q&A
- TEMPERATURE_BALANCED (0.3) - Slight variation while staying factual
- TEMPERATURE_CREATIVE (0.8) - For creative tasks
- Test Suite Reliability
- All 20 tests now pass consistently
- Non-deterministic failures eliminated
- Reproducible results for debugging
Usage:
from src.query.query import QueryEngine
# Default: Deterministic mode (recommended)
engine = QueryEngine()
# Or choose a different mode
engine = QueryEngine(temperature=QueryEngine.TEMPERATURE_BALANCED)
Python Version: 3.9, 3.10, or 3.11 (3.12+ is NOT supported)
Docker: Required for running MongoDB and Qdrant databases
Ollama: Required for the RAG query interface (optional if you only want to ingest documents)
For Ubuntu/Debian:
sudo apt-get update
sudo apt-get install -y \
poppler-utils \
tesseract-ocr \
  libmagic1
For macOS:
brew install poppler tesseract libmagic
If you don't have Docker installed:
- Ubuntu/Debian: Install Docker Engine
- macOS: Install Docker Desktop
- Windows: Install Docker Desktop
To use the query interface, install Ollama on your local machine:
macOS & Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows:
- Download from ollama.com/download
Pull a model:
# Default model used by query.py
ollama pull llama3:8b
# Alternative models
ollama pull mistral
ollama pull llama2
Verify Ollama is running:
# Check if Ollama server is running
curl http://localhost:11434/api/tags
# Or simply test it
ollama run llama3:8b "Hello!"
git clone https://github.com/tahaislam/hybrid-rag-parser.git
cd hybrid-rag-parser
# Create a virtual environment (recommended)
python3.11 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install Python dependencies
pip install -r requirements.txt
# Verify installation
python check_setup.py
# Start MongoDB and Qdrant containers
docker-compose up -d
# Verify containers are running
docker ps
You should see three containers running:
- hybrid-rag-mongo - MongoDB database (port 27017)
- hybrid-rag-qdrant - Qdrant vector database (ports 6333, 6334)
- hybrid-rag-mongo-express - MongoDB web UI (port 8081)
- MongoDB Web UI (Mongo Express): http://localhost:8081
  - Login Username: admin
  - Login Password: pass
  - Navigate to hybrid_rag_db → document_tables to view tables
- Qdrant Web UI: http://localhost:6334
  - No login required
  - View collections and vector points
  - Explore embeddings and metadata
Process all PDFs and store everything in databases:
python run_pipeline.py
This will:
- Load all PDFs from the data/ directory
- Extract tables and text from each PDF
- Store tables in MongoDB
- Generate embeddings and store text vectors in Qdrant
- Display progress for each file
from ingest import process_single_pdf
# Process a PDF file
tables, texts = process_single_pdf("data/sample1.pdf")
print(f"Extracted {len(tables)} tables")
print(f"Extracted {len(texts)} text chunks")from ingest import process_directory
# Process all PDFs in the data folder
results = process_directory("data/")
for filename, (tables, texts) in results.items():
print(f"{filename}: {len(tables)} tables, {len(texts)} text chunks")# Process all PDFs in data/ directory
python ingest.py
# Process a specific PDF
python ingest.py path/to/your/document.pdf
Once you've ingested documents using run_pipeline.py, you can ask questions and get answers based on your document contents.
- Documents must be ingested (run python run_pipeline.py first)
- Ollama must be installed and running
- A model must be pulled (default: llama3:8b)
Use the ask.py CLI tool:
# Ask a question
python ask.py "What are the key findings in the document?"
# Ask about specific data
python ask.py "What were the Q3 revenue numbers?"
# Ask about tables
python ask.py "Summarize the financial results table"Example Output:
Initializing Query Engine...
Connected to Ollama. Using model: llama3:8b
Searching vectors for: 'What are the key findings?'
Found 3 relevant text chunks.
Vector search identified: 'sample1.pdf' as the most relevant document.
Searching tables in MongoDB...
Found 2 relevant tables.
Synthesizing answer with local LLM (Ollama)...
================================================== ANSWER ==================================================
Based on the provided documents, the key findings include:
1. Revenue increased by 15% year-over-year
2. Customer satisfaction scores improved to 4.5/5
3. The new product line exceeded expectations
====================================================================================================
Use the QueryEngine programmatically:
from query import QueryEngine
# Initialize the engine
engine = QueryEngine()
# Ask a question
answer = engine.ask("What are the payment terms?")
print(answer)
# The engine automatically:
# 1. Searches for semantically similar text chunks
# 2. Identifies the most relevant source file
# 3. Retrieves tables from that specific file
# 4. Combines context and generates an answer
The QueryEngine implements a hybrid retrieval strategy:
- Vector Search (Qdrant): Finds the 3 most semantically similar text chunks to your question
- Smart File Detection: Identifies which source file is most relevant from the vector results
- Targeted Table Retrieval (MongoDB): Fetches tables only from that specific file
- Context Building: Combines text chunks and tables into a rich context
- Answer Generation (Ollama): Sends context + question to local LLM for synthesis
Why this approach?
- Avoids irrelevant table data from unrelated documents
- Provides both narrative context and structured data
- Keeps all processing 100% local and private
- No API keys or external services needed
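For orientation, the whole flow condensed into one sketch. This is illustrative rather than the actual query.py; the connection strings, collection names, and the llama3:8b default follow the Configuration section, and the ollama Python package is assumed to be installed.

```python
# Illustrative sketch of the hybrid retrieval flow (not the actual query.py).
from collections import Counter

from pymongo import MongoClient
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer
from ollama import Client

embedder = SentenceTransformer("all-MiniLM-L6-v2")
qdrant = QdrantClient("localhost", port=6333)
mongo = MongoClient("mongodb://root:examplepassword@localhost:27017/?authSource=admin")
tables = mongo["hybrid_rag_db"]["document_tables"]
llm = Client(host="http://localhost:11434")

def ask(question: str) -> str:
    # 1. Vector search: top 3 semantically similar text chunks
    hits = qdrant.search(
        collection_name="document_chunks",
        query_vector=embedder.encode(question).tolist(),
        limit=3,
    )

    # 2. Smart file detection: the source file that appears most among the hits
    source = Counter(h.payload["source_filename"] for h in hits).most_common(1)[0][0]

    # 3. Targeted table retrieval: up to 5 tables from that file only
    docs = list(tables.find({"source_filename": source}).limit(5))

    # 4. Context building: narrative chunks plus table content
    context = "\n\n".join([h.payload["text"] for h in hits] + [d["content"] for d in docs])

    # 5. Answer generation with a deterministic temperature
    prompt = f"Answer the question using only this context:\n{context}\n\nQuestion: {question}"
    reply = llm.chat(
        model="llama3:8b",
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": 0.0},
    )
    return reply["message"]["content"]
```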
Edit query.py to customize:
# Change the LLM model (line 23)
self.llm_model = 'mistral' # or 'llama2', 'codellama', etc.
# Adjust number of text chunks retrieved (line 44)
limit=5 # default is 3
# Adjust number of tables retrieved (line 58)
.limit(10) # default is 5
# Change Ollama server URL (line 22)
self.llm_client = Client(host='http://your-server:11434')
The project includes comprehensive testing tools to validate RAG performance:
1. Generate Sample PDFs
First, install the PDF generation library:
pip install reportlab
Then generate 5 diverse sample PDFs:
python generate_sample_pdfs.py
This creates sample PDFs with various table types:
- project_budget.pdf - Project budget and timeline
- financial_report.pdf - Quarterly revenue and expenses
- research_results.pdf - ML model performance data
- product_specs.pdf - Hardware specifications
- sales_report.pdf - Regional sales data
2. Ingest Sample Data
Process the sample PDFs:
python run_pipeline.py
3. Run Comprehensive Tests
Execute 20+ test cases:
python test_rag_queries.py
The test suite validates:
- Simple table lookups (e.g., "What is the estimated hours for software development?")
- Row/column intersections (e.g., "What was Q4 revenue for Cloud Services?")
- Best performer identification (e.g., "Which ML model had highest accuracy?")
- Multi-value extractions (e.g., "List all project phases")
- Comparison queries (e.g., "Compare Random Forest and XGBoost models")
Test Output Example:
TEST: Simple Table Lookup - Single Value
QUESTION: What is the estimated hours for software development?
ANSWER: Based on the project budget table, the estimated hours for
software development is 160 hours.
✓ PASSED: Answer contains expected content
Time taken: 5.23 seconds
TEST SUMMARY
Total tests run: 20
Tests passed: 19
Tests failed: 1
Average response time: 6.45 seconds
For detailed testing documentation, see TESTING.md
After running run_pipeline.py, view your data in the web interfaces:
MongoDB (Tables)
- Open http://localhost:8081 in your browser
- Login with username: admin, password: pass
- Navigate to hybrid_rag_db → document_tables
- Browse extracted tables with all metadata
Qdrant (Vector Embeddings)
- Open http://localhost:6334 in your browser
- Click on the document_chunks collection
- Click on any point ID to view its details
- Important: Look at the "payload" section to see the actual text
- The "vector" field shows 384 numbers (embedding) - ignore this
- The "payload" field contains:
text,source_filename,chunk_index
- Expand the payload to read the text content
Note: If the Qdrant UI is hard to read, use the helper script instead:
python view_qdrant_data.py
For easier viewing of Qdrant data, use the included helper script:
# View collection statistics and recent points
python view_qdrant_data.py
# View all text chunks from a specific file
python view_qdrant_data.py view sample1.pdf
# Search for similar text
python view_qdrant_data.py search "payment terms"
# Show detailed statistics
python view_qdrant_data.py stats
This script displays the actual text content in a readable format, without the confusing vector numbers.
from pymongo import MongoClient
# Connect to MongoDB (authSource=admin is required for root user)
client = MongoClient("mongodb://root:examplepassword@localhost:27017/?authSource=admin")
db = client["hybrid_rag_db"]
collection = db["document_tables"]
# Find all tables from a specific file
tables = collection.find({"source_filename": "sample1.pdf"})
for table in tables:
print(f"Table ID: {table['table_id']}")
print(f"Page: {table['metadata']['page_number']}")
print(f"Content: {table['content'][:100]}...")
print()from qdrant_client import QdrantClient
from embedding import EmbeddingModel
# Initialize
client = QdrantClient("localhost", port=6333)
embedder = EmbeddingModel()
# Search for similar content
query = "What are the key findings?"
query_vector = embedder.embed_texts([query])[0]
results = client.search(
collection_name="document_chunks",
query_vector=query_vector,
limit=5
)
for result in results:
print(f"Score: {result.score}")
print(f"Text: {result.payload['text']}")
print(f"Source: {result.payload['source_filename']}")
print()The project includes comprehensive examples:
# Run all examples
python example_usage.py
This will show you:
- How to process single files
- How to batch process directories
- How to customize parsing settings
- How to prepare data for database storage
from ingest import process_single_pdf
# Process a PDF
tables, texts = process_single_pdf("data/sample1.pdf")
# View table data
for i, table in enumerate(tables):
print(f"\nTable {i+1}:")
print(f" ID: {table['table_id']}")
print(f" Page: {table['metadata']['page_number']}")
print(f" Format: {table['content_type']}")
print(f" Content:\n{table['content']}")
# View text chunks
for i, text in enumerate(texts):
print(f"\nText Chunk {i+1}:")
print(f" {text}")import json
from ingest import process_single_pdf
tables, texts = process_single_pdf("data/sample1.pdf")
# Save tables as JSON
with open("output_tables.json", "w") as f:
    json.dump(tables, indent=2, fp=f)
# Save text chunks
with open("output_texts.txt", "w") as f:
    for i, text in enumerate(texts, 1):
        f.write(f"=== Text Chunk {i} ===\n")
        f.write(text + "\n\n")
print("Results saved to output_tables.json and output_texts.txt")
Each table is returned as a dictionary with:
{
"table_id": "table_0", # Unique identifier
"content": "<table>...</table>", # Table content (HTML or text)
"content_type": "html", # Format: "html" or "text"
"metadata": {
"page_number": 1, # Page where table appears
"filename": "sample.pdf", # Source file
"file_directory": "data/", # Source directory
"coordinates": {...}, # Position on page
"parent_id": "..." # Document hierarchy
}
}
Text is returned as a list of strings, with each string representing a coherent text segment:
[
"This is the document introduction...",
"Section 1: Overview of the project...",
"Key findings include...",
...
]
The parser supports three strategies:
| Strategy | Speed | Accuracy | Requirements |
|---|---|---|---|
| auto | Fast | Good | Basic (default) |
| fast | Fastest | Good | Basic |
| hi_res | Slow | Best | Requires layoutparser* |
*Note: hi_res strategy has dependency conflicts. The auto and fast strategies work well for most use cases.
from ingest import DocumentProcessor
processor = DocumentProcessor()
# Default: Good balance
tables, texts = processor.process_pdf("file.pdf", strategy="auto")
# Fast processing
tables, texts = processor.process_pdf("file.pdf", strategy="fast")
# Best accuracy (may fail without layoutparser)
tables, texts = processor.process_pdf("file.pdf", strategy="hi_res")
IMPORTANT: Dark backgrounds can prevent table extraction
The unstructured library has difficulty extracting text from tables with dark backgrounds and light text. For best results:
✓ DO:
- Use light backgrounds (white, light grey, light blue)
- Use dark text (black, dark grey)
- Maintain good contrast
✗ AVOID:
- Dark backgrounds (dark blue, dark red, black)
- White/light text on dark backgrounds
- Low contrast combinations
Example - What Works:
# Good: Light background, dark text
table.setStyle(TableStyle([
('BACKGROUND', (0, 0), (-1, 0), colors.lightgrey), # ✓
('ALIGN', (0, 0), (-1, -1), 'CENTER'),
('GRID', (0, 0), (-1, -1), 1, colors.black)
]))
Example - What Fails:
# Bad: Dark background, white text
table.setStyle(TableStyle([
('BACKGROUND', (0, 0), (-1, 0), colors.darkblue), # ✗
('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke), # ✗
]))
# Result: Empty table extraction
If your PDFs have dark-themed tables that aren't being extracted, regenerate them with lighter styling.
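As a starting point, here is a small self-contained sketch that generates a one-table PDF with extraction-friendly styling using reportlab (installed in the testing section). The filename and table contents below are illustrative only.

```python
# Illustrative: generate a PDF containing one light-styled, extraction-friendly table.
from reportlab.lib import colors
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle

rows = [
    ["Phase", "Estimated Hours"],        # header row
    ["Design", "80"],
    ["Software Development", "160"],
]

table = Table(rows)
table.setStyle(TableStyle([
    ("BACKGROUND", (0, 0), (-1, 0), colors.lightgrey),  # light header background
    ("TEXTCOLOR", (0, 0), (-1, -1), colors.black),      # dark text throughout
    ("ALIGN", (0, 0), (-1, -1), "CENTER"),
    ("GRID", (0, 0), (-1, -1), 1, colors.black),
]))

# Build a single-page PDF; drop it into data/ before running the pipeline.
SimpleDocTemplate("light_table_example.pdf", pagesize=letter).build([table])
```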
hybrid-rag-parser/
├── src/ # Source code organized by functionality
│ ├── __init__.py
│ ├── ingestion/ # Document processing and embedding
│ │ ├── __init__.py
│ │ ├── ingest.py # PDF parsing and table/text extraction
│ │ └── embedding.py # Text embedding with sentence-transformers
│ ├── database/ # Database connectors
│ │ ├── __init__.py
│ │ └── db_connectors.py # MongoDB and Qdrant connections
│ ├── query/ # RAG query engine
│ │ ├── __init__.py
│ │ └── query.py # Hybrid search with local LLM
│ └── utils/ # Utility scripts
│ ├── __init__.py
│ └── view_qdrant_data.py # View/search Qdrant data
├── tests/ # Test suite and sample data generation
│ ├── __init__.py
│ ├── generate_sample_pdfs.py # Generate 5 diverse test PDFs
│ └── test_rag_queries.py # 20+ automated test cases
├── examples/ # Usage examples
│ ├── __init__.py
│ └── example_usage.py # Document processing examples
├── data/ # PDF files for ingestion
│ ├── sample1.pdf
│ ├── sample2.pdf
│ ├── sample3.pdf
│ ├── project_budget.pdf # Generated test PDF
│ ├── financial_report.pdf
│ ├── research_results.pdf
│ ├── product_specs.pdf
│ └── sales_report.pdf
├── run_pipeline.py # Main orchestration script (run this!)
├── ask.py # CLI tool for asking questions
├── check_setup.py # Installation verification
├── docker-compose.yml # Database containers
├── requirements.txt # Python dependencies
├── README.md # This file
├── SETUP.md # Detailed setup instructions
├── TESTING.md # Testing guide
└── .gitignore # Git ignore rules
| Module | Purpose |
|---|---|
| run_pipeline.py | Main entry point - orchestrates full pipeline |
| ask.py | CLI tool - simple command-line interface for questions |
| check_setup.py | Verify Python dependencies are installed |
| Module | Purpose |
|---|---|
| src/ingestion/ingest.py | PDF parsing, table/text extraction |
| src/ingestion/embedding.py | Generate 384-dim vectors using sentence-transformers |
| src/database/db_connectors.py | MongoDB and Qdrant connection management |
| src/query/query.py | RAG query engine - hybrid search with local LLM |
| src/utils/view_qdrant_data.py | View & search Qdrant data in readable format |
| Module | Purpose |
|---|---|
| tests/generate_sample_pdfs.py | Generate test data - creates 5 sample PDFs with diverse tables |
| tests/test_rag_queries.py | Test suite - 20+ test cases to validate RAG performance |
| Module | Purpose |
|---|---|
| examples/example_usage.py | Usage examples for document processing |
| File | Purpose |
|---|---|
| README.md | Main documentation (this file) |
| SETUP.md | Detailed setup instructions |
| TESTING.md | Testing guide and troubleshooting |
| docker-compose.yml | Database container configuration |
The pipeline uses the following default configurations:
| Component | Setting | Value |
|---|---|---|
| MongoDB | Host | localhost:27017 |
| MongoDB | Database | hybrid_rag_db |
| MongoDB | Collection | document_tables |
| MongoDB | Credentials | root/examplepassword |
| Qdrant | Host | localhost:6333 |
| Qdrant | Collection | document_chunks |
| Qdrant | Distance Metric | Cosine |
| Embedding Model | Name | all-MiniLM-L6-v2 |
| Embedding Model | Vector Size | 384 |
| PDF Directory | Path | ./data |
| Parse Strategy | Default | auto |
| Ollama | Host | localhost:11434 |
| Ollama | Default Model | llama3:8b |
| Mongo Express | Web UI Username | admin |
| Mongo Express | Web UI Password | pass |
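For quick reference, the same defaults written out as Python constants. This is purely illustrative; the authoritative values live in the individual modules listed under the customization notes below.

```python
# Illustrative summary of the defaults above; edit the actual modules to change them.
MONGO_URI = "mongodb://root:examplepassword@localhost:27017/?authSource=admin"
MONGO_DB = "hybrid_rag_db"
MONGO_COLLECTION = "document_tables"

QDRANT_HOST, QDRANT_PORT = "localhost", 6333
QDRANT_COLLECTION = "document_chunks"
QDRANT_DISTANCE = "Cosine"

EMBEDDING_MODEL = "all-MiniLM-L6-v2"
VECTOR_SIZE = 384

PDF_DIR = "./data"
PARSE_STRATEGY = "auto"

OLLAMA_HOST = "http://localhost:11434"
OLLAMA_MODEL = "llama3:8b"
```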
Change MongoDB password:
- Edit docker-compose.yml (lines 14-15 and 36-37)
- Edit db_connectors.py (line 18)
- Restart containers: docker-compose down && docker-compose up -d
Change embedding model:
- Edit embedding.py (line 17)
- Update vector_size based on the new model
- Note: Changing models requires recreating the Qdrant collection
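If you switch embedding models, you can check the new vector size directly with sentence-transformers before recreating the collection; a quick illustrative check:

```python
# Print the embedding dimension of a candidate model (the default all-MiniLM-L6-v2 gives 384).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # swap in the model you plan to use
print(model.get_sentence_embedding_dimension())
```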
Change PDF directory:
- Edit run_pipeline.py (line 23)
Change parsing strategy:
- Edit run_pipeline.py (line 59)
- Options: "auto" (recommended), "fast", "hi_res"
- Note: "auto" provides the best balance of speed and table detection accuracy
Change Ollama model:
- Edit query.py (line 23)
- Options: "llama3:8b", "mistral", "llama2", "codellama", etc.
- Make sure to pull the model first: ollama pull <model-name>
When you run python run_pipeline.py, here's what happens:
1. Initialization
   - Load sentence-transformer model (all-MiniLM-L6-v2)
   - Connect to MongoDB (table storage)
   - Connect to Qdrant (vector storage)
   - Create/recreate Qdrant collection with 384-dim vectors
2. For Each PDF in data/
   - Parse PDF and extract elements
   - Separate tables from narrative text
   - Store tables in MongoDB with metadata
   - Generate embeddings for text chunks
   - Store vectors in Qdrant with payloads
3. Results
   - All tables queryable in MongoDB
   - All text searchable via semantic similarity in Qdrant
   - Access via web UIs or programmatic queries
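For orientation, a compressed sketch of roughly what that loop does. It is not the actual run_pipeline.py; import paths and helper names (process_single_pdf, EmbeddingModel) follow the examples earlier in this README, and the defaults come from the Configuration section.

```python
# Illustrative sketch of the ingestion loop; run_pipeline.py is the real entry point.
from pathlib import Path
from uuid import uuid4

from pymongo import MongoClient
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

from ingest import process_single_pdf      # import paths as used earlier in this README
from embedding import EmbeddingModel

mongo = MongoClient("mongodb://root:examplepassword@localhost:27017/?authSource=admin")
tables_col = mongo["hybrid_rag_db"]["document_tables"]

qdrant = QdrantClient("localhost", port=6333)
qdrant.recreate_collection(
    collection_name="document_chunks",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
embedder = EmbeddingModel()

for pdf in sorted(Path("data").glob("*.pdf")):
    tables, texts = process_single_pdf(str(pdf))   # separate tables from narrative text

    if tables:                                     # tables -> MongoDB
        for t in tables:
            t["source_filename"] = pdf.name        # matches the stored-table schema
        tables_col.insert_many(tables)

    if texts:                                      # text -> embeddings -> Qdrant
        vectors = embedder.embed_texts(texts)
        points = [
            PointStruct(
                id=str(uuid4()),
                vector=[float(x) for x in vec],
                payload={"text": txt, "source_filename": pdf.name, "chunk_index": i},
            )
            for i, (txt, vec) in enumerate(zip(texts, vectors))
        ]
        qdrant.upsert(collection_name="document_chunks", points=points)

    print(f"{pdf.name}: {len(tables)} tables, {len(texts)} text chunks ingested")
```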
To clear all ingested data (useful before re-ingesting with updated PDFs):
python clear_databases.py
# Type 'yes' to confirm
This will:
- Delete all documents from MongoDB
- Delete and recreate the Qdrant collection
- Provide a clean slate for re-ingestion
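A minimal sketch of what such a reset amounts to, assuming the default names from the Configuration section (the bundled clear_databases.py is the supported tool and also asks for confirmation first):

```python
# Illustrative sketch of a database reset; use clear_databases.py for the real thing.
from pymongo import MongoClient
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

# Drop all stored tables from MongoDB
mongo = MongoClient("mongodb://root:examplepassword@localhost:27017/?authSource=admin")
deleted = mongo["hybrid_rag_db"]["document_tables"].delete_many({}).deleted_count
print(f"Deleted {deleted} table documents from MongoDB")

# Delete and recreate the Qdrant collection as empty 384-dim storage
qdrant = QdrantClient("localhost", port=6333)
qdrant.recreate_collection(
    collection_name="document_chunks",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
print("Recreated empty 'document_chunks' collection in Qdrant")
```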
When to use:
- After fixing PDF table formatting issues
- Before re-ingesting updated documents
- To start fresh with new data
Complete reset workflow:
# 1. Clear old data
python clear_databases.py
# 2. Regenerate PDFs (if needed)
python generate_sample_pdfs.py
# 3. Re-ingest
python run_pipeline.py
# 4. Test queries
python test_rag_queries.py
Alternative: Docker reset (nuclear option)
# Completely wipe databases and restart containers
docker-compose down -v
docker-compose up -d
# Start all services
docker-compose up -d
# Stop all services
docker-compose down
# View logs
docker-compose logs -f
# Restart a specific service
docker-compose restart mongo
# Stop and remove all data (volumes)
docker-compose down -v
# View resource usage
docker stats
# MongoDB shell
docker exec -it hybrid-rag-mongo mongosh -u root -p examplepassword
# Check Qdrant collections
curl http://localhost:6333/collections
If you see errors about Python version:
# Check your Python version
python --version
# Install Python 3.11 using pyenv
pyenv install 3.11.9
pyenv local 3.11.9
# Or use conda
conda create -n rag-pipeline python=3.11
conda activate rag-pipeline
Containers won't start:
# Check if Docker is running
docker --version
docker ps
# View container logs
docker-compose logs
# Restart containers
docker-compose down
docker-compose up -d
Port conflicts (address already in use):
# Check which process is using the port
sudo lsof -i :27017 # MongoDB
sudo lsof -i :6333 # Qdrant
sudo lsof -i :8081 # Mongo Express
# Either stop the conflicting service or modify ports in docker-compose.yml
Cannot connect to databases:
# Ensure containers are running
docker ps
# Test MongoDB connection
docker exec -it hybrid-rag-mongo mongosh -u root -p examplepassword
# Check Qdrant health
curl http://localhost:6333/health
Clear all data and restart:
# Stop containers and remove volumes
docker-compose down -v
# Restart fresh
docker-compose up -d
If you see errors about poppler or tesseract:
# Ubuntu/Debian
sudo apt-get install poppler-utils tesseract-ocr libmagic1
# macOS
brew install poppler tesseract libmagic
If you get ModuleNotFoundError:
# Ensure you're in the virtual environment
source venv/bin/activate # or venv\Scripts\activate on Windows
# Reinstall dependencies
pip install --upgrade pip
pip install -r requirements.txt
If no tables or text are extracted:
- Check that your PDF contains actual text (not just images)
- Check for dark backgrounds in tables - Dark backgrounds with light text prevent extraction (see Table Formatting Best Practices)
- Try a different parsing strategy (auto, fast, or hi_res)
- Verify the PDF isn't password-protected or corrupted
- Test with the debug script: python test_financial_pdf.py to see what was extracted
Common cause: Tables with dark backgrounds (darkblue, darkred) and white text return empty content. Solution: Regenerate PDFs with light backgrounds.
If run_pipeline.py fails to connect:
- Ensure Docker containers are running: docker ps
- Check that database credentials in db_connectors.py match docker-compose.yml
- Wait a few seconds after starting containers for databases to initialize
If you see an error like:
Authentication failed., full error: {'ok': 0.0, 'errmsg': 'Authentication failed.', 'code': 18}
The Problem: The MongoDB connection string is missing the authSource parameter.
The Fix: The connection string in db_connectors.py should include ?authSource=admin:
MongoClient("mongodb://root:examplepassword@localhost:27017/?authSource=admin")This tells MongoDB to authenticate against the admin database where the root user is stored. This fix is already applied in the latest version.
If you see strange encoded text in the Qdrant web UI:
The Problem: You're looking at the "vector" field (384 floating-point numbers) instead of the actual text.
The Solution:
- In the Qdrant UI, click on a point ID
- Scroll down to the "payload" section
- Expand the payload to see:
  - text: The actual readable text
  - source_filename: Which PDF it came from
  - chunk_index: Position in the document
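If you prefer to read the payloads from Python rather than the UI, a short sketch using the qdrant-client scroll API (collection name from the Configuration section):

```python
# Illustrative: print stored payloads without the vector numbers.
from qdrant_client import QdrantClient

client = QdrantClient("localhost", port=6333)

# scroll() returns a (points, next_page_offset) tuple
points, _ = client.scroll(
    collection_name="document_chunks",
    limit=10,
    with_payload=True,
    with_vectors=False,   # skip the 384 floats entirely
)

for point in points:
    payload = point.payload
    print(f"[{payload['source_filename']} #{payload['chunk_index']}] {payload['text'][:80]}...")
```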
Better Option: Use the helper script for easier viewing:
python view_qdrant_data.py
This displays the text in a clean, readable format without the vector numbers.
If you get "OLLAMA CONNECTION FAILED" when running query.py or ask.py:
Check if Ollama is running:
# Test connection
curl http://localhost:11434/api/tags
# If not running, start Ollama
ollama serve # Linux/macOS
# Or launch the Ollama app on macOS/Windows
Verify model is pulled:
# List available models
ollama list
# If llama3:8b is missing, pull it
ollama pull llama3:8b
Connection refused errors:
- Ensure Ollama is installed: ollama --version
- Check if port 11434 is blocked by firewall
- Try running Ollama explicitly: ollama serve
Model not found errors:
# Pull the specific model mentioned in the error
ollama pull llama3:8b
# Or use a different model by editing query.py line 23
Slow response times:
- First query loads the model into memory (slow)
- Subsequent queries use cached model (faster)
- Consider using smaller models like llama2:7b or mistral:7b
- Check system resources: ollama ps to see running models
Alternative models to try:
# Smaller, faster models
ollama pull mistral
ollama pull llama2
# Larger, more accurate models
ollama pull llama3:70b # Requires significant RAM
# List all available models
ollama list
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
MIT License - see LICENSE file
For issues and questions:
- Check the SETUP.md file for detailed setup instructions
- Review the example_usage.py for code examples
- Run python check_setup.py to verify your installation
Built with:
- unstructured.io - Document parsing and table extraction
- pdf2image - PDF processing
- Tesseract - OCR capabilities
- sentence-transformers - Text embeddings (all-MiniLM-L6-v2)
- MongoDB - NoSQL database for table storage
- Qdrant - Vector database for semantic search
- Ollama - Local LLM inference for RAG queries
- Docker - Containerization
- Mongo Express - MongoDB web interface