Hybrid RAG Parser

An end-to-end, table-aware document ingestion pipeline for Retrieval-Augmented Generation (RAG) systems. It separates narrative text from structured tables and stores each in a database suited to it: tables in MongoDB, text embeddings in Qdrant.

Overview
Architecture
- Detailed Ingestion Flow
- Detailed Query Flow
- Data Storage Schema
Features
Requirements
- System Dependencies
- Docker Installation
- Ollama Installation (For RAG Queries)
Installation
- 1. Clone the Repository
- 2. Set Up Python Environment
- 3. Start the Databases with Docker
- 4. Access the Web UIs (Optional)
Quick Start
- Option 1: Run the Full Pipeline (Recommended)
- Option 2: Process Documents Without Database Storage
Querying Your RAG System
- Prerequisites
- Option 1: Command-Line Interface (Easiest)
- Option 2: Python API
- How It Works: Hybrid RAG
- Customizing the Query Engine
Testing the RAG System
- Generate Sample Data and Run Tests
Viewing Stored Data
- Option 1: Using Web UIs (Easiest)
- Option 2: Use the Qdrant Data Viewer Script (Recommended)
- Option 3: Query Databases Programmatically
Viewing Parsed Content (Before Database Storage)
- Option 1: Run Example Scripts
- Option 2: Access Parsed Data Programmatically
- Option 3: Save Output to Files
Understanding the Output
- Table Structure
- Text Chunks
Parsing Strategies
- Choosing a Strategy
- ⚠️ Table Formatting Best Practices
Project Structure
Module Descriptions
- Main Entry Points (Root Level)
- Source Code (src/)
- Tests (tests/)
- Examples (examples/)
- Documentation
Configuration
- Default Settings
- Customizing Settings
Pipeline Workflow
Database Maintenance
- Clearing All Data
Docker Management
- Useful Commands
- Accessing Database Shells
Troubleshooting
- Python Version Issues
- Docker Issues
- Missing System Dependencies
- Import Errors
- Empty Results
- Database Connection Errors
- MongoDB Authentication Failed Error
- Qdrant UI Shows Weird Characters
- Ollama Connection Issues
Contributing
License
Support
Acknowledgments

Overview

This project provides a production-ready document processing system that:

Extracts tables and narrative text separately from PDF documents
Stores tables in MongoDB (NoSQL) for exact lookups and structured queries
Generates embeddings and stores text in Qdrant (Vector DB) for semantic search
Provides a complete orchestration pipeline with Docker containerization
Supports migration to Microsoft Fabric (Real-Time Intelligence/Cosmos DB)

Architecture

PDF Document
    ↓
Document Parser (ingest.py)
    ├──> 📊 Tables → MongoDB (localhost:27017)
    │                  └─> Web UI: Mongo Express (localhost:8081)
    │
    └──> 📝 Text → Embeddings (embedding.py)
                      └─> sentence-transformers (384-dim vectors)
                          └─> Qdrant Vector DB (localhost:6333)
                               └─> Web UI (localhost:6334)

Orchestration: run_pipeline.py
Databases: docker-compose.yml

Query Interface:
    ↓
User Question → QueryEngine (query.py)
    ├──> Vector Search → Qdrant (semantic similarity)
    ├──> Table Search → MongoDB (filtered by source file)
    └──> Context + Question → Ollama (local LLM)
                                └─> Answer

CLI Tool: ask.py "Your question here"

graph TB
    subgraph "Data Ingestion Pipeline"
        PDF[📄 PDF Document]
        PARSER[Document Parser<br/>ingest.py]
        PDF --> PARSER

        PARSER -->|Tables<br/>HTML Format| MONGO[(MongoDB<br/>localhost:27017)]
        PARSER -->|Text Chunks| EMBED[Embedding Model<br/>embedding.py]

        EMBED -->|384-dim vectors<br/>sentence-transformers| QDRANT[(Qdrant Vector DB<br/>localhost:6333)]

        MONGO -.->|Web UI| MONGOUI[Mongo Express<br/>localhost:8081]
        QDRANT -.->|Web UI| QDRANTUI[Qdrant UI<br/>localhost:6334]
    end

    subgraph "Query Pipeline"
        USER[👤 User Question]
        CLI[ask.py CLI Tool]
        ENGINE[QueryEngine<br/>query.py]

        USER --> CLI
        CLI --> ENGINE

        ENGINE -->|1. Vector Search<br/>Semantic Similarity| QDRANT
        ENGINE -->|2. Table Search<br/>Filtered by Source| MONGO

        ENGINE -->|3. Combined Context<br/>+ Question| OLLAMA[🤖 Ollama<br/>Local LLM<br/>llama3:8b]

        OLLAMA -->|Generated Answer| ANSWER[💬 Answer]
    end

    subgraph "Orchestration & Setup"
        PIPELINE[run_pipeline.py<br/>End-to-end orchestration]
        DOCKER[docker-compose.yml<br/>Database containers]

        DOCKER -.->|Manages| MONGO
        DOCKER -.->|Manages| QDRANT
    end

    style PDF fill:#e1f5ff
    style MONGO fill:#4caf50,color:#fff
    style QDRANT fill:#2196f3,color:#fff
    style OLLAMA fill:#ff9800,color:#fff
    style ANSWER fill:#4caf50,color:#fff
    style ENGINE fill:#9c27b0,color:#fff

Detailed Ingestion Flow

flowchart LR
    subgraph Input
        PDF[📄 PDF Files<br/>in data/]
    end

    subgraph Processing["Document Processing (ingest.py)"]
        UNSTRUCTURED[unstructured.io<br/>PDF Parser]
        FILTER[Table Filter<br/>Remove duplicates]

        PDF --> UNSTRUCTURED
        UNSTRUCTURED -->|Elements| FILTER
    end

    subgraph "Table Pipeline"
        TABLES[📊 Table Elements<br/>HTML Format]
        FILTER -->|Tables| TABLES
        TABLES --> MONGO[(MongoDB<br/>document_tables<br/>collection)]
    end

    subgraph "Text Pipeline"
        TEXTS[📝 Text Chunks<br/>Filtered clean text]
        EMBED[sentence-transformers<br/>all-MiniLM-L6-v2]
        VECTORS[384-dim Vectors]

        FILTER -->|Text| TEXTS
        TEXTS --> EMBED
        EMBED --> VECTORS
        VECTORS --> QDRANT[(Qdrant<br/>document_chunks<br/>collection)]
    end

    style MONGO fill:#4caf50,color:#fff
    style QDRANT fill:#2196f3,color:#fff
    style FILTER fill:#ff9800,color:#fff
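
The "Table Filter" step in the diagram removes text chunks that merely repeat table content. The following is a minimal sketch of one way such a filter could work, not the code in ingest.py: `filter_table_duplicates` and its helpers are hypothetical names, and the substring-matching rule is an assumption.

```python
import re
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects every non-empty text fragment found in an HTML snippet."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def _normalize(text):
    # Lowercase and collapse whitespace so layout differences do not matter.
    return re.sub(r"\s+", " ", text).strip().lower()


def filter_table_duplicates(text_chunks, table_html_list):
    """Drop text chunks whose content already appears inside a table."""
    table_texts = []
    for html in table_html_list:
        parser = _VisibleText()
        parser.feed(html)
        table_texts.append(_normalize(" ".join(parser.parts)))

    kept = []
    for chunk in text_chunks:
        norm = _normalize(chunk)
        # A chunk is a duplicate if its normalized text occurs in any table.
        if norm and any(norm in t for t in table_texts):
            continue
        kept.append(chunk)
    return kept
```

Substring matching is deliberately strict: a chunk that only partly overlaps a table is kept, which favours retaining narrative text over removing it.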

Detailed Query Flow

sequenceDiagram
    participant User
    participant CLI as ask.py
    participant Engine as QueryEngine
    participant Embedder as Embedding Model
    participant Qdrant as Qdrant DB
    participant Mongo as MongoDB
    participant LLM as Ollama (llama3:8b)

    User->>CLI: "What was Q4 revenue?"
    CLI->>Engine: ask(question)

    Note over Engine: Step 1: Vector Search
    Engine->>Embedder: Embed question
    Embedder-->>Engine: Question vector
    Engine->>Qdrant: Search similar chunks (top 3)
    Qdrant-->>Engine: Relevant text chunks + source file

    Note over Engine: Step 2: Table Retrieval
    Engine->>Mongo: Find tables (filtered by source)
    Mongo-->>Engine: Relevant tables (up to 5)

    Note over Engine: Step 3: Format Context
    Engine->>Engine: Convert HTML tables to markdown
    Engine->>Engine: Format text chunks
    Engine->>Engine: Build LLM prompt

    Note over Engine: Step 4: Generate Answer
    Engine->>LLM: Prompt with context + question<br/>(temperature=0.0)
    LLM-->>Engine: Generated answer

    Engine-->>CLI: Answer text
    CLI-->>User: Display answer
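
Step 3 of the flow above converts HTML tables to markdown before they reach the LLM. The sketch below shows how that conversion could be done with only the standard library. `html_table_to_markdown` is a hypothetical helper, it assumes simple tables without rowspan or colspan, and it does not escape pipe characters inside cells.

```python
from html.parser import HTMLParser


class _TableRows(HTMLParser):
    """Collects table rows as lists of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse internal whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_markdown(html):
    """Render the first row as the header and the rest as body rows."""
    parser = _TableRows()
    parser.feed(html)
    if not parser.rows:
        return ""
    width = max(len(r) for r in parser.rows)
    # Pad short rows so every line has the same number of columns.
    rows = [r + [""] * (width - len(r)) for r in parser.rows]
    lines = [
        "| " + " | ".join(rows[0]) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(r) + " |" for r in rows[1:]]
    return "\n".join(lines)
```

Markdown is usually more compact than HTML for the same table, which leaves more of the prompt budget for text chunks.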

Data Storage Schema

erDiagram
    MONGODB ||--o{ TABLE : stores
    QDRANT ||--o{ CHUNK : stores

    TABLE {
        string table_id
        string content
        string content_type
        string source_filename
        dict metadata
    }

    CHUNK {
        uuid id
        array vector_384
        string text
        string source_filename
        dict metadata
    }
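
The two record shapes in the schema can be expressed as Python dataclasses. This is an illustrative sketch only: the field names follow the diagram, while the class names `TableRecord` and `ChunkRecord` and the dimension check are assumptions rather than part of the project.

```python
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class TableRecord:
    """One MongoDB document in the document_tables collection."""
    table_id: str
    content: str
    content_type: str  # "html" or "text"
    source_filename: str
    metadata: dict = field(default_factory=dict)


@dataclass
class ChunkRecord:
    """One Qdrant point in the document_chunks collection."""
    vector: list
    text: str
    source_filename: str
    metadata: dict = field(default_factory=dict)
    id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def __post_init__(self):
        # all-MiniLM-L6-v2 produces 384-dimensional vectors.
        if len(self.vector) != 384:
            raise ValueError(f"expected 384 dimensions, got {len(self.vector)}")


table = TableRecord("table_0", "<table>...</table>", "html", "sample1.pdf", {"page_number": 1})
chunk = ChunkRecord([0.0] * 384, "Key findings include...", "sample1.pdf")
print(asdict(table)["source_filename"], len(chunk.vector))
```

The shared `source_filename` field is what lets the query engine tie vector hits back to tables from the same document.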

Features

Smart Table Detection: Automatically identifies and extracts tables from PDFs
Multiple Parsing Strategies: Choose between 'auto', 'fast', or 'hi_res' for different accuracy/speed tradeoffs
Flexible Output Formats: Extract tables as HTML or plain text
Batch Processing: Process entire directories of PDFs at once
Vector Embeddings: Convert text to 384-dimensional vectors using sentence-transformers
Dual Database Storage: MongoDB for structured tables, Qdrant for vector search
Web UIs: Visual interfaces for MongoDB (Mongo Express) and Qdrant
Docker Containerization: One-command database setup
Full Pipeline Orchestration: End-to-end processing with run_pipeline.py
Detailed Metadata: Capture page numbers, coordinates, and file information
RAG Query Interface: Ask questions and get answers using local LLM (Ollama)
Hybrid Retrieval: Combines vector search and table lookups for comprehensive answers
100% Private: Uses local Ollama models - no data sent to external APIs
Comprehensive Testing: Sample PDFs and 20+ test cases to validate RAG performance
Migration Ready: Designed for easy migration to Microsoft Fabric
Deterministic Query Results
- Added configurable LLM temperature (default: 0.0 for consistency)
- The same question produces the same answer across runs, given the same model and ingested data
- Critical for testing and production reliability
Intelligent Table Filtering
- Automatically filters duplicate table data from text chunks
- Prevents the LLM from seeing the same table data twice (once as a table, once as a text chunk)
- Generic solution works for any PDF
Temperature Presets
- TEMPERATURE_DETERMINISTIC (0.0) - Default for factual Q&A
- TEMPERATURE_BALANCED (0.3) - Slight variation while staying factual
- TEMPERATURE_CREATIVE (0.8) - For creative tasks
Test Suite Reliability
- All 20 tests now pass consistently
- Non-deterministic failures eliminated
- Reproducible results for debugging

Usage:

from src.query.query import QueryEngine

# Default: Deterministic mode (recommended)
engine = QueryEngine()

# Or choose a different mode
engine = QueryEngine(temperature=QueryEngine.TEMPERATURE_BALANCED)
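
To illustrate where the temperature ends up, here is a sketch of a request body for Ollama's `/api/generate` endpoint, which accepts sampling settings under `options`. The preset values are the ones listed above. `build_generate_payload` is a hypothetical helper, and how QueryEngine actually forwards the value is not shown in this README.

```python
import json

# Preset values as documented in the feature list above.
TEMPERATURE_DETERMINISTIC = 0.0
TEMPERATURE_BALANCED = 0.3
TEMPERATURE_CREATIVE = 0.8


def build_generate_payload(prompt, model="llama3:8b",
                           temperature=TEMPERATURE_DETERMINISTIC):
    """Build a JSON-serializable body for POST /api/generate."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one complete response instead of chunks
        "options": {"temperature": temperature},
    }


payload = build_generate_payload("What was Q4 revenue?")
print(json.dumps(payload, indent=2))
```

A temperature of 0.0 makes decoding greedy, which is why repeated runs agree. Results may still differ after changing the model version or the ingested data.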

Requirements

Python Version: 3.9, 3.10, or 3.11 (3.12+ is NOT supported)

Docker: Required for running MongoDB and Qdrant databases

Ollama: Required for the RAG query interface (optional if you only want to ingest documents)

System Dependencies

For Ubuntu/Debian:

sudo apt-get update
sudo apt-get install -y \
    poppler-utils \
    tesseract-ocr \
    libmagic1

For macOS:

brew install poppler tesseract libmagic

Docker Installation

If you don't have Docker installed:

Ubuntu/Debian: Install Docker Engine
macOS: Install Docker Desktop
Windows: Install Docker Desktop

Ollama Installation (For RAG Queries)

To use the query interface, install Ollama on your local machine:

macOS & Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows:

Download from ollama.com/download

Pull a model:

# Default model used by query.py
ollama pull llama3:8b

# Alternative models
ollama pull mistral
ollama pull llama2

Verify Ollama is running:

# Check if Ollama server is running
curl http://localhost:11434/api/tags

# Or simply test it
ollama run llama3:8b "Hello!"

Installation

1. Clone the Repository

git clone https://github.com/tahaislam/hybrid-rag-parser.git
cd hybrid-rag-parser

2. Set Up Python Environment

# Create a virtual environment (recommended)
python3.11 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install Python dependencies
pip install -r requirements.txt

# Verify installation
python check_setup.py

3. Start the Databases with Docker

# Start MongoDB and Qdrant containers
docker-compose up -d

# Verify containers are running
docker ps

You should see three containers running:

hybrid-rag-mongo - MongoDB database (port 27017)
hybrid-rag-qdrant - Qdrant vector database (ports 6333, 6334)
hybrid-rag-mongo-express - MongoDB web UI (port 8081)

4. Access the Web UIs (Optional)

MongoDB Web UI (Mongo Express): http://localhost:8081
- Login Username: admin
- Login Password: pass
- Navigate to hybrid_rag_db → document_tables to view tables
Qdrant Web UI: http://localhost:6334
- No login required
- View collections and vector points
- Explore embeddings and metadata

Quick Start

Option 1: Run the Full Pipeline (Recommended)

Process all PDFs and store everything in databases:

python run_pipeline.py

This will:

Load all PDFs from the data/ directory
Extract tables and text from each PDF
Store tables in MongoDB
Generate embeddings and store text vectors in Qdrant
Display progress for each file

Option 2: Process Documents Without Database Storage

Process a Single PDF

from src.ingestion.ingest import process_single_pdf

# Process a PDF file
tables, texts = process_single_pdf("data/sample1.pdf")

print(f"Extracted {len(tables)} tables")
print(f"Extracted {len(texts)} text chunks")

Process All PDFs in a Directory

from src.ingestion.ingest import process_directory

# Process all PDFs in the data folder
results = process_directory("data/")

for filename, (tables, texts) in results.items():
    print(f"{filename}: {len(tables)} tables, {len(texts)} text chunks")

Run from Command Line

# Process all PDFs in data/ directory (run from the repository root)
python -m src.ingestion.ingest

# Process a specific PDF
python -m src.ingestion.ingest path/to/your/document.pdf

Querying Your RAG System

Once you've ingested documents using run_pipeline.py, you can ask questions and get answers based on your document contents.

Prerequisites

Documents must be ingested (run python run_pipeline.py first)
Ollama must be installed and running
A model must be pulled (default: llama3:8b)

Option 1: Command-Line Interface (Easiest)

Use the ask.py CLI tool:

# Ask a question
python ask.py "What are the key findings in the document?"

# Ask about specific data
python ask.py "What were the Q3 revenue numbers?"

# Ask about tables
python ask.py "Summarize the financial results table"

Example Output:

Initializing Query Engine...
Connected to Ollama. Using model: llama3:8b
Searching vectors for: 'What are the key findings in the document?'
Found 3 relevant text chunks.
Vector search identified: 'sample1.pdf' as the most relevant document.
Searching tables in MongoDB...
Found 2 relevant tables.

Synthesizing answer with local LLM (Ollama)...

================================================== ANSWER ==================================================
Based on the provided documents, the key findings include:
1. Revenue increased by 15% year-over-year
2. Customer satisfaction scores improved to 4.5/5
3. The new product line exceeded expectations
====================================================================================================

Option 2: Python API

Use the QueryEngine programmatically:

from src.query.query import QueryEngine

# Initialize the engine
engine = QueryEngine()

# Ask a question
answer = engine.ask("What are the payment terms?")
print(answer)

# The engine automatically:
# 1. Searches for semantically similar text chunks
# 2. Identifies the most relevant source file
# 3. Retrieves tables from that specific file
# 4. Combines context and generates an answer

How It Works: Hybrid RAG

The QueryEngine implements a hybrid retrieval strategy:

Vector Search (Qdrant): Finds the 3 text chunks most semantically similar to your question
Smart File Detection: Identifies which source file is most relevant from the vector results
Targeted Table Retrieval (MongoDB): Fetches tables only from that specific file
Context Building: Combines text chunks and tables into a rich context
Answer Generation (Ollama): Sends context + question to local LLM for synthesis
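
The retrieval part of these steps can be sketched with in-memory lists standing in for Qdrant and MongoDB. This is an illustration under assumptions, not the code in query.py: `hybrid_retrieve` is a hypothetical name, and picking the source file by majority vote among the top hits is one plausible reading of "Smart File Detection".

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_retrieve(query_vec, chunks, tables, top_k=3, max_tables=5):
    """Return (top chunks, tables from the winning file, winning file name)."""
    # 1. Vector search: rank chunks by similarity to the question.
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    hits = ranked[:top_k]
    if not hits:
        return [], [], None

    # 2. File detection (assumed rule): the most frequent source among the hits.
    source = Counter(c["source_filename"] for c in hits).most_common(1)[0][0]

    # 3. Targeted table retrieval: only tables from that file.
    selected = [t for t in tables if t["source_filename"] == source][:max_tables]
    return hits, selected, source


chunks = [
    {"text": "Q4 revenue grew", "source_filename": "fin.pdf", "vector": [1.0, 0.0]},
    {"text": "Cloud revenue", "source_filename": "fin.pdf", "vector": [0.9, 0.1]},
    {"text": "CPU cores", "source_filename": "spec.pdf", "vector": [0.0, 1.0]},
    {"text": "RAM size", "source_filename": "spec.pdf", "vector": [0.1, 0.9]},
]
tables = [
    {"table_id": "table_0", "source_filename": "fin.pdf"},
    {"table_id": "table_1", "source_filename": "spec.pdf"},
]
hits, selected, source = hybrid_retrieve([1.0, 0.0], chunks, tables)
print(source, [t["table_id"] for t in selected])
```

In the real system the ranking is done by Qdrant and the filter by a MongoDB query on `source_filename`; the control flow is the part this sketch is meant to show.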

Why this approach?

Avoids irrelevant table data from unrelated documents
Provides both narrative context and structured data
Keeps all processing 100% local and private
No API keys or external services needed

Customizing the Query Engine

Edit query.py to customize:

# Change the LLM model (line 23)
self.llm_model = 'mistral'  # or 'llama2', 'codellama', etc.

# Adjust number of text chunks retrieved (line 44)
limit=5  # default is 3

# Adjust number of tables retrieved (line 58)
.limit(10)  # default is 5

# Change Ollama server URL (line 22)
self.llm_client = Client(host='http://your-server:11434')

Testing the RAG System

Generate Sample Data and Run Tests

The project includes comprehensive testing tools to validate RAG performance:

1. Generate Sample PDFs

First, install the PDF generation library:

pip install reportlab

Then generate 5 diverse sample PDFs:

python generate_sample_pdfs.py

This creates sample PDFs with various table types:

project_budget.pdf - Project budget and timeline
financial_report.pdf - Quarterly revenue and expenses
research_results.pdf - ML model performance data
product_specs.pdf - Hardware specifications
sales_report.pdf - Regional sales data

2. Ingest Sample Data

Process the sample PDFs:

python run_pipeline.py

3. Run Comprehensive Tests

Execute 20+ test cases:

python test_rag_queries.py

The test suite validates:

Simple table lookups (e.g., "What is the estimated hours for software development?")
Row/column intersections (e.g., "What was Q4 revenue for Cloud Services?")
Best performer identification (e.g., "Which ML model had highest accuracy?")
Multi-value extractions (e.g., "List all project phases")
Comparison queries (e.g., "Compare Random Forest and XGBoost models")

Test Output Example:

TEST: Simple Table Lookup - Single Value
QUESTION: What is the estimated hours for software development?
ANSWER: Based on the project budget table, the estimated hours for
        software development is 160 hours.
✓ PASSED: Answer contains expected content
Time taken: 5.23 seconds

TEST SUMMARY
Total tests run: 20
Tests passed: 19
Tests failed: 1
Average response time: 6.45 seconds
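
The "Answer contains expected content" check in the output above can be approximated with a keyword test. The sketch below is an assumption about how such a check might look, not the logic in test_rag_queries.py; `answer_contains` is a hypothetical name.

```python
def answer_contains(answer, expected_keywords, require_all=True):
    """Case-insensitive keyword check on an LLM answer."""
    # Collapse whitespace so wrapped lines do not break multi-word keywords.
    haystack = " ".join(answer.lower().split())
    found = [kw for kw in expected_keywords if kw.lower() in haystack]
    if require_all:
        return len(found) == len(expected_keywords)
    return bool(found)


answer = (
    "Based on the project budget table, the estimated hours for\n"
    "        software development is 160 hours."
)
print(answer_contains(answer, ["160", "software development"]))
```

Keyword checks tolerate rephrasing by the LLM but can pass a wrong answer that happens to mention the expected value, so they work best with distinctive values.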

For detailed testing documentation, see TESTING.md

Viewing Stored Data

Option 1: Using Web UIs (Easiest)

After running run_pipeline.py, view your data in the web interfaces:

MongoDB (Tables)

Open http://localhost:8081 in your browser
Login with username: admin, password: pass
Navigate to hybrid_rag_db → document_tables
Browse extracted tables with all metadata

Qdrant (Vector Embeddings)

Open http://localhost:6334 in your browser
Click on the document_chunks collection
Click on any point ID to view its details
Important: Look at the "payload" section to see the actual text
- The "vector" field shows 384 numbers (embedding) - ignore this
- The "payload" field contains: text, source_filename, chunk_index
Expand the payload to read the text content

Note: If the Qdrant UI is hard to read, use the helper script instead:

python view_qdrant_data.py

Option 2: Use the Qdrant Data Viewer Script (Recommended)

For easier viewing of Qdrant data, use the included helper script:

# View collection statistics and recent points
python view_qdrant_data.py

# View all text chunks from a specific file
python view_qdrant_data.py view sample1.pdf

# Search for similar text
python view_qdrant_data.py search "payment terms"

# Show detailed statistics
python view_qdrant_data.py stats

This script displays the actual text content in a readable format, without the confusing vector numbers.

Option 3: Query Databases Programmatically

Query MongoDB for Tables

from pymongo import MongoClient

# Connect to MongoDB (authSource=admin is required for root user)
client = MongoClient("mongodb://root:examplepassword@localhost:27017/?authSource=admin")
db = client["hybrid_rag_db"]
collection = db["document_tables"]

# Find all tables from a specific file
tables = collection.find({"source_filename": "sample1.pdf"})

for table in tables:
    print(f"Table ID: {table['table_id']}")
    print(f"Page: {table['metadata']['page_number']}")
    print(f"Content: {table['content'][:100]}...")
    print()

Query Qdrant for Similar Text

from qdrant_client import QdrantClient
from src.ingestion.embedding import EmbeddingModel

# Initialize
client = QdrantClient("localhost", port=6333)
embedder = EmbeddingModel()

# Search for similar content
query = "What are the key findings?"
query_vector = embedder.embed_texts([query])[0]

results = client.search(
    collection_name="document_chunks",
    query_vector=query_vector,
    limit=5
)

for result in results:
    print(f"Score: {result.score}")
    print(f"Text: {result.payload['text']}")
    print(f"Source: {result.payload['source_filename']}")
    print()

Viewing Parsed Content (Before Database Storage)

Option 1: Run Example Scripts

The project includes comprehensive examples:

# Run all examples
python examples/example_usage.py

This will show you:

How to process single files
How to batch process directories
How to customize parsing settings
How to prepare data for database storage

Option 2: Access Parsed Data Programmatically

from src.ingestion.ingest import process_single_pdf

# Process a PDF
tables, texts = process_single_pdf("data/sample1.pdf")

# View table data
for i, table in enumerate(tables):
    print(f"\nTable {i+1}:")
    print(f"  ID: {table['table_id']}")
    print(f"  Page: {table['metadata']['page_number']}")
    print(f"  Format: {table['content_type']}")
    print(f"  Content:\n{table['content']}")

# View text chunks
for i, text in enumerate(texts):
    print(f"\nText Chunk {i+1}:")
    print(f"  {text}")

Option 3: Save Output to Files

import json
from src.ingestion.ingest import process_single_pdf

tables, texts = process_single_pdf("data/sample1.pdf")

# Save tables as JSON
with open("output_tables.json", "w") as f:
    json.dump(tables, f, indent=2)

# Save text chunks
with open("output_texts.txt", "w") as f:
    for i, text in enumerate(texts, 1):
        f.write(f"=== Text Chunk {i} ===\n")
        f.write(text + "\n\n")

print("Results saved to output_tables.json and output_texts.txt")

Understanding the Output

Table Structure

Each table is returned as a dictionary with:

{
    "table_id": "table_0",           # Unique identifier
    "content": "<table>...</table>", # Table content (HTML or text)
    "content_type": "html",          # Format: "html" or "text"
    "metadata": {
        "page_number": 1,            # Page where table appears
        "filename": "sample.pdf",    # Source file
        "file_directory": "data/",   # Source directory
        "coordinates": {...},        # Position on page
        "parent_id": "..."          # Document hierarchy
    }
}

Text Chunks

Text is returned as a list of strings, with each string representing a coherent text segment:

[
    "This is the document introduction...",
    "Section 1: Overview of the project...",
    "Key findings include...",
    ...
]

Parsing Strategies

The parser supports three strategies:

Strategy	Speed	Accuracy	Requirements
`auto`	Fast	Good	Basic (default)
`fast`	Fastest	Good	Basic
`hi_res`	Slow	Best	Requires layoutparser*

*Note: hi_res strategy has dependency conflicts. The auto and fast strategies work well for most use cases.

Choosing a Strategy

from src.ingestion.ingest import DocumentProcessor

processor = DocumentProcessor()

# Default: Good balance
tables, texts = processor.process_pdf("file.pdf", strategy="auto")

# Fast processing
tables, texts = processor.process_pdf("file.pdf", strategy="fast")

# Best accuracy (may fail without layoutparser)
tables, texts = processor.process_pdf("file.pdf", strategy="hi_res")
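
Because hi_res may fail when its dependencies are missing, one option is to try strategies in order and fall back on failure. This is a sketch: `parse_with_fallback` is a hypothetical helper, and it only assumes that the callable it receives accepts a `strategy` keyword and returns `(tables, texts)` as `process_pdf` does above.

```python
def parse_with_fallback(process_pdf, path, strategies=("hi_res", "auto", "fast")):
    """Try each strategy in order and return the first that succeeds."""
    errors = {}
    for strategy in strategies:
        try:
            tables, texts = process_pdf(path, strategy=strategy)
        except Exception as exc:  # broad on purpose: dependency errors vary
            errors[strategy] = exc
            continue
        return strategy, tables, texts
    raise RuntimeError(f"all strategies failed for {path}: {errors}")


def _demo_process(path, strategy="auto"):
    # Stand-in for DocumentProcessor.process_pdf that fails on hi_res.
    if strategy == "hi_res":
        raise ImportError("layoutparser is not installed")
    return [{"table_id": "table_0"}], ["some text"]


used, tables, texts = parse_with_fallback(_demo_process, "file.pdf")
print(used, len(tables), len(texts))
```

With a real processor the call would be `parse_with_fallback(processor.process_pdf, "file.pdf")`.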

⚠️ Table Formatting Best Practices

IMPORTANT: Dark backgrounds can prevent table extraction

The unstructured library has difficulty extracting text from tables with dark backgrounds and light text. For best results:

✓ DO:

Use light backgrounds (white, light grey, light blue)
Use dark text (black, dark grey)
Maintain good contrast

✗ AVOID:

Dark backgrounds (dark blue, dark red, black)
White/light text on dark backgrounds
Low contrast combinations

Example - What Works:

# Good: Light background, dark text
table.setStyle(TableStyle([
    ('BACKGROUND', (0, 0), (-1, 0), colors.lightgrey),  # ✓
    ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
    ('GRID', (0, 0), (-1, -1), 1, colors.black)
]))

Example - What Fails:

# Bad: Dark background, white text
table.setStyle(TableStyle([
    ('BACKGROUND', (0, 0), (-1, 0), colors.darkblue),   # ✗
    ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),  # ✗
]))
# Result: Empty table extraction

If your PDFs have dark-themed tables that aren't being extracted, regenerate them with lighter styling.
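
When generating PDFs, a style can be checked before rendering. The sketch below uses the WCAG relative-luminance and contrast-ratio formulas to flag light-on-dark or low-contrast combinations. It is a heuristic only: `is_risky_table_style` is a hypothetical helper, the 4.5 threshold is the WCAG guideline for normal text, and neither is known to match what the unstructured library actually tolerates.

```python
def _linear(channel):
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    r, g, b = (_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(rgb_a, rgb_b):
    """WCAG contrast ratio, from 1 (identical) to 21 (black on white)."""
    hi, lo = sorted((relative_luminance(rgb_a), relative_luminance(rgb_b)), reverse=True)
    return (hi + 0.05) / (lo + 0.05)


def is_risky_table_style(background_rgb, text_rgb, min_contrast=4.5):
    """Flag light text on a darker background, or any low-contrast pair."""
    light_on_dark = relative_luminance(background_rgb) < relative_luminance(text_rgb)
    return light_on_dark or contrast_ratio(background_rgb, text_rgb) < min_contrast


# RGB tuples chosen to resemble the styles in the examples above.
print(is_risky_table_style((0, 0, 139), (245, 245, 245)))  # dark blue + near-white text
print(is_risky_table_style((211, 211, 211), (0, 0, 0)))    # light grey + black text
```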

Project Structure

hybrid-rag-parser/
├── src/                      # Source code organized by functionality
│   ├── __init__.py
│   ├── ingestion/           # Document processing and embedding
│   │   ├── __init__.py
│   │   ├── ingest.py        # PDF parsing and table/text extraction
│   │   └── embedding.py     # Text embedding with sentence-transformers
│   ├── database/            # Database connectors
│   │   ├── __init__.py
│   │   └── db_connectors.py # MongoDB and Qdrant connections
│   ├── query/               # RAG query engine
│   │   ├── __init__.py
│   │   └── query.py         # Hybrid search with local LLM
│   └── utils/               # Utility scripts
│       ├── __init__.py
│       └── view_qdrant_data.py # View/search Qdrant data
├── tests/                   # Test suite and sample data generation
│   ├── __init__.py
│   ├── generate_sample_pdfs.py # Generate 5 diverse test PDFs
│   └── test_rag_queries.py     # 20+ automated test cases
├── examples/                # Usage examples
│   ├── __init__.py
│   └── example_usage.py    # Document processing examples
├── data/                    # PDF files for ingestion
│   ├── sample1.pdf
│   ├── sample2.pdf
│   ├── sample3.pdf
│   ├── project_budget.pdf  # Generated test PDF
│   ├── financial_report.pdf
│   ├── research_results.pdf
│   ├── product_specs.pdf
│   └── sales_report.pdf
├── run_pipeline.py          # Main orchestration script (run this!)
├── ask.py                   # CLI tool for asking questions
├── check_setup.py           # Installation verification
├── docker-compose.yml       # Database containers
├── requirements.txt         # Python dependencies
├── README.md                # This file
├── SETUP.md                 # Detailed setup instructions
├── TESTING.md               # Testing guide
└── .gitignore               # Git ignore rules

Module Descriptions

Main Entry Points (Root Level)

Module	Purpose
`run_pipeline.py`	Main entry point - orchestrates full pipeline
`ask.py`	CLI tool - simple command-line interface for questions
`check_setup.py`	Verify Python dependencies are installed

Source Code (src/)

Module	Purpose
`src/ingestion/ingest.py`	PDF parsing, table/text extraction
`src/ingestion/embedding.py`	Generate 384-dim vectors using sentence-transformers
`src/database/db_connectors.py`	MongoDB and Qdrant connection management
`src/query/query.py`	RAG query engine - hybrid search with local LLM
`src/utils/view_qdrant_data.py`	View & search Qdrant data in readable format

Tests (tests/)

Module	Purpose
`tests/generate_sample_pdfs.py`	Generate test data - creates 5 sample PDFs with diverse tables
`tests/test_rag_queries.py`	Test suite - 20+ test cases to validate RAG performance

Examples (examples/)

Module	Purpose
`examples/example_usage.py`	Usage examples for document processing

Documentation

File	Purpose
`README.md`	Main documentation (this file)
`SETUP.md`	Detailed setup instructions
`TESTING.md`	Testing guide and troubleshooting
`docker-compose.yml`	Database container configuration

Configuration

Default Settings

The pipeline uses the following default configurations:

Component	Setting	Value
MongoDB	Host	localhost:27017
MongoDB	Database	hybrid_rag_db
MongoDB	Collection	document_tables
MongoDB	Credentials	root/examplepassword
Qdrant	Host	localhost:6333
Qdrant	Collection	document_chunks
Qdrant	Distance Metric	Cosine
Embedding Model	Name	all-MiniLM-L6-v2
Embedding Model	Vector Size	384
PDF Directory	Path	./data
Parse Strategy	Default	auto
Ollama	Host	localhost:11434
Ollama	Default Model	llama3:8b
Mongo Express	Web UI Username	admin
Mongo Express	Web UI Password	pass

Customizing Settings

Change MongoDB password:

Edit docker-compose.yml (lines 14-15 and 36-37)
Edit db_connectors.py (line 18)
Restart containers: docker-compose down && docker-compose up -d

Change embedding model:

Edit embedding.py (line 17)
Update vector_size based on the new model
Note: Changing models requires recreating the Qdrant collection

Change PDF directory:

Edit run_pipeline.py (line 23)

Change parsing strategy:

Edit run_pipeline.py (line 59)
Options: "auto" (recommended), "fast", "hi_res"
Note: "auto" provides the best balance of speed and table detection accuracy

Change Ollama model:

Edit query.py (line 23)
Options: "llama3:8b", "mistral", "llama2", "codellama", etc.
Make sure to pull the model first: ollama pull <model-name>

Pipeline Workflow

When you run python run_pipeline.py, here's what happens:

Initialization
- Load sentence-transformer model (all-MiniLM-L6-v2)
- Connect to MongoDB (table storage)
- Connect to Qdrant (vector storage)
- Create/recreate Qdrant collection with 384-dim vectors
For Each PDF in data/
- Parse PDF and extract elements
- Separate tables from narrative text
- Store tables in MongoDB with metadata
- Generate embeddings for text chunks
- Store vectors in Qdrant with payloads
Results
- All tables queryable in MongoDB
- All text searchable via semantic similarity in Qdrant
- Access via web UIs or programmatic queries
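
The per-file loop described above can be sketched with the parsing, embedding, and storage steps passed in as callables. This is an outline of the control flow under assumptions, not the contents of run_pipeline.py; every function name here is hypothetical, and the demo uses stand-ins instead of real databases.

```python
import tempfile
from pathlib import Path


def run_pipeline(pdf_dir, parse, store_tables, embed, store_vectors):
    """Process every PDF in pdf_dir; return {filename: (n_tables, n_chunks)}."""
    summary = {}
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        tables, texts = parse(str(pdf))          # split tables from narrative text
        if tables:
            store_tables(pdf.name, tables)       # e.g. insert into MongoDB
        if texts:
            vectors = embed(texts)               # e.g. 384-dim sentence embeddings
            store_vectors(pdf.name, texts, vectors)  # e.g. upsert into Qdrant
        summary[pdf.name] = (len(tables), len(texts))
    return summary


# Demo with stand-in callables and a temporary directory.
stored_tables, stored_vectors = {}, {}
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "report.pdf").touch()
    Path(tmp, "notes.txt").touch()  # ignored: not a PDF
    summary = run_pipeline(
        tmp,
        parse=lambda path: ([{"table_id": "table_0"}], ["intro", "findings"]),
        store_tables=lambda name, tables: stored_tables.update({name: tables}),
        embed=lambda texts: [[0.0] * 384 for _ in texts],
        store_vectors=lambda name, texts, vecs: stored_vectors.update({name: vecs}),
    )
print(summary)
```

Passing the steps in as arguments keeps the loop testable without Docker, which is the main reason for this shape.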

Database Maintenance

Clearing All Data

To clear all ingested data (useful before re-ingesting with updated PDFs):

python clear_databases.py
# Type 'yes' to confirm

This will:

Delete all documents from MongoDB
Delete and recreate the Qdrant collection
Provide a clean slate for re-ingestion

When to use:

After fixing PDF table formatting issues
Before re-ingesting updated documents
To start fresh with new data

Complete reset workflow:

# 1. Clear old data
python clear_databases.py

# 2. Regenerate PDFs (if needed)
python generate_sample_pdfs.py

# 3. Re-ingest
python run_pipeline.py

# 4. Test queries
python test_rag_queries.py

Alternative: Docker reset (nuclear option)

# Completely wipe databases and restart containers
docker-compose down -v
docker-compose up -d

Docker Management

Useful Commands

# Start all services
docker-compose up -d

# Stop all services
docker-compose down

# View logs
docker-compose logs -f

# Restart a specific service
docker-compose restart mongo

# Stop and remove all data (volumes)
docker-compose down -v

# View resource usage
docker stats

Accessing Database Shells

# MongoDB shell
docker exec -it hybrid-rag-mongo mongosh -u root -p examplepassword

# Check Qdrant collections
curl http://localhost:6333/collections

Troubleshooting

Python Version Issues

If you see errors about Python version:

# Check your Python version
python --version

# Install Python 3.11 using pyenv
pyenv install 3.11.9
pyenv local 3.11.9

# Or use conda
conda create -n rag-pipeline python=3.11
conda activate rag-pipeline

Docker Issues

Containers won't start:

# Check if Docker is running
docker --version
docker ps

# View container logs
docker-compose logs

# Restart containers
docker-compose down
docker-compose up -d

Port conflicts (address already in use):

# Check which process is using the port
sudo lsof -i :27017  # MongoDB
sudo lsof -i :6333   # Qdrant
sudo lsof -i :8081   # Mongo Express

# Either stop the conflicting service or modify ports in docker-compose.yml

Cannot connect to databases:

# Ensure containers are running
docker ps

# Test MongoDB connection
docker exec -it hybrid-rag-mongo mongosh -u root -p examplepassword

# Check Qdrant health
curl http://localhost:6333/healthz

Clear all data and restart:

# Stop containers and remove volumes
docker-compose down -v

# Restart fresh
docker-compose up -d

Missing System Dependencies

If you see errors about poppler or tesseract:

# Ubuntu/Debian
sudo apt-get install poppler-utils tesseract-ocr libmagic1

# macOS
brew install poppler tesseract libmagic

Import Errors

If you get ModuleNotFoundError:

# Ensure you're in the virtual environment
source venv/bin/activate  # or venv\Scripts\activate on Windows

# Reinstall dependencies
pip install --upgrade pip
pip install -r requirements.txt

Empty Results

If no tables or text are extracted:

Check that your PDF contains actual text (not just images)
Check for dark backgrounds in tables - Dark backgrounds with light text prevent extraction (see Table Formatting Best Practices)
Try a different parsing strategy (auto, fast, or hi_res)
Verify the PDF isn't password-protected or corrupted
Test with debug script: python test_financial_pdf.py to see what was extracted

Common cause: Tables with dark backgrounds (darkblue, darkred) and white text return empty content. Solution: Regenerate PDFs with light backgrounds.

Database Connection Errors

If run_pipeline.py fails to connect:

Ensure Docker containers are running: docker ps
Check database credentials in db_connectors.py match docker-compose.yml
Wait a few seconds after starting containers for databases to initialize
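
Instead of waiting a fixed number of seconds, a script can poll the database ports until they accept connections. The helper below is a sketch using only the standard library; `wait_for_port` is a hypothetical name and is not part of this project.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or unreachable): retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("localhost", 27017)` and `wait_for_port("localhost", 6333)` could run before the pipeline starts. An open port only shows that the container is listening; authentication can still fail for the reasons covered in the next section.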

MongoDB Authentication Failed Error

If you see an error like:

Authentication failed., full error: {'ok': 0.0, 'errmsg': 'Authentication failed.', 'code': 18}

The Problem: The MongoDB connection string is missing the authSource parameter.

The Fix: The connection string in db_connectors.py should include ?authSource=admin:

MongoClient("mongodb://root:examplepassword@localhost:27017/?authSource=admin")

This tells MongoDB to authenticate against the admin database where the root user is stored. This fix is already applied in the latest version.
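
If the password is changed to one containing characters such as `@` or `:`, it must be percent-encoded in the URI. The sketch below builds the connection string with `urllib.parse.quote_plus`; `mongo_uri` is a hypothetical helper, and the defaults mirror the values in this README.

```python
from urllib.parse import quote_plus


def mongo_uri(user, password, host="localhost", port=27017, auth_source="admin"):
    """Build a MongoDB URI with escaped credentials and an explicit authSource."""
    return (
        f"mongodb://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/?authSource={quote_plus(auth_source)}"
    )


print(mongo_uri("root", "examplepassword"))
```

The returned string can be passed directly to `MongoClient`.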

Qdrant UI Shows Weird Characters

If you see strange encoded text in the Qdrant web UI:

The Problem: You're looking at the "vector" field (384 floating-point numbers) instead of the actual text.

The Solution:

In the Qdrant UI, click on a point ID
Scroll down to the "payload" section
Expand the payload to see:
- text: The actual readable text
- source_filename: Which PDF it came from
- chunk_index: Position in the document

Better Option: Use the helper script for easier viewing:

python view_qdrant_data.py

This displays the text in a clean, readable format without the vector numbers.

Ollama Connection Issues

If you get "OLLAMA CONNECTION FAILED" when running query.py or ask.py:

Check if Ollama is running:

# Test connection
curl http://localhost:11434/api/tags

# If not running, start Ollama
ollama serve  # Linux/macOS
# Or launch the Ollama app on macOS/Windows

Verify model is pulled:

# List locally installed models
ollama list

# If llama3:8b is missing, pull it
ollama pull llama3:8b

Connection refused errors:

Ensure Ollama is installed: ollama --version
Check if port 11434 is blocked by firewall
Try running Ollama explicitly: ollama serve
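
The same checks can be scripted from Python by calling the `/api/tags` endpoint used in the curl command above. This is a sketch with a hypothetical helper name, `list_ollama_models`; it returns `None` when the server cannot be reached and otherwise the names of the installed models.

```python
import json
import urllib.error
import urllib.request


def list_ollama_models(base_url="http://localhost:11434", timeout=3.0):
    """Return installed model names, or None if Ollama is unreachable."""
    # Bypass any configured HTTP proxy, since the server is expected to be local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(f"{base_url}/api/tags", timeout=timeout) as resp:
            data = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None
    return [m.get("name", "") for m in data.get("models", [])]
```

A result of `None` points to a connection problem, while a list that lacks `llama3:8b` means the model still needs to be pulled.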

Model not found errors:

# Pull the specific model mentioned in the error
ollama pull llama3:8b

# Or use a different model by editing query.py line 23

Slow response times:

First query loads the model into memory (slow)
Subsequent queries use cached model (faster)
Consider using smaller models like llama2:7b or mistral:7b
Check system resources: ollama ps to see running models

Alternative models to try:

# Smaller, faster models
ollama pull mistral
ollama pull llama2

# Larger, more accurate models
ollama pull llama3:70b  # Requires significant RAM

# List the models already downloaded locally
ollama list

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Make your changes
Submit a pull request

License

MIT License - see LICENSE file

Support

For issues and questions:

Check the SETUP.md file for detailed setup instructions
Review examples/example_usage.py for code examples
Run python check_setup.py to verify your installation

Acknowledgments

Built with:

unstructured.io - Document parsing and table extraction
pdf2image - PDF processing
Tesseract - OCR capabilities
sentence-transformers - Text embeddings (all-MiniLM-L6-v2)
MongoDB - NoSQL database for table storage
Qdrant - Vector database for semantic search
Ollama - Local LLM inference for RAG queries
Docker - Containerization
Mongo Express - MongoDB web interface
