📊 TableRAG - Retrieval-Augmented Generation for Tables + Text

TableRAG is an advanced question-answering framework that combines structured tabular data (CSV files) and unstructured text documents (PDF, DOCX, TXT, MD) using Retrieval-Augmented Generation (RAG). Ask natural language questions and get intelligent answers that leverage both your data tables and text content.

Python 3.12+ · Streamlit · MIT License


🚀 Features

✅ Multi-modal Document Support: CSV tables, PDF documents, Word files, Markdown, and plain text
✅ Hybrid RAG Architecture: Combines SQL execution (precise) + vector search (semantic)
✅ Interactive Streamlit UI: Drag-and-drop uploads with real-time processing
✅ Intelligent Query Processing: LLM-powered query decomposition and answer synthesis
✅ Advanced Data Handling: Auto-encoding detection, CSV dialect sniffing, column type inference
✅ Comprehensive Error Handling: Graceful fallbacks and detailed debug information
✅ In-Memory Processing: Fast iteration without persistent storage requirements
✅ CLI Support: Command-line interface for batch processing


🎥 Demo

Interface Screenshot

TableRAG Interface

Video Walkthrough

🎬 ▶️ Watch Full Demo Video - Complete walkthrough of TableRAG features

The interface demonstrates the clean, intuitive design with:

  • 📁 Drag-and-drop file upload (CSV, PDF, DOCX, TXT, MD)
  • ⚡ Real-time processing with progress indicators
  • 🔧 Debug mode with SQL query inspection
  • 💬 Interactive Q&A with comprehensive answers

πŸ—οΈ Architecture Flow

graph TD
    A[πŸ“ File Upload] --> B{File Type?}
    B -->|CSV| C[πŸ—ƒοΈ CSV Parser]
    B -->|PDF/DOCX/TXT| D[πŸ“„ Text Extractor]
    
    C --> E[🧠 SQL Schema Generation]
    E --> F[πŸ’Ύ SQLite In-Memory DB]
    
    D --> G[βœ‚οΈ Text Chunking]
    G --> H[πŸ”€ Sentence Transformers]
    H --> I[πŸ” FAISS Vector Index]
    
    J[❓ User Query] --> K[πŸ€– Query Decomposition<br/>Groq LLM]
    
    K --> L[πŸ” Vector Search]
    I --> L
    L --> M[πŸ“š Retrieved Chunks]
    
    K --> N[πŸ’¬ NL2SQL Generation]
    F --> N
    N --> O[βš™οΈ SQL Execution]
    O --> P[πŸ“Š Query Results]
    
    M --> Q[🎯 Answer Synthesis<br/>Groq LLM]
    P --> Q
    Q --> R[βœ… Final Answer]
Loading

Core Components:

  1. Document Ingestion: Multi-format file processing with validation
  2. Dual Storage: SQLite tables + FAISS vector embeddings
  3. Query Intelligence: LLM-powered query understanding and decomposition
  4. Hybrid Retrieval: SQL precision + semantic search
  5. Answer Generation: Context-aware response synthesis
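
A minimal sketch of how these five components fit together in app/pipeline/rag_pipeline.py. The names here (decompose, nl2sql, synthesize, get_table_schemas) are hypothetical; the real orchestration differs in detail:

def answer_question(query, llm, embedder, index, conn, top_k=5):
    # 1. Decompose the question into a retrieval query and a SQL task (LLM call)
    plan = llm.decompose(query)

    # 2. Semantic side: embed the retrieval query, search the FAISS index
    query_vec = embedder.encode([plan["retrieval_query"]])
    chunks = index.search(query_vec, top_k)

    # 3. Structured side: generate SQL against the known table schemas, then run it
    schemas = get_table_schemas(conn)  # hypothetical helper listing CREATE TABLE statements
    sql = llm.nl2sql(plan["sql_task"], schemas)
    rows = conn.execute(sql).fetchall()

    # 4. Synthesize the final answer from both evidence sources (LLM call)
    return llm.synthesize(query, chunks, rows)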

🗂️ Project Structure

TableRAG/
├── 🎯 Core Application
│   ├── streamlit_app.py          # Main Streamlit UI (268 lines)
│   ├── run.py                    # CLI interface
│   └── app/                      # Core logic modules
│       ├── config.py             # Environment configuration
│       ├── pipeline/
│       │   └── rag_pipeline.py   # Main RAG orchestration (227 lines)
│       ├── llm/
│       │   ├── groq_client.py    # Groq API integration
│       │   └── prompts.py        # LLM prompt templates
│       ├── database/
│       │   └── sql_executor.py   # SQLite operations (269 lines)
│       ├── embeddings/
│       │   └── embedder.py       # Sentence Transformers wrapper
│       ├── retrieval/
│       │   └── faiss_index.py    # FAISS vector operations
│       └── utils/
│           ├── ingest.py         # Multi-format file processing (321 lines)
│           ├── chunking.py       # Text segmentation
│           └── logging.py        # Centralized logging
│
├── 📁 Data & Storage
│   ├── data/                     # User data directories
│   │   ├── tables/               # CSV files (persistent)
│   │   └── texts/                # Text documents (persistent)
│   ├── db/                       # SQLite databases
│   │   └── tables.db             # Persistent database (optional)
│   └── index/                    # FAISS index files
│       └── faiss.index           # Vector index (persistent)
│
├── 🎬 Assets & Documentation
│   ├── assets/
│   │   ├── Screenshot 2025-10-09 230417.png    # UI demo
│   │   └── Screen Recording 2025-10-09 225828.mp4  # Video demo
│   ├── test_assets/              # Sample files for testing
│   │   ├── test.csv
│   │   ├── report.pdf
│   │   └── report.html
│   └── README.md                 # This documentation
│
├── ⚙️ Configuration
│   ├── requirements.txt          # Python dependencies
│   ├── .env                      # Environment variables (create this)
│   ├── .gitignore                # Git exclusions
│   └── helper.py                 # Development utilities
│
└── 🐍 Virtual Environment
    └── trag/                     # Python virtual environment

🛠️ Installation & Setup

Prerequisites

  • Python 3.12+ (recommended)
  • Groq API Key (for LLM access)
  • Git (for cloning)

1. Clone Repository

git clone https://github.com/HemaKumar0077/TableRAG
cd TableRAG

2. Create Virtual Environment

# Windows
python -m venv trag
trag\Scripts\activate

# macOS/Linux  
python3 -m venv trag
source trag/bin/activate

3. Install Dependencies

pip install -r requirements.txt

4. Configure Environment

Create a .env file in the project root:

# ===== REQUIRED CONFIGURATION =====
GROQ_API_KEY=gsk_your_groq_api_key_here

# ===== OPTIONAL CONFIGURATION =====
# Embedding Model (Hugging Face)
EMBEDDING_MODEL_NAME=sentence-transformers/all-MiniLM-L6-v2

# Database Settings
DB_TYPE=sqlite
SQLITE_DB_PATH=db/tables.db

# FAISS Index Configuration  
FAISS_INDEX_PATH=index/faiss.index

# Retrieval Parameters
TOP_K_RETRIEVAL=5
MAX_ITERATIONS=1

# Logging
LOG_LEVEL=INFO
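
For reference, a minimal sketch of how app/config.py could load these variables with python-dotenv (hypothetical; the actual module may differ):

# Sketch only: assumes python-dotenv; the real app/config.py may differ.
import os
from dotenv import load_dotenv

load_dotenv()  # pulls .env from the project root into the environment

GROQ_API_KEY = os.environ["GROQ_API_KEY"]  # required; raises KeyError if missing
EMBEDDING_MODEL_NAME = os.getenv("EMBEDDING_MODEL_NAME", "sentence-transformers/all-MiniLM-L6-v2")
SQLITE_DB_PATH = os.getenv("SQLITE_DB_PATH", "db/tables.db")
FAISS_INDEX_PATH = os.getenv("FAISS_INDEX_PATH", "index/faiss.index")
TOP_K_RETRIEVAL = int(os.getenv("TOP_K_RETRIEVAL", "5"))
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")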

📋 How to get a Groq API Key:

  1. Visit console.groq.com
  2. Sign up/login with your account
  3. Navigate to "API Keys" section
  4. Create a new API key
  5. Copy and paste into your .env file

5. Create Required Directories

# macOS/Linux (on Windows, create the folders manually or run this in Git Bash)
mkdir -p data/tables data/texts db index

🚀 Usage

Option 1: Streamlit Web Interface (Recommended)

streamlit run streamlit_app.py

Features:

  • 🖱️ Drag & Drop: Upload CSV, PDF, DOCX, TXT, MD files
  • ⚡ Real-time Processing: See upload progress and validation
  • 🔧 Debug Mode: Inspect SQL queries and execution details
  • 📊 Interactive Results: View data tables and text chunks
  • ⚠️ Error Handling: Clear feedback on processing issues

Workflow:

  1. Upload Files: Drag CSV files (→ tables) and text files (→ chunks)
  2. Process Documents: Click "🚀 Process Documents"
  3. Ask Questions: Type natural language queries
  4. Get Answers: View synthesized responses with debug info

Option 2: Command Line Interface

python run.py

Example Session:

🔍 TableRAG CLI
Ask a question based on your text and table knowledge base.

🧠 Enter your question: What was the total revenue by region?
✅ Answer: Based on the sales data, the total revenue by region is...

--- Debug Info ---
📚 Retrieved Chunks: [relevant text excerpts]
📄 SQL Query: SELECT region, SUM(revenue) FROM sales_data GROUP BY region
🧾 SQL Result: [{"region": "North", "revenue": 150000}, ...]

💡 Example Queries

📊 Table Analysis:

  • "What is the total sales revenue across all regions?"
  • "Which product had the highest growth rate?"
  • "Show me all customers with orders above $10,000"
  • "What is the average age of customers by location?"

📄 Document Search:

  • "What are the key findings from the uploaded reports?"
  • "Summarize the main recommendations in the documents"
  • "What challenges were mentioned in the analysis?"

🔗 Hybrid Queries:

  • "Based on the sales data, what do the reports say about market trends?"
  • "Compare the revenue figures with the strategic recommendations"

πŸ—οΈ Technical Architecture

🧠 LLM Integration (Groq)

  • Model: Llama-3.3-70B-Versatile
  • API: OpenAI-compatible REST interface
  • Functions: Query decomposition, SQL generation, answer synthesis
  • Timeout: 30-second request limit with retry logic
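
Because the interface is OpenAI-compatible, a chat completion can be sketched with plain requests (illustrative only; app/llm/groq_client.py may use a client library instead):

import os
import requests

# Illustrative call following the OpenAI chat-completions schema on Groq's endpoint
resp = requests.post(
    "https://api.groq.com/openai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"},
    json={
        "model": "llama-3.3-70b-versatile",
        "messages": [{"role": "user", "content": "What was total revenue by region?"}],
    },
    timeout=30,  # matches the 30-second request limit noted above
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])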

🔍 Vector Search (FAISS)

  • Algorithm: Inner Product (IP) for cosine similarity
  • Embeddings: Sentence Transformers (384-dim by default)
  • Storage: In-memory with optional persistence
  • Performance: Sub-second search on 100K+ chunks
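
Inner product equals cosine similarity once vectors are L2-normalized, which is the pattern sketched below (illustrative, not the project's exact code):

import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dim output

chunks = ["Revenue grew 12% in the North region.", "The report flags supply-chain risks."]
emb = model.encode(chunks).astype("float32")
faiss.normalize_L2(emb)                  # normalize so inner product == cosine similarity
index = faiss.IndexFlatIP(emb.shape[1])  # exact inner-product index
index.add(emb)

query = model.encode(["What happened to revenue?"]).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 2)     # top-k nearest chunks
print(ids[0], scores[0])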

🗄️ Database Operations (SQLite)

  • Connection: Thread-safe, in-memory primary storage
  • Features: Auto-schema inference, type detection, sanitization
  • Safety: SQL injection protection, transaction management
  • Validation: Comprehensive error handling and rollback
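
As a rough illustration, loading a parsed CSV into the in-memory database could look like this (a sketch assuming pandas; the actual sql_executor.py logic is more involved):

import sqlite3
import pandas as pd

# check_same_thread=False lets the connection be shared across threads (sketch only)
conn = sqlite3.connect(":memory:", check_same_thread=False)

df = pd.read_csv("data/tables/test.csv")                  # pandas infers column types
df.to_sql("test", conn, index=False, if_exists="replace")

# Parameterized queries keep user-derived values out of the SQL string
rows = conn.execute("SELECT * FROM test LIMIT ?", (5,)).fetchall()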

πŸ“ File Processing Pipeline

# Supported formats and processing
SUPPORTED_FORMATS = {
    'CSV': 'Parsed → SQLite tables with type inference',
    'PDF': 'Text extraction → chunked → vectorized',
    'DOCX': 'Content extraction → chunked → vectorized',
    'TXT/MD': 'Direct chunking → vectorized',
}
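
The chunking step for PDF/DOCX/TXT/MD content can be sketched as fixed-size windows with overlap, so sentences spanning a boundary appear in both chunks (hypothetical sizes; app/utils/chunking.py may differ):

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks (sketch only)."""
    step = size - overlap
    return [text[start:start + size] for start in range(0, max(len(text) - overlap, 1), step)]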

🔧 Configuration Options

Variable              Default            Description
GROQ_API_KEY          (required)         Your Groq API authentication key
EMBEDDING_MODEL_NAME  all-MiniLM-L6-v2   Hugging Face model for embeddings
SQLITE_DB_PATH        db/tables.db       Persistent SQLite database location
FAISS_INDEX_PATH      index/faiss.index  FAISS vector index file path
TOP_K_RETRIEVAL       5                  Number of text chunks to retrieve
LOG_LEVEL             INFO               Logging verbosity (DEBUG/INFO/WARNING/ERROR)

🐛 Troubleshooting

Common Issues

❌ "Failed to load embedding model"

# Solution: Install/update transformers
pip install --upgrade sentence-transformers torch

❌ "Groq API authentication failed"

# Check that your .env file contains the correct API key
grep GROQ_API_KEY .env   # should print your key line

❌ "CSV parsing errors"

  • Cause: Encoding issues or malformed CSV
  • Solution: Check file encoding, verify CSV structure
  • Debug: Enable "Show Debug Information" in UI

❌ "Empty query results"

  • Cause: No relevant data found
  • Solution: Verify files were processed successfully
  • Check: File Information sidebar shows loaded tables/chunks

Performance Optimization

  • Large CSVs: Files auto-process in 1000-row batches (see the sketch after this list)
  • Memory Usage: Consider smaller TOP_K_RETRIEVAL values
  • Response Time: Use more specific queries for faster results
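
A minimal sketch of 1000-row batch ingestion using pandas' chunked CSV reader (illustrative; the project's ingest code may differ):

import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:", check_same_thread=False)

# Stream the CSV in 1000-row batches instead of loading it all at once
for batch in pd.read_csv("data/tables/big.csv", chunksize=1000):
    batch.to_sql("big", conn, index=False, if_exists="append")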

📊 Monitoring & Logging

Log Location: app.log (rotating, 5MB max)

Log Levels Available:

LOG_LEVEL=DEBUG    # Detailed query and processing info
LOG_LEVEL=INFO     # Standard operational messages  
LOG_LEVEL=WARNING  # Issues that don't break functionality
LOG_LEVEL=ERROR    # Critical errors requiring attention
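
A rotating handler matching the 5MB app.log setup above can be sketched with the standard library (hypothetical backup count and format; app/utils/logging.py may configure it differently):

import logging
import os
from logging.handlers import RotatingFileHandler

handler = RotatingFileHandler("app.log", maxBytes=5 * 1024 * 1024, backupCount=3)
logging.basicConfig(
    level=os.getenv("LOG_LEVEL", "INFO"),
    handlers=[handler],
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logging.getLogger("tablerag").info("pipeline ready")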

Key Metrics Logged:

  • File processing times and success rates
  • SQL query execution and results
  • Vector search performance
  • LLM API response times and errors

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Commit changes: git commit -m 'Add amazing feature'
  4. Push to branch: git push origin feature/amazing-feature
  5. Open a Pull Request

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Acknowledgments

  • Groq - Fast LLM inference
  • Hugging Face - Transformer models and embeddings
  • FAISS - Efficient similarity search
  • Streamlit - Rapid web app development
  • SQLite - Embedded database engine

Built with ❤️ for intelligent document analysis
