📊 TableRAG - Retrieval-Augmented Generation for Tables + Text

TableRAG is an advanced question-answering framework that combines structured tabular data (CSV files) and unstructured text documents (PDF, DOCX, TXT, MD) using Retrieval-Augmented Generation (RAG). Ask natural language questions and get intelligent answers that leverage both your data tables and text content.

Python 3.12+ · Streamlit · MIT License


🚀 Features

✅ Multi-modal Document Support: CSV tables, PDF documents, Word files, Markdown, and plain text
✅ Hybrid RAG Architecture: Combines SQL execution (precise) + vector search (semantic)
✅ Interactive Streamlit UI: Drag-and-drop uploads with real-time processing
✅ Intelligent Query Processing: LLM-powered query decomposition and answer synthesis
✅ Advanced Data Handling: Auto-encoding detection, CSV dialect sniffing, column type inference
✅ Comprehensive Error Handling: Graceful fallbacks and detailed debug information
✅ In-Memory Processing: Fast iteration without persistent storage requirements
✅ CLI Support: Command-line interface for batch processing


🎥 Demo

Interface Screenshot

TableRAG Interface

Video Walkthrough

🎬 ▶️ Watch Full Demo Video - Complete walkthrough of TableRAG features

The interface demonstrates the clean, intuitive design with:

  • 📁 Drag-and-drop file upload (CSV, PDF, DOCX, TXT, MD)
  • ⚡ Real-time processing with progress indicators
  • 🔧 Debug mode with SQL query inspection
  • 💬 Interactive Q&A with comprehensive answers

πŸ—οΈ Architecture Flow

graph TD
    A[πŸ“ File Upload] --> B{File Type?}
    B -->|CSV| C[πŸ—ƒοΈ CSV Parser]
    B -->|PDF/DOCX/TXT| D[πŸ“„ Text Extractor]
    
    C --> E[🧠 SQL Schema Generation]
    E --> F[πŸ’Ύ SQLite In-Memory DB]
    
    D --> G[βœ‚οΈ Text Chunking]
    G --> H[πŸ”€ Sentence Transformers]
    H --> I[πŸ” FAISS Vector Index]
    
    J[❓ User Query] --> K[πŸ€– Query Decomposition<br/>Groq LLM]
    
    K --> L[πŸ” Vector Search]
    I --> L
    L --> M[πŸ“š Retrieved Chunks]
    
    K --> N[πŸ’¬ NL2SQL Generation]
    F --> N
    N --> O[βš™οΈ SQL Execution]
    O --> P[πŸ“Š Query Results]
    
    M --> Q[🎯 Answer Synthesis<br/>Groq LLM]
    P --> Q
    Q --> R[βœ… Final Answer]
Loading

Core Components:

  1. Document Ingestion: Multi-format file processing with validation
  2. Dual Storage: SQLite tables + FAISS vector embeddings
  3. Query Intelligence: LLM-powered query understanding and decomposition
  4. Hybrid Retrieval: SQL precision + semantic search
  5. Answer Generation: Context-aware response synthesis
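
A minimal sketch of how these five components fit together in app/pipeline/rag_pipeline.py. The names here (decompose, nl2sql, synthesize, get_table_schemas) are hypothetical; the real orchestration differs in detail:

def answer_question(query, llm, embedder, index, conn, top_k=5):
    # 1. Decompose the question into a retrieval query and a SQL task (LLM call)
    plan = llm.decompose(query)

    # 2. Semantic side: embed the retrieval query, search the FAISS index
    query_vec = embedder.encode([plan["retrieval_query"]])
    chunks = index.search(query_vec, top_k)

    # 3. Structured side: generate SQL against the known table schemas, then run it
    schemas = get_table_schemas(conn)  # hypothetical helper listing CREATE TABLE statements
    sql = llm.nl2sql(plan["sql_task"], schemas)
    rows = conn.execute(sql).fetchall()

    # 4. Synthesize the final answer from both evidence sources (LLM call)
    return llm.synthesize(query, chunks, rows)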

🗂️ Project Structure

TableRAG/
├── 🎯 Core Application
│   ├── streamlit_app.py          # Main Streamlit UI (268 lines)
│   ├── run.py                    # CLI interface
│   └── app/                      # Core logic modules
│       ├── config.py             # Environment configuration
│       ├── pipeline/
│       │   └── rag_pipeline.py   # Main RAG orchestration (227 lines)
│       ├── llm/
│       │   ├── groq_client.py    # Groq API integration
│       │   └── prompts.py        # LLM prompt templates
│       ├── database/
│       │   └── sql_executor.py   # SQLite operations (269 lines)
│       ├── embeddings/
│       │   └── embedder.py       # Sentence Transformers wrapper
│       ├── retrieval/
│       │   └── faiss_index.py    # FAISS vector operations
│       └── utils/
│           ├── ingest.py         # Multi-format file processing (321 lines)
│           ├── chunking.py       # Text segmentation
│           └── logging.py        # Centralized logging
│
├── 📁 Data & Storage
│   ├── data/                     # User data directories
│   │   ├── tables/               # CSV files (persistent)
│   │   └── texts/                # Text documents (persistent)
│   ├── db/                       # SQLite databases
│   │   └── tables.db             # Persistent database (optional)
│   └── index/                    # FAISS index files
│       └── faiss.index           # Vector index (persistent)
│
├── 🎬 Assets & Documentation
│   ├── assets/
│   │   ├── Screenshot 2025-10-09 230417.png    # UI demo
│   │   └── Screen Recording 2025-10-09 225828.mp4  # Video demo
│   ├── test_assets/              # Sample files for testing
│   │   ├── test.csv
│   │   ├── report.pdf
│   │   └── report.html
│   └── README.md                 # This documentation
│
├── ⚙️ Configuration
│   ├── requirements.txt          # Python dependencies
│   ├── .env                      # Environment variables (create this)
│   ├── .gitignore                # Git exclusions
│   └── helper.py                 # Development utilities
│
└── 🐍 Virtual Environment
    └── trag/                     # Python virtual environment

🛠️ Installation & Setup

Prerequisites

  • Python 3.12+ (recommended)
  • Groq API Key (for LLM access)
  • Git (for cloning)

1. Clone Repository

git clone https://github.com/HemaKumar0077/TableRAG
cd TableRAG

2. Create Virtual Environment

# Windows
python -m venv trag
trag\Scripts\activate

# macOS/Linux  
python3 -m venv trag
source trag/bin/activate

3. Install Dependencies

pip install -r requirements.txt

4. Configure Environment

Create a .env file in the project root:

# ===== REQUIRED CONFIGURATION =====
GROQ_API_KEY=gsk_your_groq_api_key_here

# ===== OPTIONAL CONFIGURATION =====
# Embedding Model (Hugging Face)
EMBEDDING_MODEL_NAME=sentence-transformers/all-MiniLM-L6-v2

# Database Settings
DB_TYPE=sqlite
SQLITE_DB_PATH=db/tables.db

# FAISS Index Configuration  
FAISS_INDEX_PATH=index/faiss.index

# Retrieval Parameters
TOP_K_RETRIEVAL=5
MAX_ITERATIONS=1

# Logging
LOG_LEVEL=INFO
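
For reference, a minimal sketch of how app/config.py could load these variables with python-dotenv (hypothetical; the actual module may differ):

# Sketch only: assumes python-dotenv; the real app/config.py may differ.
import os
from dotenv import load_dotenv

load_dotenv()  # pulls .env from the project root into the environment

GROQ_API_KEY = os.environ["GROQ_API_KEY"]  # required; raises KeyError if missing
EMBEDDING_MODEL_NAME = os.getenv("EMBEDDING_MODEL_NAME", "sentence-transformers/all-MiniLM-L6-v2")
SQLITE_DB_PATH = os.getenv("SQLITE_DB_PATH", "db/tables.db")
FAISS_INDEX_PATH = os.getenv("FAISS_INDEX_PATH", "index/faiss.index")
TOP_K_RETRIEVAL = int(os.getenv("TOP_K_RETRIEVAL", "5"))
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")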

📋 How to get a Groq API Key:

  1. Visit console.groq.com
  2. Sign up/login with your account
  3. Navigate to "API Keys" section
  4. Create a new API key
  5. Copy and paste into your .env file

5. Create Required Directories

# macOS/Linux (on Windows, create the folders manually or run this in Git Bash)
mkdir -p data/tables data/texts db index

🚀 Usage

Option 1: Streamlit Web Interface (Recommended)

streamlit run streamlit_app.py

Features:

  • 🖱️ Drag & Drop: Upload CSV, PDF, DOCX, TXT, MD files
  • ⚡ Real-time Processing: See upload progress and validation
  • 🔧 Debug Mode: Inspect SQL queries and execution details
  • 📊 Interactive Results: View data tables and text chunks
  • ⚠️ Error Handling: Clear feedback on processing issues

Workflow:

  1. Upload Files: Drag CSV files (→ tables) and text files (→ chunks)
  2. Process Documents: Click "🚀 Process Documents"
  3. Ask Questions: Type natural language queries
  4. Get Answers: View synthesized responses with debug info

Option 2: Command Line Interface

python run.py

Example Session:

🔍 TableRAG CLI
Ask a question based on your text and table knowledge base.

🧠 Enter your question: What was the total revenue by region?
✅ Answer: Based on the sales data, the total revenue by region is...

--- Debug Info ---
📚 Retrieved Chunks: [relevant text excerpts]
📄 SQL Query: SELECT region, SUM(revenue) FROM sales_data GROUP BY region
🧾 SQL Result: [{"region": "North", "revenue": 150000}, ...]

💡 Example Queries

📊 Table Analysis:

  • "What is the total sales revenue across all regions?"
  • "Which product had the highest growth rate?"
  • "Show me all customers with orders above $10,000"
  • "What is the average age of customers by location?"

📄 Document Search:

  • "What are the key findings from the uploaded reports?"
  • "Summarize the main recommendations in the documents"
  • "What challenges were mentioned in the analysis?"

🔗 Hybrid Queries:

  • "Based on the sales data, what do the reports say about market trends?"
  • "Compare the revenue figures with the strategic recommendations"

πŸ—οΈ Technical Architecture

🧠 LLM Integration (Groq)

  • Model: Llama-3.3-70B-Versatile
  • API: OpenAI-compatible REST interface
  • Functions: Query decomposition, SQL generation, answer synthesis
  • Timeout: 30-second request limit with retry logic
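
Because the interface is OpenAI-compatible, a chat completion can be sketched with plain requests (illustrative only; app/llm/groq_client.py may use a client library instead):

import os
import requests

# Illustrative call following the OpenAI chat-completions schema on Groq's endpoint
resp = requests.post(
    "https://api.groq.com/openai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"},
    json={
        "model": "llama-3.3-70b-versatile",
        "messages": [{"role": "user", "content": "What was total revenue by region?"}],
    },
    timeout=30,  # matches the 30-second request limit noted above
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])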

🔍 Vector Search (FAISS)

  • Algorithm: Inner Product (IP) for cosine similarity
  • Embeddings: Sentence Transformers (384-dim by default)
  • Storage: In-memory with optional persistence
  • Performance: Sub-second search on 100K+ chunks
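
Inner product equals cosine similarity once vectors are L2-normalized, which is the pattern sketched below (illustrative, not the project's exact code):

import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dim output

chunks = ["Revenue grew 12% in the North region.", "The report flags supply-chain risks."]
emb = model.encode(chunks).astype("float32")
faiss.normalize_L2(emb)                  # normalize so inner product == cosine similarity
index = faiss.IndexFlatIP(emb.shape[1])  # exact inner-product index
index.add(emb)

query = model.encode(["What happened to revenue?"]).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 2)     # top-k nearest chunks
print(ids[0], scores[0])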

🗄️ Database Operations (SQLite)

  • Connection: Thread-safe, in-memory primary storage
  • Features: Auto-schema inference, type detection, sanitization
  • Safety: SQL injection protection, transaction management
  • Validation: Comprehensive error handling and rollback
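
As a rough illustration, loading a parsed CSV into the in-memory database could look like this (a sketch assuming pandas; the actual sql_executor.py logic is more involved):

import sqlite3
import pandas as pd

# check_same_thread=False lets the connection be shared across threads (sketch only)
conn = sqlite3.connect(":memory:", check_same_thread=False)

df = pd.read_csv("data/tables/test.csv")                  # pandas infers column types
df.to_sql("test", conn, index=False, if_exists="replace")

# Parameterized queries keep user-derived values out of the SQL string
rows = conn.execute("SELECT * FROM test LIMIT ?", (5,)).fetchall()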

πŸ“ File Processing Pipeline

# Supported formats and processing
SUPPORTED_FORMATS = {
    'CSV': 'Parsed → SQLite tables with type inference',
    'PDF': 'Text extraction → chunked → vectorized',
    'DOCX': 'Content extraction → chunked → vectorized',
    'TXT/MD': 'Direct chunking → vectorized',
}
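
The chunking step for PDF/DOCX/TXT/MD content can be sketched as fixed-size windows with overlap, so sentences spanning a boundary appear in both chunks (hypothetical sizes; app/utils/chunking.py may differ):

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks (sketch only)."""
    step = size - overlap
    return [text[start:start + size] for start in range(0, max(len(text) - overlap, 1), step)]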

🔧 Configuration Options

Variable              Default            Description
GROQ_API_KEY          (required)         Your Groq API authentication key
EMBEDDING_MODEL_NAME  all-MiniLM-L6-v2   Hugging Face model for embeddings
SQLITE_DB_PATH        db/tables.db       Persistent SQLite database location
FAISS_INDEX_PATH      index/faiss.index  FAISS vector index file path
TOP_K_RETRIEVAL       5                  Number of text chunks to retrieve
LOG_LEVEL             INFO               Logging verbosity (DEBUG/INFO/WARNING/ERROR)

🐛 Troubleshooting

Common Issues

❌ "Failed to load embedding model"

# Solution: Install/update transformers
pip install --upgrade sentence-transformers torch

❌ "Groq API authentication failed"

# Check that your .env file contains the correct API key
grep GROQ_API_KEY .env   # should print your key line

❌ "CSV parsing errors"

  • Cause: Encoding issues or malformed CSV
  • Solution: Check file encoding, verify CSV structure
  • Debug: Enable "Show Debug Information" in UI

❌ "Empty query results"

  • Cause: No relevant data found
  • Solution: Verify files were processed successfully
  • Check: File Information sidebar shows loaded tables/chunks

Performance Optimization

  • Large CSVs: Files auto-process in 1000-row batches (see the sketch after this list)
  • Memory Usage: Consider smaller TOP_K_RETRIEVAL values
  • Response Time: Use more specific queries for faster results
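
A minimal sketch of 1000-row batch ingestion using pandas' chunked CSV reader (illustrative; the project's ingest code may differ):

import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:", check_same_thread=False)

# Stream the CSV in 1000-row batches instead of loading it all at once
for batch in pd.read_csv("data/tables/big.csv", chunksize=1000):
    batch.to_sql("big", conn, index=False, if_exists="append")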

📊 Monitoring & Logging

Log Location: app.log (rotating, 5MB max)

Log Levels Available:

LOG_LEVEL=DEBUG    # Detailed query and processing info
LOG_LEVEL=INFO     # Standard operational messages  
LOG_LEVEL=WARNING  # Issues that don't break functionality
LOG_LEVEL=ERROR    # Critical errors requiring attention
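
A rotating handler matching the 5MB app.log setup above can be sketched with the standard library (hypothetical backup count and format; app/utils/logging.py may configure it differently):

import logging
import os
from logging.handlers import RotatingFileHandler

handler = RotatingFileHandler("app.log", maxBytes=5 * 1024 * 1024, backupCount=3)
logging.basicConfig(
    level=os.getenv("LOG_LEVEL", "INFO"),
    handlers=[handler],
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logging.getLogger("tablerag").info("pipeline ready")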

Key Metrics Logged:

  • File processing times and success rates
  • SQL query execution and results
  • Vector search performance
  • LLM API response times and errors

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Commit changes: git commit -m 'Add amazing feature'
  4. Push to branch: git push origin feature/amazing-feature
  5. Open a Pull Request

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Acknowledgments

  • Groq - Fast LLM inference
  • Hugging Face - Transformer models and embeddings
  • FAISS - Efficient similarity search
  • Streamlit - Rapid web app development
  • SQLite - Embedded database engine

Built with ❤️ for intelligent document analysis
