TableRAG is an advanced question-answering framework that combines structured tabular data (CSV files) and unstructured text documents (PDF, DOCX, TXT, MD) using Retrieval-Augmented Generation (RAG). Ask natural language questions and get intelligent answers that leverage both your data tables and text content.
- Multi-modal Document Support: CSV tables, PDF documents, Word files, Markdown, and plain text
- Hybrid RAG Architecture: Combines SQL execution (precise) + vector search (semantic)
- Interactive Streamlit UI: Drag-and-drop uploads with real-time processing
- Intelligent Query Processing: LLM-powered query decomposition and answer synthesis
- Advanced Data Handling: Auto-encoding detection, CSV dialect sniffing, column type inference
- Comprehensive Error Handling: Graceful fallbacks and detailed debug information
- In-Memory Processing: Fast iteration without persistent storage requirements
- CLI Support: Command-line interface for batch processing
Demo video: a complete walkthrough of TableRAG features is available at `assets/Screen Recording 2025-10-09 225828.mp4`.
The interface demonstrates the clean, intuitive design with:
- Drag-and-drop file upload (CSV, PDF, DOCX, TXT, MD)
- Real-time processing with progress indicators
- Debug mode with SQL query inspection
- Interactive Q&A with comprehensive answers
```mermaid
graph TD
    A[File Upload] --> B{File Type?}
    B -->|CSV| C[CSV Parser]
    B -->|PDF/DOCX/TXT| D[Text Extractor]
    C --> E[SQL Schema Generation]
    E --> F[SQLite In-Memory DB]
    D --> G[Text Chunking]
    G --> H[Sentence Transformers]
    H --> I[FAISS Vector Index]
    J[User Query] --> K[Query Decomposition<br/>Groq LLM]
    K --> L[Vector Search]
    I --> L
    L --> M[Retrieved Chunks]
    K --> N[NL2SQL Generation]
    F --> N
    N --> O[SQL Execution]
    O --> P[Query Results]
    M --> Q[Answer Synthesis<br/>Groq LLM]
    P --> Q
    Q --> R[Final Answer]
```
Core Components:
- Document Ingestion: Multi-format file processing with validation
- Dual Storage: SQLite tables + FAISS vector embeddings
- Query Intelligence: LLM-powered query understanding and decomposition
- Hybrid Retrieval: SQL precision + semantic search
- Answer Generation: Context-aware response synthesis
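In code, the orchestration reduces to four steps. The sketch below is illustrative only; the callables stand in for the real modules (`groq_client`, `faiss_index`, `sql_executor`), whose actual interfaces may differ.

```python
# Minimal sketch of the hybrid retrieval flow; the four callables stand in
# for the Groq, FAISS, and SQLite components (illustrative API, not the
# actual rag_pipeline.py interface).
from typing import Callable

def answer_question(
    query: str,
    decompose: Callable[[str], dict],            # Groq LLM: split the query
    vector_search: Callable[[str, int], list],   # FAISS: top-k text chunks
    run_sql: Callable[[str], list],              # SQLite: execute NL2SQL output
    synthesize: Callable[..., str],              # Groq LLM: final answer
    top_k: int = 5,
) -> str:
    # Assumed decomposition output: {"text": ..., "sql": ...}
    parts = decompose(query)                          # 1. query decomposition
    chunks = vector_search(parts["text"], top_k)      # 2. semantic retrieval
    rows = run_sql(parts["sql"])                      # 3. precise SQL retrieval
    return synthesize(query, chunks=chunks, rows=rows)  # 4. answer synthesis
```

The project layout below shows where each of these stages lives.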
```
TableRAG/
├── Core Application
│   ├── streamlit_app.py            # Main Streamlit UI (268 lines)
│   ├── run.py                      # CLI interface
│   └── app/                        # Core logic modules
│       ├── config.py               # Environment configuration
│       ├── pipeline/
│       │   └── rag_pipeline.py     # Main RAG orchestration (227 lines)
│       ├── llm/
│       │   ├── groq_client.py      # Groq API integration
│       │   └── prompts.py          # LLM prompt templates
│       ├── database/
│       │   └── sql_executor.py     # SQLite operations (269 lines)
│       ├── embeddings/
│       │   └── embedder.py         # Sentence Transformers wrapper
│       ├── retrieval/
│       │   └── faiss_index.py      # FAISS vector operations
│       └── utils/
│           ├── ingest.py           # Multi-format file processing (321 lines)
│           ├── chunking.py         # Text segmentation
│           └── logging.py          # Centralized logging
│
├── Data & Storage
│   ├── data/                       # User data directories
│   │   ├── tables/                 # CSV files (persistent)
│   │   └── texts/                  # Text documents (persistent)
│   ├── db/                         # SQLite databases
│   │   └── tables.db               # Persistent database (optional)
│   └── index/                      # FAISS index files
│       └── faiss.index             # Vector index (persistent)
│
├── Assets & Documentation
│   ├── assets/
│   │   ├── Screenshot 2025-10-09 230417.png        # UI demo
│   │   └── Screen Recording 2025-10-09 225828.mp4  # Video demo
│   ├── test_assets/                # Sample files for testing
│   │   ├── test.csv
│   │   ├── report.pdf
│   │   └── report.html
│   └── README.md                   # This documentation
│
├── Configuration
│   ├── requirements.txt            # Python dependencies
│   ├── .env                        # Environment variables (create this)
│   ├── .gitignore                  # Git exclusions
│   └── helper.py                   # Development utilities
│
└── Virtual Environment
    └── trag/                       # Python virtual environment
```
- Python 3.12+ (recommended)
- Groq API Key (for LLM access)
- Git (for cloning)
```bash
git clone https://github.com/HemaKumar0077/TableRAG
cd TableRAG
```

```bash
# Windows
python -m venv trag
trag\Scripts\activate

# macOS/Linux
python3 -m venv trag
source trag/bin/activate
```

```bash
pip install -r requirements.txt
```

Create a `.env` file in the project root:
```env
# ===== REQUIRED CONFIGURATION =====
GROQ_API_KEY=gsk_your_groq_api_key_here

# ===== OPTIONAL CONFIGURATION =====
# Embedding Model (Hugging Face)
EMBEDDING_MODEL_NAME=sentence-transformers/all-MiniLM-L6-v2

# Database Settings
DB_TYPE=sqlite
SQLITE_DB_PATH=db/tables.db

# FAISS Index Configuration
FAISS_INDEX_PATH=index/faiss.index

# Retrieval Parameters
TOP_K_RETRIEVAL=5
MAX_ITERATIONS=1

# Logging
LOG_LEVEL=INFO
```

How to get a Groq API Key:
- Visit [console.groq.com](https://console.groq.com)
- Sign up/login with your account
- Navigate to the "API Keys" section
- Create a new API key
- Copy and paste it into your `.env` file
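These variables are read at startup. As a rough sketch of how they map to Python (assuming `python-dotenv`; the actual `app/config.py` may differ):

```python
# Sketch of environment loading with python-dotenv; app/config.py may differ.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root

GROQ_API_KEY = os.environ["GROQ_API_KEY"]  # required; fail fast if missing
EMBEDDING_MODEL_NAME = os.getenv(
    "EMBEDDING_MODEL_NAME", "sentence-transformers/all-MiniLM-L6-v2"
)
SQLITE_DB_PATH = os.getenv("SQLITE_DB_PATH", "db/tables.db")
FAISS_INDEX_PATH = os.getenv("FAISS_INDEX_PATH", "index/faiss.index")
TOP_K_RETRIEVAL = int(os.getenv("TOP_K_RETRIEVAL", "5"))
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
```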
Create the required data directories:

```bash
mkdir -p data/tables data/texts db index
```

Launch the Streamlit app:

```bash
streamlit run streamlit_app.py
```

Features:
- Drag & Drop: Upload CSV, PDF, DOCX, TXT, MD files
- Real-time Processing: See upload progress and validation
- Debug Mode: Inspect SQL queries and execution details
- Interactive Results: View data tables and text chunks
- Error Handling: Clear feedback on processing issues
Workflow:
- Upload Files: Drag in CSV files (→ tables) and text files (→ chunks)
- Process Documents: Click "Process Documents"
- Ask Questions: Type natural language queries
- Get Answers: View synthesized responses with debug info
```bash
python run.py
```

Example Session:

```
TableRAG CLI
Ask a question based on your text and table knowledge base.

Enter your question: What was the total revenue by region?

Answer: Based on the sales data, the total revenue by region is...

--- Debug Info ---
Retrieved Chunks: [relevant text excerpts]
SQL Query: SELECT region, SUM(revenue) FROM sales_data GROUP BY region
SQL Result: [{"region": "North", "revenue": 150000}, ...]
```
Table Analysis:
- "What is the total sales revenue across all regions?"
- "Which product had the highest growth rate?"
- "Show me all customers with orders above $10,000"
- "What is the average age of customers by location?"

Document Search:
- "What are the key findings from the uploaded reports?"
- "Summarize the main recommendations in the documents"
- "What challenges were mentioned in the analysis?"

Hybrid Queries:
- "Based on the sales data, what do the reports say about market trends?"
- "Compare the revenue figures with the strategic recommendations"
Groq LLM:
- Model: Llama-3.3-70B-Versatile
- API: OpenAI-compatible REST interface
- Functions: Query decomposition, SQL generation, answer synthesis
- Timeout: 30-second request limit with retry logic
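Because the API is OpenAI-compatible, a call can be as simple as the sketch below. This is illustrative only, not the actual `groq_client.py`; the real client may use Groq's official SDK instead.

```python
# Sketch of an OpenAI-compatible Groq chat call with a 30 s timeout and retry.
import os
import requests

GROQ_URL = "https://api.groq.com/openai/v1/chat/completions"

def chat(prompt: str, retries: int = 2) -> str:
    headers = {"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"}
    payload = {
        "model": "llama-3.3-70b-versatile",
        "messages": [{"role": "user", "content": prompt}],
    }
    last_error = None
    for _ in range(retries):
        try:
            resp = requests.post(GROQ_URL, json=payload, headers=headers, timeout=30)
            resp.raise_for_status()
            return resp.json()["choices"][0]["message"]["content"]
        except requests.RequestException as err:
            last_error = err  # retry on timeouts and transient errors
    raise RuntimeError(f"Groq request failed: {last_error}")
```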
FAISS Vector Search:
- Algorithm: Inner Product (IP) for cosine similarity
- Embeddings: Sentence Transformers (384-dim by default)
- Storage: In-memory with optional persistence
- Performance: Sub-second search on 100K+ chunks
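Inner product equals cosine similarity only on unit vectors, so embeddings must be L2-normalized before indexing. A minimal sketch of this pattern (assuming the default MiniLM model; not the actual `faiss_index.py`):

```python
# Minimal sketch: cosine similarity via inner product on normalized vectors.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dim

chunks = ["Revenue grew in the North region.", "The report lists key risks."]
embeddings = model.encode(chunks).astype("float32")
faiss.normalize_L2(embeddings)             # unit vectors -> IP == cosine

index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

query = model.encode(["Which region grew?"]).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, k=2)     # top-k most similar chunks
```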
SQLite Database:
- Connection: Thread-safe, in-memory primary storage
- Features: Auto-schema inference, type detection, sanitization
- Safety: SQL injection protection, transaction management
- Validation: Comprehensive error handling and rollback
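A sketch of the in-memory, thread-safe setup with pandas-driven schema inference. The file path and table name are hypothetical, and the real `sql_executor.py` adds sanitization and rollback on top of this:

```python
# Sketch: thread-safe in-memory SQLite with schema inferred from a DataFrame.
import sqlite3
import pandas as pd

# check_same_thread=False lets Streamlit's threads share the connection;
# the real executor should still guard access with a lock.
conn = sqlite3.connect(":memory:", check_same_thread=False)

df = pd.read_csv("data/tables/sales.csv")  # hypothetical example file
# to_sql infers column types; chunksize batches inserts (see performance notes).
df.to_sql("sales_data", conn, index=False, if_exists="replace", chunksize=1000)

rows = conn.execute(
    "SELECT region, SUM(revenue) FROM sales_data GROUP BY region"
).fetchall()
```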
```python
# Supported formats and how each one is processed
SUPPORTED_FORMATS = {
    'CSV': 'Parsed → SQLite tables with type inference',
    'PDF': 'Text extraction → chunked → vectorized',
    'DOCX': 'Content extraction → chunked → vectorized',
    'TXT/MD': 'Direct chunking → vectorized',
}
```
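Text-bearing formats are segmented before embedding. A minimal sketch of an overlapping word-window chunker; the sizes are illustrative assumptions, not the actual defaults in `app/utils/chunking.py`:

```python
# Sketch of overlapping word-window chunking; sizes are illustrative,
# not the actual defaults in app/utils/chunking.py.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # stride; overlap preserves cross-chunk context
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks
```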
| Variable | Default | Description |
|---|---|---|
| `GROQ_API_KEY` | *Required* | Your Groq API authentication key |
| `EMBEDDING_MODEL_NAME` | `all-MiniLM-L6-v2` | Hugging Face model for embeddings |
| `SQLITE_DB_PATH` | `db/tables.db` | Persistent SQLite database location |
| `FAISS_INDEX_PATH` | `index/faiss.index` | FAISS vector index file path |
| `TOP_K_RETRIEVAL` | `5` | Number of text chunks to retrieve |
| `LOG_LEVEL` | `INFO` | Logging verbosity (DEBUG/INFO/WARNING/ERROR) |
β "Failed to load embedding model"
# Solution: Install/update transformers
pip install --upgrade sentence-transformers torchβ "Groq API authentication failed"
# Check your .env file has the correct API key
echo $GROQ_API_KEY # Should show your keyβ "CSV parsing errors"
- Cause: Encoding issues or malformed CSV
- Solution: Check file encoding, verify CSV structure
- Debug: Enable "Show Debug Information" in UI
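If you want to debug a stubborn file yourself, the same auto-detection can be reproduced with the standard library. A sketch only; `ingest.py` may use different heuristics:

```python
# Sketch: reproduce encoding detection and dialect sniffing with the stdlib.
import csv

def read_csv_rows(path: str) -> list[list[str]]:
    # Try common encodings; latin-1 never fails to decode, so it comes last.
    for encoding in ("utf-8-sig", "utf-8", "latin-1"):
        try:
            with open(path, newline="", encoding=encoding) as f:
                sample = f.read(4096)
                try:
                    dialect = csv.Sniffer().sniff(sample)  # detect delimiter/quoting
                except csv.Error:
                    dialect = csv.excel  # fall back to the default dialect
                f.seek(0)
                return list(csv.reader(f, dialect))
        except UnicodeDecodeError:
            continue  # wrong encoding, try the next one
    raise ValueError(f"Could not decode {path}")
```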
β "Empty query results"
- Cause: No relevant data found
- Solution: Verify files were processed successfully
- Check: File Information sidebar shows loaded tables/chunks
- Large CSVs: Files auto-process in 1000-row batches
- Memory Usage: Consider smaller `TOP_K_RETRIEVAL` values
- Response Time: Use more specific queries for faster results
Log Location: `app.log` (rotating, 5 MB max)

Log Levels Available:

```env
LOG_LEVEL=DEBUG    # Detailed query and processing info
LOG_LEVEL=INFO     # Standard operational messages
LOG_LEVEL=WARNING  # Issues that don't break functionality
LOG_LEVEL=ERROR    # Critical errors requiring attention
```

Key Metrics Logged:
- File processing times and success rates
- SQL query execution and results
- Vector search performance
- LLM API response times and errors
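A rotating-file setup matching the description above can be built with the standard library. This is a sketch (the `backupCount` and format string are assumptions); the actual `app/utils/logging.py` may differ:

```python
# Sketch: rotating file handler matching the app.log / 5 MB description.
import logging
import os
from logging.handlers import RotatingFileHandler

handler = RotatingFileHandler("app.log", maxBytes=5 * 1024 * 1024, backupCount=3)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
)

logger = logging.getLogger("tablerag")
logger.setLevel(os.getenv("LOG_LEVEL", "INFO"))  # accepts level names as strings
logger.addHandler(handler)
```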
1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Commit changes: `git commit -m 'Add amazing feature'`
4. Push to the branch: `git push origin feature/amazing-feature`
5. Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Groq - Fast LLM inference
- Hugging Face - Transformer models and embeddings
- FAISS - Efficient similarity search
- Streamlit - Rapid web app development
- SQLite - Embedded database engine
Built with ❤️ for intelligent document analysis