A natural language chatbot for querying portfolio holdings and trades data using advanced RAG (Retrieval-Augmented Generation) architecture for dynamic SQL generation.
We implemented RAG-Enhanced Dynamic SQL Generation architecture for these key reasons:
- Dynamic SQL: Generates queries in real-time works with any schema without hardcoding.
- Context-Aware: Handles complex, unseen questions by understanding schema relationships.
- RAG-Driven: Retrieves similar valid SQL examples to improve generation accuracy.
- Secure: Implements multi-layer SQL injection prevention.
- Self-Learning: Automatically feeds successful queries back into the system to improve future performance.
| Component | Technology | Purpose |
|---|---|---|
| LLM | Groq Llama 3.3 70B | Natural language to SQL conversion |
| Embeddings | SentenceTransformers (all-MiniLM-L6-v2) | Convert text to vectors for RAG |
| Vector Store | FAISS | Fast similarity search for SQL examples |
| Database | SQLite | Store and query CSV data |
| Framework | LangChain | Orchestrate LLM and database interactions |
| Data Processing | Pandas, NumPy | CSV manipulation and cleaning |
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β CSV FILES β
β holdings.csv (1,022 rows Γ 25 cols) β
β trades.csv (649 rows Γ 31 cols) β
ββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββββ
β Pandas ETL Pipeline
β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β SQLITE DATABASE β
β β’ Column name sanitization (remove spaces/special chars) β
β β’ Type inference (INTEGER, REAL, TEXT) β
β β’ Indexed tables for fast queries β
ββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββββ
β
ββββββββββββ΄βββββββββββ
β β
ββββββββββββββββββββ ββββββββββββββββββββββββββββ
β SCHEMA EXTRACTORβ β SQL SAMPLE GENERATOR β
β β’ Table names β β β’ Pattern detection β
β β’ Column types β β β’ 5 base examples β
β β’ Relationships β β β’ Question templates β
ββββββββββ¬ββββββββββ ββββββββββββ¬ββββββββββββββββ
β β
β β
β ββββββββββββββββββββββββββββ
β β SENTENCE TRANSFORMER β
β β β’ Encode to vectors β
β β β’ 384-dim embeddings β
β ββββββββββββ¬ββββββββββββββββ
β β
β β
β ββββββββββββββββββββββββββββ
β β FAISS INDEX β
β β β’ L2 distance search β
β β β’ Top-k retrieval β
β β β’ Self-learning updatesβ
β ββββββββββββ¬ββββββββββββββββ
β β
βββββββββββββββββββββββββ
β
β
βββββββββββββββββββββββββββ
β PROMPT BUILDER β
β Schema + Examples + β
β Question β LLM Prompt β
βββββββββββ¬ββββββββββββββββ
β
β
βββββββββββββββββββββββββββ
β GROQ LLM API β
β Llama 3.3 70B β
β Temperature: 0 β
βββββββββββ¬ββββββββββββββββ
β
β SQL Query
βββββββββββββββββββββββββββ
β SQL GUARDRAILS β
β β’ Starts with SELECT? β
β β’ No forbidden words? β
β β’ Auto-LIMIT injection β
βββββββββββ¬ββββββββββββββββ
β
β
βββββββββββββββββββββββββββ
β SQL EXECUTOR β
β pandas.read_sql_query β
βββββββββββ¬ββββββββ¬ββββββββ
β β (Success)
β β
βββββββββββ΄ββββββββ΄ββββββββ
β SELF-LEARNING RAG β
β β’ Stores valid SQL β
β β’ Updates Vector Store β
βββββββββββ¬ββββββββββββββββ
β
β
βββββββββββββββββββββββββββ
β RESULT FORMATTER β
β β’ Single value β
β β’ Table (β€10 rows) β
β β’ Preview (>10 rows) β
βββββββββββ¬ββββββββββββββββ
β
β
USER ANSWER
No Python installation needed. Just install Docker.
# 1. Clone the repo
git clone <repository_url>
cd <project_directory>
# 2. Add your API Key
# Create a .env file and add: GROQ_API_KEY=your_key_here
# 3. Run the App
docker-compose up --buildThen open the Jupyter link printed in the terminal.
1. Setup Environment
# Windows
py -3.11 -m venv env311
.\env311\Scripts\activate
# Mac/Linux
python3.11 -m venv env311
source env311/bin/activate2. Install Dependencies
pip install -r requirements.txt3. Setup API Key
Create a .env file in the project root:
GROQ_API_KEY=your_api_key_here4. Run the Notebook
jupyter notebook trade_portfolio.ipynb- Database Size: 1,671 total records (1,022 holdings + 649 trades)
- Schema Columns: 56 total columns across 2 tables
- RAG Examples: 5 auto-generated SQL samples
- Embedding Dimension: 384 (MiniLM model)
- Query Latency: ~2-5 seconds (Groq API + SQL execution)
- Accuracy: 100% on in-scope questions, 100% fallback on out-of-scope
holdings_df = pd.read_csv("holdings.csv") # 1,022 rows Γ 25 columns
trades_df = pd.read_csv("trades.csv") # 649 rows Γ 31 columnsschema_info = {
'holdings': {
'columns': ['AsOfDate', 'PortfolioName', 'MV_Base', 'PL_YTD', ...]
},
'trades': {
'columns': ['TradeTypeName', 'SecurityId', 'Quantity', ...]
}
}sql_samples = [
{
"question": "How many total holdings are there?",
"sql": "SELECT COUNT(*) FROM holdings;",
"explanation": "Counts all rows"
},
# ... 4 more auto-generated samples
]embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = embedding_model.encode([sample['question'] for sample in sql_samples])
index = faiss.IndexFlatL2(dimension=384)
index.add(embeddings)def querygpt_chatbot(question):
# Step 1: Retrieve similar SQL examples (RAG)
relevant_docs = retrieve_relevant_examples(question, top_k=2)
# Step 2: Build prompt with schema + examples
prompt = f"{schema_text}\n{examples}\nQUESTION: {question}"
# Step 3: Generate SQL via LLM
sql = llm.invoke(prompt).content.strip()
# Step 4: SQL Guardrails (Safety + Auto-LIMIT)
if not is_safe_sql(sql): # Block DROP, DELETE, etc.
return "Only SELECT queries allowed"
sql = enforce_limit(sql, limit=100) # Add LIMIT if missing
# Step 5: Execute query
result = pd.read_sql_query(sql, engine)
# Step 6: Self-Learning RAG - Store successful queries
add_success_to_rag(question, sql) # Updates vector store dynamically
# Step 7: Format and return results
return format_result(result)The following optimizations can be done when dealing with larger datasets, currently not executed due to the small dataset size.
1. Intent Classification Agent: Routes queries into COUNT / PNL / FILTER / TOP-N categories. Saves LLM tokens and reduces SQL retries.
2. Table Selection Agent: Selects only relevant tables before SQL generation. Reduces schema size and confusion.
3. Column Pruning Agent: Sends only relevant columns instead of full schemas. Major token cost savings in large databases.
4. Metadata Gateway: Dynamically injects schema, data types, statistics, and sample rows. Prevents hallucinated columns.
5. Retry with Schema Expansion: If SQL fails due to missing columns, retry with expanded schema.
6. Query Audit Logging: Logs question β SQL β execution time β status for monitoring.