basita512/TradePortfolio_Chatbot

Trade-portfolio Chatbot

A natural language chatbot for querying portfolio holdings and trades data, using a RAG (Retrieval-Augmented Generation) architecture for dynamic SQL generation.

Why RAG-Enhanced Dynamic SQL Generation Architecture?

We implemented a RAG-enhanced dynamic SQL generation architecture for these key reasons:

  • Dynamic SQL: Generates queries in real time, so it works with any schema without hardcoding.
  • Context-Aware: Handles complex, unseen questions by understanding schema relationships.
  • RAG-Driven: Retrieves similar valid SQL examples to improve generation accuracy.
  • Secure: Implements multi-layer SQL injection prevention.
  • Self-Learning: Automatically feeds successful queries back into the system to improve future performance.

Technical Stack

| Component | Technology | Purpose |
|-----------|------------|---------|
| LLM | Groq Llama 3.3 70B | Natural language to SQL conversion |
| Embeddings | SentenceTransformers (all-MiniLM-L6-v2) | Convert text to vectors for RAG |
| Vector Store | FAISS | Fast similarity search for SQL examples |
| Database | SQLite | Store and query CSV data |
| Framework | LangChain | Orchestrate LLM and database interactions |
| Data Processing | Pandas, NumPy | CSV manipulation and cleaning |

πŸ—οΈ System Architecture Diagram

┌──────────────────────────────────────────────────────────────┐
│                          CSV FILES                           │
│             holdings.csv (1,022 rows × 25 cols)              │
│               trades.csv (649 rows × 31 cols)                │
└──────────────────┬───────────────────────────────────────────┘
                   │ Pandas ETL Pipeline
                   ↓
┌──────────────────────────────────────────────────────────────┐
│                   SQLITE DATABASE                            │
│  • Column name sanitization (remove spaces/special chars)    │
│  • Type inference (INTEGER, REAL, TEXT)                      │
│  • Indexed tables for fast queries                           │
└──────────────────┬───────────────────────────────────────────┘
                   │
         ┌─────────┴───────────┐
         ↓                     ↓
┌──────────────────┐  ┌──────────────────────────┐
│  SCHEMA EXTRACTOR│  │  SQL SAMPLE GENERATOR    │
│  • Table names   │  │  • Pattern detection     │
│  • Column types  │  │  • 5 base examples       │
│  • Relationships │  │  • Question templates    │
└────────┬─────────┘  └──────────┬───────────────┘
         │                       │
         │                       ↓
         │            ┌──────────────────────────┐
         │            │   SENTENCE TRANSFORMER   │
         │            │   • Encode to vectors    │
         │            │   • 384-dim embeddings   │
         │            └──────────┬───────────────┘
         │                       │
         │                       ↓
         │            ┌──────────────────────────┐
         │            │      FAISS INDEX         │
         │            │   • L2 distance search   │
         │            │   • Top-k retrieval      │
         │            │   • Self-learning updates│
         │            └──────────┬───────────────┘
         │                       │
         └─────────┬─────────────┘
                   │
                   ↓
         ┌─────────────────────────┐
         │  PROMPT BUILDER         │
         │  Schema + Examples +    │
         │  Question → LLM Prompt  │
         └─────────┬───────────────┘
                   │
                   ↓
         ┌─────────────────────────┐
         │   GROQ LLM API          │
         │   Llama 3.3 70B         │
         │   Temperature: 0        │
         └─────────┬───────────────┘
                   │
                   ↓ SQL Query
         ┌─────────────────────────┐
         │  SQL GUARDRAILS         │
         │  • Starts with SELECT?  │
         │  • No forbidden words?  │
         │  • Auto-LIMIT injection │
         └─────────┬───────────────┘
                   │
                   ↓
         ┌─────────────────────────┐
         │  SQL EXECUTOR           │
         │  pandas.read_sql_query  │
         └─────────┬───────┬───────┘
                   │       │ (Success)
                   ↓       ↓
         ┌─────────┴───────┴───────┐
         │  SELF-LEARNING RAG      │
         │  • Stores valid SQL     │
         │  • Updates Vector Store │
         └─────────┬───────────────┘
                   │
                   ↓
         ┌─────────────────────────┐
         │  RESULT FORMATTER       │
         │  • Single value         │
         │  • Table (≤10 rows)     │
         │  • Preview (>10 rows)   │
         └─────────┬───────────────┘
                   │
                   ↓
               USER ANSWER

🚀 Quick Start

Option 1: Run with Docker (Recommended)

No Python installation needed. Just install Docker.

# 1. Clone the repo
git clone <repository_url>
cd <project_directory>

# 2. Add your API Key
# Create a .env file and add: GROQ_API_KEY=your_key_here

# 3. Run the App
docker-compose up --build

Then open the Jupyter link printed in the terminal.


Option 2: Run Manually (Local Python)

1. Setup Environment

# Windows
py -3.11 -m venv env311
.\env311\Scripts\activate

# Mac/Linux
python3.11 -m venv env311
source env311/bin/activate

2. Install Dependencies

pip install -r requirements.txt

3. Setup API Key

Create a .env file in the project root:

GROQ_API_KEY=your_api_key_here

4. Run the Notebook

jupyter notebook trade_portfolio.ipynb

Performance Metrics

  • Database Size: 1,671 total records (1,022 holdings + 649 trades)
  • Schema Columns: 56 total columns across 2 tables
  • RAG Examples: 5 auto-generated SQL samples
  • Embedding Dimension: 384 (MiniLM model)
  • Query Latency: ~2-5 seconds (Groq API + SQL execution)
  • Accuracy: 100% on the in-scope test questions; out-of-scope questions reliably trigger the fallback response

How It Works

1. Data Ingestion

holdings_df = pd.read_csv("holdings.csv")  # 1,022 rows × 25 columns
trades_df = pd.read_csv("trades.csv")      # 649 rows × 31 columns
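
As the architecture diagram notes, ingestion also sanitizes column names and relies on type inference when writing to SQLite. A minimal sketch of that ETL step (the helper names `sanitize_columns` and `load_csv_to_sqlite` are illustrative, not the notebook's actual functions):

```python
import re
import sqlite3

import pandas as pd

def sanitize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Replace spaces/special characters so every column is a valid SQL identifier."""
    df = df.copy()
    df.columns = [re.sub(r"\W+", "_", str(c)).strip("_") for c in df.columns]
    return df

def load_csv_to_sqlite(csv_path: str, table: str, conn: sqlite3.Connection) -> int:
    """Load one CSV into SQLite; pandas maps dtypes to INTEGER/REAL/TEXT affinities."""
    df = sanitize_columns(pd.read_csv(csv_path))
    df.to_sql(table, conn, if_exists="replace", index=False)
    return len(df)
```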

2. Schema Auto-Detection

schema_info = {
    'holdings': {
        'columns': ['AsOfDate', 'PortfolioName', 'MV_Base', 'PL_YTD', ...]
    },
    'trades': {
        'columns': ['TradeTypeName', 'SecurityId', 'Quantity', ...]
    }
}
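
A dictionary like this does not need to be hand-written; it can be read from SQLite's own metadata. A sketch of one way to do it (the `extract_schema` name is illustrative):

```python
import sqlite3

def extract_schema(conn: sqlite3.Connection) -> dict:
    """Build a {table: {'columns': [...], 'types': [...]}} map from SQLite metadata."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    schema = {}
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, default_value, pk)
        info = conn.execute(f"PRAGMA table_info({table})").fetchall()
        schema[table] = {
            "columns": [row[1] for row in info],
            "types": [row[2] for row in info],
        }
    return schema
```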

3. SQL Sample Generation

sql_samples = [
    {
        "question": "How many total holdings are there?",
        "sql": "SELECT COUNT(*) FROM holdings;",
        "explanation": "Counts all rows"
    },
    # ... 4 more auto-generated samples
]

4. RAG Vector Store Creation

from sentence_transformers import SentenceTransformer
import faiss

embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = embedding_model.encode([sample['question'] for sample in sql_samples])
index = faiss.IndexFlatL2(384)  # 384 = MiniLM embedding dimension
index.add(embeddings)
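
Retrieval at query time is then a nearest-neighbour lookup over these vectors, which `index.search` performs against the FAISS index. For illustration, the same top-k L2 logic in plain NumPy (function and parameter names here are illustrative):

```python
import numpy as np

def top_k_by_l2(question_vec, example_vecs, samples, top_k=2):
    """Return the top_k samples whose embeddings are closest (L2) to the question."""
    distances = np.linalg.norm(example_vecs - question_vec, axis=1)
    nearest = np.argsort(distances)[:top_k]  # smallest distance = most similar
    return [samples[i] for i in nearest]
```

`faiss.IndexFlatL2` computes exactly these distances, just with an optimized index, so on the same vectors its `search` results agree with this brute-force version.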

5. Query Processing with Self-Learning

def querygpt_chatbot(question):
    # Step 1: Retrieve similar SQL examples (RAG)
    relevant_docs = retrieve_relevant_examples(question, top_k=2)
    
    # Step 2: Build prompt with schema + examples
    prompt = f"{schema_text}\n{relevant_docs}\nQUESTION: {question}"
    
    # Step 3: Generate SQL via LLM
    sql = llm.invoke(prompt).content.strip()
    
    # Step 4: SQL Guardrails (Safety + Auto-LIMIT)
    if not is_safe_sql(sql):  # Block DROP, DELETE, etc.
        return "Only SELECT queries allowed"
    sql = enforce_limit(sql, limit=100)  # Add LIMIT if missing
    
    # Step 5: Execute query
    result = pd.read_sql_query(sql, engine)
    
    # Step 6: Self-Learning RAG - Store successful queries
    add_success_to_rag(question, sql)  # Updates vector store dynamically
    
    # Step 7: Format and return results
    return format_result(result)
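
The guardrail helpers referenced above (`is_safe_sql`, `enforce_limit`) are not shown in the snippet; one plausible sketch, assuming a single-statement, SELECT-only policy (not necessarily the notebook's exact implementation):

```python
import re

FORBIDDEN = ("insert", "update", "delete", "drop", "alter",
             "create", "attach", "pragma", "truncate")

def is_safe_sql(sql: str) -> bool:
    """Allow only a single SELECT statement with no write/DDL keywords."""
    s = sql.strip().rstrip(";").lower()
    if not s.startswith("select"):
        return False
    if ";" in s:  # reject stacked statements like "SELECT 1; DROP TABLE x"
        return False
    return not any(re.search(rf"\b{kw}\b", s) for kw in FORBIDDEN)

def enforce_limit(sql: str, limit: int = 100) -> str:
    """Append a LIMIT clause if the query does not already have one."""
    s = sql.strip().rstrip(";")
    if re.search(r"\blimit\s+\d+\b", s, re.IGNORECASE):
        return s
    return f"{s} LIMIT {limit}"
```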

6. Future Optimizations

The following optimizations would pay off on larger datasets; they are not implemented here because the current dataset is small.

1. Intent Classification Agent: Routes queries into COUNT / PNL / FILTER / TOP-N categories. Saves LLM tokens and reduces SQL retries.

2. Table Selection Agent: Selects only relevant tables before SQL generation. Reduces schema size and confusion.

3. Column Pruning Agent: Sends only relevant columns instead of full schemas. Major token cost savings in large databases.

4. Metadata Gateway: Dynamically injects schema, data types, statistics, and sample rows. Prevents hallucinated columns.

5. Retry with Schema Expansion: If SQL fails due to missing columns, retry with expanded schema.

6. Query Audit Logging: Logs question β†’ SQL β†’ execution time β†’ status for monitoring.
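
Of these, query audit logging (item 6) is cheap to add even at the current scale. A minimal sketch using JSON Lines (the function and field names are illustrative):

```python
import json
import time
from datetime import datetime, timezone

def log_query(path: str, question: str, sql: str, started: float, status: str) -> None:
    """Append one JSON line per query: question -> SQL -> elapsed time -> status."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "sql": sql,
        "elapsed_s": round(time.time() - started, 3),
        "status": status,  # e.g. "ok", "blocked", "sql_error"
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```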
