RAG STONKS 📈

A complete RAG (Retrieval-Augmented Generation) pipeline for analyzing financial documents with AI-powered vector search.

  • The project focuses on one specific company, listed below.

  • Initial documents processed are conference calls.

  • Further documents will include annual reports and credit ratings.

  • We then move on to AI agents that automate the entire workflow.

  • Aim: reduce manual effort by ~95% (target).

Company: Deepak Nitrite Limited
Documents: 103 PDFs (FY11-FY25) - Annual Reports, Earnings Releases, Investor Presentations, Conference Calls


🚀 Quick Start

Prerequisites

  • Python 3.12+
  • uv package manager (recommended)

Installation

# 1. Clone the repository
git clone https://github.com/ketankauntia/RAG_STONKS
cd "RAG STONKS"

# 2. Create virtual environment
uv venv

# 3. Activate virtual environment
# Windows (Git Bash / PowerShell):
source .venv/Scripts/activate

# Linux / macOS:
source .venv/bin/activate

# 4. Install dependencies
uv sync

Running Pipeline Steps

Option 1: Using uv run (no activation needed)

uv run python rag-pipeline/step_a_ocr.py

Option 2: After activating venv

python rag-pipeline/step_a_ocr.py

📂 Project Structure

RAG STONKS/
├── pdfs/                      # Downloaded PDFs (103 files, 4 categories)
├── scrapper/                  # PDF scraper for financial documents
├── rag-pipeline/              # Main RAG pipeline
│   ├── step_a_ocr.py         # Extract text from PDFs
│   ├── step_b_chunking.py    # Chunk text with Gemini AI
│   ├── step_c_metadata.py    # Add metadata to chunks
│   ├── step_d_embeddings.py  # Create embeddings (OpenAI)
│   ├── supabase/             # Vector database setup
│   │   ├── schema.sql        # Database schema
│   │   ├── upload_embeddings.py  # Upload to Supabase
│   │   └── SETUP_GUIDE.md    # Detailed setup guide
│   ├── chunks/               # Chunked text (JSON)
│   ├── embeddings_ready/     # Embeddings (JSONL)
│   └── config.py             # Pipeline configuration
└── main.py                    # RAG query interface (coming soon)

🔄 Pipeline Steps

1️⃣ Scrape PDFs

python scrapper/online_pdf_scraper.py

Downloads 103 financial documents across 4 categories.

2️⃣ Extract Text (OCR)

uv run python rag-pipeline/step_a_ocr.py

Extracts text from PDFs using PyMuPDF with confidence scoring.
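One way to score extraction quality, assuming text has already been pulled with PyMuPDF's `page.get_text()`, is a simple character-level heuristic. This is an illustrative sketch, not the actual scoring logic in `step_a_ocr.py`:

```python
def text_confidence(text: str) -> float:
    """Crude confidence heuristic: the fraction of characters that are
    printable and not the Unicode replacement character (a common sign
    of failed text extraction). Scores near 1.0 suggest clean text."""
    if not text:
        return 0.0
    good = sum(1 for ch in text if ch.isprintable() and ch != "\ufffd")
    return good / len(text)

print(text_confidence("Revenue grew 12% YoY"))  # → 1.0
```

Pages scoring below some threshold could then be routed to a slower OCR fallback.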

3️⃣ Chunk Documents

uv run python rag-pipeline/step_b_chunking.py

Uses Gemini 2.0 Flash to intelligently chunk documents into:

  • Statements (executive statements, financial data)
  • Q&A (question-answer pairs from calls)
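The two chunk types might be represented as records like the following. Field names here are illustrative assumptions, not necessarily the exact schema emitted by `step_b_chunking.py`:

```python
import json

# Hypothetical shapes for the two chunk types.
statement_chunk = {
    "chunk_type": "statement",
    "speaker": "CFO",
    "text": "EBITDA margin expanded on better phenol realizations.",
}
qa_chunk = {
    "chunk_type": "qa",
    "question": "What drove the margin expansion this quarter?",
    "answer": "Improved phenol realizations and lower input costs.",
}

print(json.dumps([statement_chunk, qa_chunk], indent=2))
```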

4️⃣ Add Metadata

uv run python rag-pipeline/step_c_metadata.py

Enriches chunks with fiscal year, quarter, document type metadata.
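Fiscal year and quarter can often be inferred from the source filename. The naming pattern below (`Q3FY24_...`) is an assumption for illustration; `step_c_metadata.py` may derive metadata differently:

```python
import re

def infer_metadata(filename: str) -> dict:
    """Infer fiscal year and quarter from a filename such as
    'Q3FY24_Concall.pdf'. Returns None for fields it cannot find."""
    meta = {"fiscal_year": None, "quarter": None}
    if m := re.search(r"FY(\d{2})", filename, re.IGNORECASE):
        meta["fiscal_year"] = 2000 + int(m.group(1))
    if m := re.search(r"Q([1-4])", filename, re.IGNORECASE):
        meta["quarter"] = int(m.group(1))
    return meta

print(infer_metadata("Q3FY24_Concall.pdf"))  # → {'fiscal_year': 2024, 'quarter': 3}
```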

5️⃣ Create Embeddings

uv run python rag-pipeline/step_d_embeddings.py

Generates 1,536-dimensional vectors using OpenAI text-embedding-3-small.
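The OpenAI embeddings endpoint accepts multiple inputs per request, so a batching helper keeps the request count (and rate-limit pressure) down when embedding 1,500+ chunks. A minimal sketch of the batching logic, independent of the API call itself:

```python
from itertools import islice

def batched(texts, size=100):
    """Yield fixed-size batches of chunk texts to send to the
    embeddings API; the final batch may be smaller."""
    it = iter(texts)
    while batch := list(islice(it, size)):
        yield batch

chunks = [f"chunk {i}" for i in range(250)]
print([len(b) for b in batched(chunks, 100)])  # → [100, 100, 50]
```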

6️⃣ Upload to Supabase

uv run python rag-pipeline/supabase/upload_embeddings.py

Stores embeddings in Supabase with pgvector for similarity search.

See rag-pipeline/supabase/SETUP_GUIDE.md for Supabase setup.
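Under the hood, pgvector ranks stored embeddings by distance to the query vector (its `<=>` operator computes cosine distance, i.e. 1 minus cosine similarity). The underlying math, sketched in plain Python:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors: the dot
    product divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # → 1.0 (identical direction)
```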


⚙️ Configuration

Create a .env file in the project root:

# Gemini AI (for chunking)
GEMINI_API_KEY=your_gemini_api_key

# OpenAI (for embeddings)
OPENAI_API_KEY=your_openai_api_key
OPENAI_ORG_ID=org_xxxxx
OPENAI_PROJECT_ID=proj_xxxxx

# Supabase (for vector storage)
SUPABASE_URL=https://xxxxx.supabase.co
SUPABASE_KEY=your_service_role_key
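Each pipeline step can fail fast if a required key is missing. A minimal sketch (the pipeline itself may load `.env` via python-dotenv; `RAG_DEMO_KEY` below is a stand-in variable for illustration):

```python
import os

def require_env(name: str) -> str:
    """Return an environment variable's value, or raise a clear
    error if it is missing or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required environment variable: {name}")
    return value

os.environ.setdefault("RAG_DEMO_KEY", "demo-key")  # stand-in for .env loading
demo = require_env("RAG_DEMO_KEY")
```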

📊 Current Status

  • 103 PDFs downloaded and organized
  • 16 files processed through OCR
  • 1,567 chunks created (987 statements + 580 Q&As)
  • 1,567 embeddings generated (1,536 dimensions each)
  • ⏳ Upload to Supabase (ready to run)
  • ⏳ Query interface (coming soon)

🛠️ Tech Stack

| Component       | Technology                          |
| --------------- | ----------------------------------- |
| Text Extraction | PyMuPDF, PDFPlumber                 |
| AI Chunking     | Google Gemini 2.0 Flash             |
| Embeddings      | OpenAI text-embedding-3-small       |
| Vector Database | Supabase (PostgreSQL + pgvector)    |
| Package Manager | uv                                  |
| Web Scraping    | BeautifulSoup4, Requests            |

📝 License

Paid license required for commercial use.


🙏 Acknowledgments

Data source: Undisclosed
