A complete RAG (Retrieval-Augmented Generation) pipeline for analyzing financial documents with AI-powered vector search.
- The project focuses on a single company, listed below.
- Initial documents processed are conference call transcripts.
- Further documents will include annual reports and credit rating reports.
- The next phase introduces AI agents that automate the entire workflow.
- Aim: reduce manual effort by ~95% (targeted).
Company: Deepak Nitrite Limited
Documents: 103 PDFs (FY11-FY25) - Annual Reports, Earnings Releases, Investor Presentations, Conference Calls
- Python 3.12+
- uv package manager (recommended)
```bash
# 1. Clone the repository
git clone https://github.com/ketankauntia/RAG_STONKS
cd "RAG STONKS"

# 2. Create virtual environment
uv venv

# 3. Activate virtual environment
# Windows (Git Bash / PowerShell):
source .venv/Scripts/activate
# Linux / macOS:
source .venv/bin/activate

# 4. Install dependencies
uv sync
```

Option 1: Using uv run (no activation needed)

```bash
uv run python rag-pipeline/step_a_ocr.py
```

Option 2: After activating the venv

```bash
python rag-pipeline/step_a_ocr.py
```

```
RAG STONKS/
├── pdfs/                        # Downloaded PDFs (103 files, 4 categories)
├── scrapper/                    # PDF scraper for financial documents
├── rag-pipeline/                # Main RAG pipeline
│   ├── step_a_ocr.py            # Extract text from PDFs
│   ├── step_b_chunking.py       # Chunk text with Gemini AI
│   ├── step_c_metadata.py       # Add metadata to chunks
│   ├── step_d_embeddings.py     # Create embeddings (OpenAI)
│   ├── supabase/                # Vector database setup
│   │   ├── schema.sql           # Database schema
│   │   ├── upload_embeddings.py # Upload to Supabase
│   │   └── SETUP_GUIDE.md       # Detailed setup guide
│   ├── chunks/                  # Chunked text (JSON)
│   ├── embeddings_ready/        # Embeddings (JSONL)
│   └── config.py                # Pipeline configuration
└── main.py                      # RAG query interface (coming soon)
```
```bash
python scrapper/online_pdf_scraper.py
```

Downloads 103 financial documents across 4 categories.
```bash
uv run python rag-pipeline/step_a_ocr.py
```

Extracts text from PDFs using PyMuPDF with confidence scoring.
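The actual confidence metric lives in `step_a_ocr.py`; purely as an illustration, a simple heuristic might score extracted text by the share of word-like characters, so that garbled extraction output (control or replacement characters) scores low. The function name and character set below are assumptions, not the script's real logic:

```python
def extraction_confidence(text: str) -> float:
    """Rough quality score in [0, 1] for extracted PDF text
    (hypothetical heuristic, not the pipeline's actual metric).

    Counts the fraction of characters that are alphanumeric,
    whitespace, or common punctuation; garbled OCR output tends
    to contain control/replacement characters and scores low.
    """
    if not text:
        return 0.0
    allowed = sum(
        1 for ch in text
        if ch.isalnum() or ch.isspace() or ch in ".,;:!?()-%&'\"/"
    )
    return allowed / len(text)


clean = "Revenue for FY25 grew 12% year on year."
noisy = "\x00\ufffd\ufffd R\ufffdv\ufffdnu\ufffd \x01\x02\x03"
print(round(extraction_confidence(clean), 2))  # → 1.0
print(extraction_confidence(clean) > extraction_confidence(noisy))  # → True
```

A score like this could gate which extracted pages are trusted as-is and which are flagged for manual review.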
```bash
uv run python rag-pipeline/step_b_chunking.py
```

Uses Gemini 2.0 Flash to intelligently chunk documents into:
- Statements (executive statements, financial data)
- Q&A (question-answer pairs from calls)
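The exact chunk schema is defined in `step_b_chunking.py`; the JSON records for the two chunk types might look roughly like this (all field names and text here are illustrative assumptions, not the pipeline's actual schema):

```python
import json

# Illustrative chunk records for the two chunk types described above.
# Field names ("chunk_id", "chunk_type", etc.) are assumptions.
chunks = [
    {
        "chunk_id": "DNL_Q4FY25_concall_0001",
        "chunk_type": "statement",  # executive statement / financial data
        "text": "Revenue grew 12% year on year, driven by the phenolics segment.",
    },
    {
        "chunk_id": "DNL_Q4FY25_concall_0002",
        "chunk_type": "qa",  # question-answer pair from the call
        "question": "What drove the margin expansion this quarter?",
        "answer": "Primarily better realizations in advanced intermediates.",
    },
]

print(json.dumps(chunks[1], indent=2))
```

Keeping Q&A pairs as a single record preserves the question's context when the answer is retrieved later.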
```bash
uv run python rag-pipeline/step_c_metadata.py
```

Enriches chunks with fiscal year, quarter, and document type metadata.
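One common way to derive fiscal year and quarter is to parse them from the document filename. The helper below is a hypothetical sketch of that idea; the real `step_c_metadata.py` may obtain this metadata differently (e.g. from folder names or the Gemini output):

```python
import re

def parse_fiscal_metadata(filename: str) -> dict:
    """Extract fiscal year/quarter from a document filename.

    Hypothetical helper: assumes filenames contain markers like
    "Q4FY25"; the actual pipeline's metadata source may differ.
    """
    meta = {"fiscal_year": None, "quarter": None}
    fy = re.search(r"FY(\d{2})", filename, re.IGNORECASE)
    if fy:
        meta["fiscal_year"] = "FY" + fy.group(1)
    q = re.search(r"Q([1-4])", filename, re.IGNORECASE)
    if q:
        meta["quarter"] = "Q" + q.group(1)
    return meta

print(parse_fiscal_metadata("DeepakNitrite_Q4FY25_ConferenceCall.pdf"))
# → {'fiscal_year': 'FY25', 'quarter': 'Q4'}
```

Attaching this metadata to each chunk later allows filtered retrieval, e.g. "only Q4 FY25 conference-call chunks".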
```bash
uv run python rag-pipeline/step_d_embeddings.py
```

Generates 1,536-dimensional vectors using OpenAI text-embedding-3-small.
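Embedding APIs are normally called in batches rather than one chunk at a time. The generic batching helper below is illustrative only (the batch size and the actual request logic in `step_d_embeddings.py` may differ); with the 1,567 chunks above and batches of 100, it yields 16 requests:

```python
from typing import Iterator

def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield successive fixed-size batches of texts.

    Illustrative helper: a batch size of 100 is an assumption,
    not the pipeline's configured value.
    """
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"chunk {i}" for i in range(1567)]  # matches the 1,567 chunks above
batches = list(batched(texts, 100))
print(len(batches))      # → 16 (15 full batches + 1 partial)
print(len(batches[-1]))  # → 67 texts in the final batch
```

Batching keeps the number of HTTP round trips small and makes retry-on-failure cheaper, since only the failed batch needs resending.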
```bash
uv run python rag-pipeline/supabase/upload_embeddings.py
```

Stores embeddings in Supabase with pgvector for similarity search.
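pgvector ranks matches with distance operators (its `<=>` operator is cosine distance, i.e. 1 − cosine similarity). For intuition, the underlying similarity computation in pure Python:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors: dot product divided by
    the product of their norms. pgvector's <=> operator returns the
    corresponding cosine *distance*, 1 - similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(round(cosine_similarity([1.0, 0.0], [1.0, 0.0]), 3))  # identical → 1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 3))  # orthogonal → 0.0
```

In production the database computes this over the stored 1,536-dimensional vectors with an index, so queries never scan every row in Python.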
See rag-pipeline/supabase/SETUP_GUIDE.md for Supabase setup.
Create a .env file in the project root:
```env
# Gemini AI (for chunking)
GEMINI_API_KEY=your_gemini_api_key

# OpenAI (for embeddings)
OPENAI_API_KEY=your_openai_api_key
OPENAI_ORG_ID=org_xxxxx
OPENAI_PROJECT_ID=proj_xxxxx

# Supabase (for vector storage)
SUPABASE_URL=https://xxxxx.supabase.co
SUPABASE_KEY=your_service_role_key
```

- ✅ 103 PDFs downloaded and organized
- ✅ 16 files processed through OCR
- ✅ 1,567 chunks created (987 statements + 580 Q&As)
- ✅ 1,567 embeddings generated (1,536 dimensions each)
- ⏳ Upload to Supabase (ready to run)
- ⏳ Query interface (coming soon)
| Component | Technology |
|---|---|
| Text Extraction | PyMuPDF, PDFPlumber |
| AI Chunking | Google Gemini 2.0 Flash |
| Embeddings | OpenAI text-embedding-3-small |
| Vector Database | Supabase (PostgreSQL + pgvector) |
| Package Manager | uv |
| Web Scraping | BeautifulSoup4, Requests |
Paid for commercial use.
Data source: Undisclosed