A complete RAG (Retrieval-Augmented Generation) pipeline for analyzing financial documents with AI-powered vector search.
- The project focuses on a single company, listed below.
- Initial documents processed are conference call transcripts.
- Further documents will include annual reports and credit rating reports.
- The next phase introduces AI agents that automate the entire workflow.
- Aim: reduce manual effort by ~95% (targeted).
Company: Deepak Nitrite Limited
Documents: 103 PDFs (FY11-FY25) - Annual Reports, Earnings Releases, Investor Presentations, Conference Calls
- Python 3.12+
- uv package manager (recommended)
```bash
# 1. Clone the repository
git clone https://github.com/ketankauntia/RAG_STONKS
cd "RAG STONKS"

# 2. Create virtual environment
uv venv

# 3. Activate virtual environment
# Windows (Git Bash / PowerShell):
source .venv/Scripts/activate
# Linux / macOS:
source .venv/bin/activate

# 4. Install dependencies
uv sync
```

Option 1: Using uv run (no activation needed)

```bash
uv run python rag-pipeline/step_a_ocr.py
```

Option 2: After activating the venv

```bash
python rag-pipeline/step_a_ocr.py
```

```
RAG STONKS/
├── pdfs/                        # Downloaded PDFs (103 files, 4 categories)
├── scrapper/                    # PDF scraper for financial documents
├── rag-pipeline/                # Main RAG pipeline
│   ├── step_a_ocr.py            # Extract text from PDFs
│   ├── step_b_chunking.py       # Chunk text with Gemini AI
│   ├── step_c_metadata.py       # Add metadata to chunks
│   ├── step_d_embeddings.py     # Create embeddings (OpenAI)
│   ├── supabase/                # Vector database setup
│   │   ├── schema.sql           # Database schema
│   │   ├── upload_embeddings.py # Upload to Supabase
│   │   └── SETUP_GUIDE.md       # Detailed setup guide
│   ├── chunks/                  # Chunked text (JSON)
│   ├── embeddings_ready/        # Embeddings (JSONL)
│   └── config.py                # Pipeline configuration
└── main.py                      # RAG query interface (coming soon)
```
```bash
python scrapper/online_pdf_scraper.py
```

Downloads 103 financial documents across 4 categories.
```bash
uv run python rag-pipeline/step_a_ocr.py
```

Extracts text from PDFs using PyMuPDF with confidence scoring.
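The actual confidence metric lives in `step_a_ocr.py`; purely as an illustration, a simple heuristic might score extracted text by the share of word-like characters, so that garbled extraction output (control or replacement characters) scores low. The function name and character set below are assumptions, not the script's real logic:

```python
def extraction_confidence(text: str) -> float:
    """Rough quality score in [0, 1] for extracted PDF text
    (hypothetical heuristic, not the pipeline's actual metric).

    Counts the fraction of characters that are alphanumeric,
    whitespace, or common punctuation; garbled OCR output tends
    to contain control/replacement characters and scores low.
    """
    if not text:
        return 0.0
    allowed = sum(
        1 for ch in text
        if ch.isalnum() or ch.isspace() or ch in ".,;:!?()-%&'\"/"
    )
    return allowed / len(text)


clean = "Revenue for FY25 grew 12% year on year."
noisy = "\x00\ufffd\ufffd R\ufffdv\ufffdnu\ufffd \x01\x02\x03"
print(round(extraction_confidence(clean), 2))  # → 1.0
print(extraction_confidence(clean) > extraction_confidence(noisy))  # → True
```

A score like this could gate which extracted pages are trusted as-is and which are flagged for manual review.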
```bash
uv run python rag-pipeline/step_b_chunking.py
```

Uses Gemini 2.0 Flash to intelligently chunk documents into:
- Statements (executive statements, financial data)
- Q&A (question-answer pairs from calls)
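The exact chunk schema is defined in `step_b_chunking.py`; the JSON records for the two chunk types might look roughly like this (all field names and text here are illustrative assumptions, not the pipeline's actual schema):

```python
import json

# Illustrative chunk records for the two chunk types described above.
# Field names ("chunk_id", "chunk_type", etc.) are assumptions.
chunks = [
    {
        "chunk_id": "DNL_Q4FY25_concall_0001",
        "chunk_type": "statement",  # executive statement / financial data
        "text": "Revenue grew 12% year on year, driven by the phenolics segment.",
    },
    {
        "chunk_id": "DNL_Q4FY25_concall_0002",
        "chunk_type": "qa",  # question-answer pair from the call
        "question": "What drove the margin expansion this quarter?",
        "answer": "Primarily better realizations in advanced intermediates.",
    },
]

print(json.dumps(chunks[1], indent=2))
```

Keeping Q&A pairs as a single record preserves the question's context when the answer is retrieved later.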
```bash
uv run python rag-pipeline/step_c_metadata.py
```

Enriches chunks with fiscal year, quarter, and document type metadata.
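One common way to derive fiscal year and quarter is to parse them from the document filename. The helper below is a hypothetical sketch of that idea; the real `step_c_metadata.py` may obtain this metadata differently (e.g. from folder names or the Gemini output):

```python
import re

def parse_fiscal_metadata(filename: str) -> dict:
    """Extract fiscal year/quarter from a document filename.

    Hypothetical helper: assumes filenames contain markers like
    "Q4FY25"; the actual pipeline's metadata source may differ.
    """
    meta = {"fiscal_year": None, "quarter": None}
    fy = re.search(r"FY(\d{2})", filename, re.IGNORECASE)
    if fy:
        meta["fiscal_year"] = "FY" + fy.group(1)
    q = re.search(r"Q([1-4])", filename, re.IGNORECASE)
    if q:
        meta["quarter"] = "Q" + q.group(1)
    return meta

print(parse_fiscal_metadata("DeepakNitrite_Q4FY25_ConferenceCall.pdf"))
# → {'fiscal_year': 'FY25', 'quarter': 'Q4'}
```

Attaching this metadata to each chunk later allows filtered retrieval, e.g. "only Q4 FY25 conference-call chunks".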
```bash
uv run python rag-pipeline/step_d_embeddings.py
```

Generates 1,536-dimensional vectors using OpenAI text-embedding-3-small.
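Embedding APIs are normally called in batches rather than one chunk at a time. The generic batching helper below is illustrative only (the batch size and the actual request logic in `step_d_embeddings.py` may differ); with the 1,567 chunks above and batches of 100, it yields 16 requests:

```python
from typing import Iterator

def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield successive fixed-size batches of texts.

    Illustrative helper: a batch size of 100 is an assumption,
    not the pipeline's configured value.
    """
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"chunk {i}" for i in range(1567)]  # matches the 1,567 chunks above
batches = list(batched(texts, 100))
print(len(batches))      # → 16 (15 full batches + 1 partial)
print(len(batches[-1]))  # → 67 texts in the final batch
```

Batching keeps the number of HTTP round trips small and makes retry-on-failure cheaper, since only the failed batch needs resending.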
```bash
uv run python rag-pipeline/supabase/upload_embeddings.py
```

Stores embeddings in Supabase with pgvector for similarity search.
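pgvector ranks matches with distance operators (its `<=>` operator is cosine distance, i.e. 1 − cosine similarity). For intuition, the underlying similarity computation in pure Python:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors: dot product divided by
    the product of their norms. pgvector's <=> operator returns the
    corresponding cosine *distance*, 1 - similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(round(cosine_similarity([1.0, 0.0], [1.0, 0.0]), 3))  # identical → 1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 3))  # orthogonal → 0.0
```

In production the database computes this over the stored 1,536-dimensional vectors with an index, so queries never scan every row in Python.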
See rag-pipeline/supabase/SETUP_GUIDE.md for Supabase setup.
Create a .env file in the project root:
```env
# Gemini AI (for chunking)
GEMINI_API_KEY=your_gemini_api_key

# OpenAI (for embeddings)
OPENAI_API_KEY=your_openai_api_key
OPENAI_ORG_ID=org_xxxxx
OPENAI_PROJECT_ID=proj_xxxxx

# Supabase (for vector storage)
SUPABASE_URL=https://xxxxx.supabase.co
SUPABASE_KEY=your_service_role_key
```

- ✅ 103 PDFs downloaded and organized
- ✅ 16 files processed through OCR
- ✅ 1,567 chunks created (987 statements + 580 Q&As)
- ✅ 1,567 embeddings generated (1,536 dimensions each)
- ⏳ Upload to Supabase (ready to run)
- ⏳ Query interface (coming soon)
| Component | Technology |
|---|---|
| Text Extraction | PyMuPDF, PDFPlumber |
| AI Chunking | Google Gemini 2.0 Flash |
| Embeddings | OpenAI text-embedding-3-small |
| Vector Database | Supabase (PostgreSQL + pgvector) |
| Package Manager | uv |
| Web Scraping | BeautifulSoup4, Requests |
Paid for commercial use.
Data source: Undisclosed