This project builds a Retrieval-Augmented Generation (RAG) pipeline using Kedro for orchestration and Streamlit for interactive querying of corporate annual reports. It enables structured ingestion, intelligent chunking and querying of investor documents such as 10-Ks, 10-Qs and corporate annual reports.
conda create -n rag_ar python=3.11
conda activate rag_ar
pip install -r requirements.txt
kedro viz
kedro run

The kedro run command executes the data ingestion and document processing pipeline defined in src/, and kedro viz opens an interactive visualization of that pipeline.
Launch the Streamlit frontend for interacting with the RAG system:
streamlit run src/rag_ar/interface/streamlit_app.py

This project processes full-length annual reports, which often contain extraneous pages (logos, images, cover pages), and extracts only the relevant text chunks for semantic retrieval using embedding-based search.
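The page filtering and chunking described above can be illustrated with a small sketch. It is not the project's actual implementation: the names `filter_pages`, `chunk_page` and `Chunk`, the chunk size, and the overlap are assumptions made for illustration. The 200-character threshold is the one quoted in this README. The sketch assumes page text has already been extracted from the PDF as a list of strings.

```python
from dataclasses import dataclass

# Threshold quoted in this README; everything else here is a hypothetical choice.
MIN_PAGE_CHARS = 200


@dataclass
class Chunk:
    page: int  # 1-based page number the chunk came from
    text: str


def filter_pages(pages, min_chars=MIN_PAGE_CHARS):
    """Return (page_number, text) pairs for pages with at least min_chars characters."""
    kept = []
    for number, text in enumerate(pages, start=1):
        cleaned = " ".join(text.split())  # collapse whitespace left by PDF extraction
        if len(cleaned) >= min_chars:
            kept.append((number, cleaned))
    return kept


def chunk_page(number, text, size=500, overlap=50):
    """Split one page into overlapping character windows (assumed strategy)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(Chunk(number, text[start:start + size]))
        if start + size >= len(text):
            break  # the window already reached the end of the page
        start += step
    return chunks


def build_chunks(pages):
    """Filter low-content pages, then chunk whatever remains."""
    chunks = []
    for number, text in filter_pages(pages):
        chunks.extend(chunk_page(number, text))
    return chunks
```

Keeping the page number on each chunk makes it possible to cite the source page when an answer is shown to the user.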
Key features:
- PDF ingestion and filtering of low-content pages (e.g. image-only or logo pages).
- Vector store construction using sentence embeddings.
- Semantic search and generation (combines retrieved chunks with a language model to answer user queries).
- Kedro pipelines manage ETL steps for maintainability and reproducibility.
- Streamlit interface enables easy querying by users.
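To make the retrieval and generation features concrete, here is a minimal, dependency-free sketch of the idea. It is an illustration under stated assumptions, not the project's code: `embed` is a toy bag-of-words stand-in for a real sentence-embedding model, and `VectorStore` and `build_prompt` are hypothetical names. A real pipeline would replace `embed` with a model call and pass the prompt to a language model.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence-embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorStore:
    """In-memory store of (chunk, vector) pairs with brute-force search."""

    def __init__(self):
        self._items = []

    def add(self, chunks):
        for chunk in chunks:
            self._items.append((chunk, embed(chunk)))

    def search(self, query, k=3):
        """Return the k highest-scoring (score, chunk) pairs for the query."""
        query_vec = embed(query)
        scored = [(cosine(query_vec, vec), chunk) for chunk, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


def build_prompt(query, hits):
    """Combine retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(chunk for _, chunk in hits)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Made-up example chunks, for illustration only.
store = VectorStore()
store.add([
    "Revenue grew 12% in fiscal 2023.",
    "The board declared a dividend of $0.50 per share.",
    "Our offices are located in Austin.",
])
hits = store.search("How much did revenue grow?", k=1)
prompt = build_prompt("How much did revenue grow?", hits)
```

Brute-force search is adequate for a handful of reports; a dedicated vector index becomes worthwhile as the number of chunks grows.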
Key insights:
- Professionally designed investor annual reports contain more noise than regulatory filings such as 10-Ks.
- Removing pages with minimal text (<200 characters) improves retrieval quality and reduces vector store size.
- Embedding-based retrieval combined with filtered chunks provides reliable responses to complex user queries.
Possible future enhancements:
- Image-to-text OCR: Process image-based text (e.g. scanned financials).
- Metadata tagging: Extract and structure metadata (e.g. fiscal year, CEO name).
- Fine-tuned summarization: Generate executive summaries of entire reports.
- Evaluation framework: Implement RAG benchmarks using standard QA metrics.
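As a starting point for the evaluation framework item, the sketch below computes two metrics commonly used for question answering: exact match and token-level F1. The normalization rules (lowercasing, removing punctuation and English articles) follow a widespread convention but are an assumption here, as are the function names. Nothing in this README specifies which metrics the project would adopt.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Averaging these scores over a set of question and reference-answer pairs gives a simple baseline against which changes to filtering, chunking, or retrieval can be compared.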