@Alamnzr123 Alamnzr123 commented Oct 23, 2025

This pull request introduces several new modules and significant enhancements to the document processing and retrieval pipeline for fund performance documents. The main improvements include robust PDF parsing, table classification, chunking for vector storage, embedding generation, and a new RAG (Retrieval-Augmented Generation) engine for semantic search and question answering. Error handling and background processing are also improved for reliability and scalability.

Document Processing and Chunking Enhancements:

  • Implemented robust PDF parsing in DocumentProcessor: it extracts both text and tables, classifies tables, chunks text for vector storage, and optionally ingests chunks into the RAG engine. Parsed results are saved as JSON for fallback. Improved error handling and statistics reporting. (backend/app/services/document_processor.py, L23-R244)
  • Added a simple whitespace-based chunker with overlap for splitting text into chunks suitable for vector storage. (backend/app/services/chunker.py, R1-R24)
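
For reviewers, the whitespace chunking with overlap described above can be sketched roughly as follows. This is an illustration only, not the code in chunker.py: the function name `chunk_text` and the parameters `chunk_size` and `overlap` (both counted in words) are assumptions.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text on whitespace into word windows that share `overlap` words.

    Hypothetical sketch: names and defaults are illustrative, not taken from the PR.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some words twice.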

Table Classification Improvements:

  • Introduced a lightweight rule-based table classifier to identify table types (e.g., capital calls, distributions, adjustments) in fund reports, including confidence scoring. (backend/app/services/table_parser.py, R1-R64)
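
As a rough picture of what rule-based classification with a confidence score can look like, here is a minimal keyword-counting sketch. The keyword lists, the label names, and the scoring rule (share of keyword hits won by the top label) are all assumptions for illustration; the rules in table_parser.py may differ.

```python
from typing import Sequence

# Hypothetical keyword rules; the real classifier's vocabulary is not shown in this PR text.
KEYWORDS = {
    "capital_call": ("capital call", "called", "contribution", "drawdown"),
    "distribution": ("distribution", "distributed", "return of capital", "dividend"),
    "adjustment": ("adjustment", "rebalance", "recallable", "clawback"),
}


def classify_table(rows: Sequence[Sequence[object]]) -> tuple[str, float]:
    """Return (label, confidence) for a table given as rows of cells."""
    # Flatten every non-empty cell into one lowercase string to search.
    text = " ".join(
        str(cell).lower() for row in rows for cell in row if cell is not None
    )
    hits = {
        label: sum(text.count(keyword) for keyword in keywords)
        for label, keywords in KEYWORDS.items()
    }
    total = sum(hits.values())
    if total == 0:
        return "unknown", 0.0
    best = max(hits, key=lambda label: hits[label])
    # Confidence is the fraction of all keyword hits that belong to the winning label.
    return best, hits[best] / total
```

A table whose cells mention only one category scores 1.0, while a table mixing vocabulary from several categories gets a lower score that a caller could use as a threshold for manual review.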

Semantic Search and RAG Engine:

  • Added a new RAGEngine module that supports ingesting document chunks into a FAISS vector store, generating embeddings via OpenAI or sentence-transformers, and retrieving relevant contexts for queries. Also supports optional LLM-based question answering. (backend/app/services/rag_engine.py, R1-R66)
  • Implemented an embedding provider that uses OpenAI (if API key is available) or sentence-transformers as a fallback, supporting batch embedding of text chunks. (backend/app/services/embeddings.py, R1-R42)
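
The ingest-then-retrieve flow and the provider fallback can be illustrated with a dependency-free sketch. To keep it runnable anywhere, a brute-force inner-product search stands in for the FAISS index and a hashed bag-of-words function stands in for a real embedding model; neither is what the PR ships. The names `choose_provider`, `toy_embed`, and `TinyVectorStore` are hypothetical, and reading the key from `OPENAI_API_KEY` is an assumption about how the provider is configured.

```python
import hashlib
import math
from typing import Callable, Mapping

Vector = list[float]


def choose_provider(env: Mapping[str, str]) -> str:
    """Prefer OpenAI when a key is configured, otherwise fall back to a local model."""
    return "openai" if env.get("OPENAI_API_KEY") else "sentence-transformers"


def toy_embed(text: str, dim: int = 64) -> Vector:
    """Stand-in embedder (NOT a real model): hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class TinyVectorStore:
    """Brute-force inner-product search, standing in for a FAISS index."""

    def __init__(self, embed: Callable[[str], Vector] = toy_embed) -> None:
        self.embed = embed
        self.chunks: list[str] = []
        self.vectors: list[Vector] = []

    def ingest(self, chunks: list[str]) -> None:
        """Embed each chunk and keep the vector alongside its text."""
        for chunk in chunks:
            self.chunks.append(chunk)
            self.vectors.append(self.embed(chunk))

    def retrieve(self, query: str, k: int = 3) -> list[tuple[float, str]]:
        """Return the k highest-scoring (score, chunk) pairs for the query."""
        q = self.embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), chunk)
            for vec, chunk in zip(self.vectors, self.chunks)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```

Because the embedder is injected, the same store works with whichever provider `choose_provider` selects. With L2-normalised vectors the inner product equals cosine similarity, which is the usual reason to normalise before an inner-product search.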

API and Background Task Reliability:

  • Improved error handling and logging in chat and document endpoints, including robust background task management for document parsing, with fallback to in-process tasks if Celery is unavailable. (backend/app/api/endpoints/documents.py, [1] [2]; backend/app/api/endpoints/chat.py, [3])
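
The queue-with-fallback behaviour can be pictured with the sketch below. It is an assumption about the general shape, not the endpoint code: `dispatch` and its `enqueue` parameter are hypothetical names, and the in-process path here uses a plain thread, whereas the PR may use a different mechanism. With Celery, `enqueue` would typically be a task's `delay` method.

```python
import logging
import threading
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


def dispatch(
    func: Callable[..., Any],
    *args: Any,
    enqueue: Optional[Callable[..., Any]] = None,
) -> tuple[str, Optional[threading.Thread]]:
    """Try the task queue first; if that fails, run `func` in a local thread.

    Returns ("queued", None) or ("in-process", thread).
    """
    if enqueue is not None:
        try:
            enqueue(*args)  # e.g. some_celery_task.delay (hypothetical task name)
            return "queued", None
        except Exception:
            # Broker down or misconfigured: log it and fall through to local execution.
            logger.warning("Task queue unavailable; running in-process", exc_info=True)

    # Daemon thread: it will not block interpreter shutdown, so unfinished work can be lost.
    thread = threading.Thread(target=func, args=args, daemon=True)
    thread.start()
    return "in-process", thread
```

Returning the mode lets the endpoint report how the document is being processed, and logging the failure keeps a silent fallback from hiding a broken broker.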

These changes collectively enable more reliable, scalable, and intelligent processing and retrieval of fund performance documents, laying the foundation for advanced semantic search and question answering capabilities.
