feat: add assignment phase2 until phase4 #3

Alamnzr123 · 2025-10-23T17:40:53Z

This pull request introduces several new modules and significant enhancements to the document processing and retrieval pipeline for fund performance documents. The main improvements include robust PDF parsing, table classification, chunking for vector storage, embedding generation, and a new RAG (Retrieval-Augmented Generation) engine for semantic search and question answering. Error handling and background processing are also improved for reliability and scalability.

Document Processing and Chunking Enhancements:

Implemented robust PDF parsing in DocumentProcessor, extracting both text and tables, classifying tables, chunking text for vector storage, and optionally ingesting chunks into the RAG engine. Parsed results are saved as JSON for fallback. Improved error handling and statistics reporting. (backend/app/services/document_processor.py, L23-R244)
Added a simple whitespace-based chunker with overlap for splitting text into chunks suitable for vector storage. (backend/app/services/chunker.py, R1-R24)
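
To illustrate the chunking step, here is a minimal sketch of a whitespace-based chunker with overlap. It is not the code in chunker.py: the function name `chunk_text`, the parameters `chunk_size` and `overlap`, and the choice to measure both in words are assumptions made for this example.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text on whitespace into chunks of at most chunk_size words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()          # whitespace tokenization, no NLP dependency
    step = chunk_size - overlap   # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the window has reached the end of the text, so no
        # trailing chunk consisting only of overlap words is produced.
        if start + chunk_size >= len(words):
            break
    return chunks
```

With `chunk_size=4` and `overlap=2`, the text `"a b c d e f g"` yields three chunks, each sharing two words with its neighbour. A larger overlap improves recall at chunk boundaries at the cost of storing more vectors.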

Table Classification Improvements:

Introduced a lightweight rule-based table classifier to identify table types (e.g., capital calls, distributions, adjustments) in fund reports, including confidence scoring. (backend/app/services/table_parser.py, R1-R64)
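
A rule-based classifier of this kind can be sketched as keyword counting, with the confidence taken as the winning label's share of all keyword hits. The keyword lists, the label names, and the `classify_table` function below are hypothetical and chosen for illustration; the actual rules in table_parser.py may differ.

```python
import re

# Hypothetical keyword rules; the real classifier may use other terms.
KEYWORDS = {
    "capital_call": ("capital call", "contribution", "drawdown"),
    "distribution": ("distribution", "proceeds", "return of capital"),
    "adjustment": ("adjustment", "rebalance", "recallable"),
}


def classify_table(rows: list[list[str]]) -> tuple[str, float]:
    """Return (label, confidence) for a table given as rows of cell strings.

    Confidence is the fraction of all keyword hits that belong to the
    winning label, so it is 1.0 when only one table type is mentioned.
    """
    text = " ".join(
        str(cell) for row in rows for cell in row if cell is not None
    ).lower()

    hits = {
        label: sum(len(re.findall(re.escape(kw), text)) for kw in keywords)
        for label, keywords in KEYWORDS.items()
    }
    total = sum(hits.values())
    if total == 0:
        return "unknown", 0.0

    label = max(hits, key=hits.get)
    return label, hits[label] / total
```

A caller could treat results below some confidence threshold as unclassified and route them to manual review, which keeps the rules simple while limiting silent misclassification.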

Semantic Search and RAG Engine:

Added a new RAGEngine module that supports ingesting document chunks into a FAISS vector store, generating embeddings via OpenAI or sentence-transformers, and retrieving relevant contexts for queries. Also supports optional LLM-based question answering. (backend/app/services/rag_engine.py, R1-R66)
Implemented an embedding provider that uses OpenAI (if API key is available) or sentence-transformers as a fallback, supporting batch embedding of text chunks. (backend/app/services/embeddings.py, R1-R42)
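
The provider fallback, batch embedding, and retrieval steps can be sketched without any external dependency. Everything named here is an assumption for illustration: the environment variable `OPENAI_API_KEY`, the helper names, and the batch size. The brute-force cosine ranking in `top_k` is a stand-in for the FAISS index, not the FAISS API, and the embedder is passed in as a plain callable rather than a real OpenAI or sentence-transformers client.

```python
import math
import os
from typing import Callable, Mapping, Optional, Sequence

Vector = list[float]
Embedder = Callable[[Sequence[str]], list[Vector]]


def select_provider(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick 'openai' when an API key is set, else the local fallback."""
    env = os.environ if env is None else env
    return "openai" if env.get("OPENAI_API_KEY") else "sentence-transformers"


def embed_in_batches(
    texts: Sequence[str], embed: Embedder, batch_size: int = 64
) -> list[Vector]:
    """Call `embed` on slices of `texts` and concatenate the vectors in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    vectors: list[Vector] = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed(texts[start:start + batch_size]))
    return vectors


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Vector, vectors: Sequence[Vector], k: int = 3) -> list[int]:
    """Indices of the k stored vectors most similar to the query vector."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine(query, vectors[i]),
        reverse=True,
    )
    return ranked[:k]
```

Keeping the embedder behind a callable means the ingestion and retrieval code does not need to know which provider was selected; only the vector dimension has to stay consistent between ingestion and query time.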

API and Background Task Reliability:

Improved error handling and logging in chat and document endpoints, including robust background task management for document parsing, with fallback to in-process tasks if Celery is unavailable. (backend/app/api/endpoints/documents.py, [1] [2]; backend/app/api/endpoints/chat.py, [3])
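
The Celery fallback described above can be sketched as a small dispatcher. The `dispatch` function and its return shape are hypothetical; the only Celery behavior relied on is that a task object exposes `.delay(*args)`. If queuing fails or no Celery task is supplied, the work runs in a background thread in the same process.

```python
import logging
import threading
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


def dispatch(
    task: Callable[..., Any],
    *args: Any,
    celery_task: Optional[Any] = None,
) -> tuple[str, Optional[threading.Thread]]:
    """Queue work via Celery when possible, else run it in-process.

    Returns ("celery", None) when the task was queued, or
    ("in-process", thread) when the fallback thread was started.
    """
    if celery_task is not None:
        try:
            celery_task.delay(*args)
            return "celery", None
        except Exception:
            # Broker unreachable or misconfigured: log and fall through.
            logger.exception("Celery dispatch failed; running in-process")

    thread = threading.Thread(target=task, args=args, daemon=True)
    thread.start()
    return "in-process", thread
```

An in-process thread is lost if the server restarts, so this fallback trades durability for availability; that seems acceptable for development setups but is worth noting for production use.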

These changes collectively enable more reliable, scalable, and intelligent processing and retrieval of fund performance documents, laying the foundation for advanced semantic search and question answering capabilities.

feat: add assignment phase2 until phase4

59ce9ef
