A system to fetch scientific paper PDFs and perform semantic search on their contents.
🎮 Papers, Please (whose branding we have absolutely not stolen)
(Todo)
First, configure your secrets (an example is provided at `.env.example`).
Then, just `docker-compose up` and you should be good to go.
The system is split into 4 services:
- Frontend: Built with React with the help of Claude
- Backend:
  - REST API built with FastAPI
  - Fetches new paper metadata using SemanticScholar's API
  - Queries registered papers, enabling users to search inside PDFs
- Worker:
  - Performs slow batch-processing tasks:
    - Downloads paper PDFs automatically
    - Extracts text from PDFs with RapidOCR
    - Chunks text with Docling's HybridChunker
    - Embeds chunks and indexes them into the Pinecone vector DB
- Postgres DB
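The backend's "search inside PDFs" flow boils down to: embed the query, then rank stored chunk vectors by similarity. A minimal sketch of that idea, with a toy character-frequency `embed()` standing in for the real embedding model and a plain dict standing in for Pinecone (in the real system the ranking is a single index query):

```python
import math

def embed(text: str) -> list[float]:
    # Toy embedding: character-frequency vector over a-z.
    # NOT the real model -- just enough to illustrate the ranking step.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query: str, chunks: dict[str, str], top_k: int = 3) -> list[str]:
    """Return the ids of the top_k chunks most similar to the query."""
    qv = embed(query)
    scored = [(cosine(qv, embed(text)), cid) for cid, text in chunks.items()]
    return [cid for _, cid in sorted(scored, reverse=True)[:top_k]]
```

In production the chunk vectors live in Pinecone, so the scoring loop is replaced by one call to the vector index.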
The worker service is essentially a cron job: it polls the DB for pending PDFs, sequentially processes a batch (Download -> OCR -> Chunk -> Embed), and requeues any failed documents.
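The batch loop above can be sketched as follows; `Paper`, the stage functions, and `run_batch` are illustrative names, not the project's real API:

```python
from dataclasses import dataclass

@dataclass
class Paper:
    id: int
    status: str = "pending"  # what the worker polls the DB for

def process(paper: Paper, stages) -> Paper:
    """Run the pipeline stages (Download -> OCR -> Chunk -> Embed) in order."""
    try:
        for stage in stages:
            stage(paper)
        paper.status = "done"
    except Exception:
        paper.status = "pending"  # leave failed documents queued for retry
    return paper

def run_batch(papers, stages):
    """Sequentially process one polled batch of pending papers."""
    return [process(p, stages) for p in papers if p.status == "pending"]
```

A document that fails at any stage simply stays `pending`, so the next polling cycle picks it up again.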
At scale this is not ideal, because it couples online tasks (the backend service) to batch tasks: both keep hitting the same DB.
In the future I'd like to add something like Redis as a task queue for the worker service, replacing this polling mechanism.
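The shape of that push-based design, with the stdlib's `queue.Queue` standing in for Redis (where a blocking list pop such as `BLPOP` would play the same role): the backend enqueues document ids, and the worker blocks on the queue instead of polling the DB.

```python
import queue
import threading

tasks = queue.Queue()   # stand-in for a Redis list; not the project's real setup
processed = []

def worker():
    while True:
        doc_id = tasks.get()      # blocks until a task arrives -- no polling
        if doc_id is None:        # sentinel value used to shut down cleanly
            break
        processed.append(doc_id)  # real worker would run Download -> OCR -> Chunk -> Embed
        tasks.task_done()

t = threading.Thread(target=worker)
t.start()
for doc in (1, 2, 3):
    tasks.put(doc)                # backend side: enqueue instead of INSERT + poll
tasks.put(None)
t.join()
```

This removes the shared-DB contention: the backend only writes to the queue, and the worker wakes up exactly when there is work to do.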
If you are on Nix, just `nix develop` to get the development shell from the flake. Optionally, `direnv allow` to automatically activate the shell when you `cd` into the project folder.
If you are a normal, healthy person, use the amazing `uv` package manager to create a virtual environment with everything you need (for the backend).

