Open
Description
Describe the feature
Before indexing, we must extract text from PDFs and other formats. It would be good to preserve some structure, such as which page each piece of text came from.
Suggested solution
We can do this quickly with pymupdf4llm or pypdf. By default, they extract only the text, without OCR for text embedded in images.
There are also more accurate (but slower) solutions:
https://github.com/dantetemplar/pdf-extraction-agenda/
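To make the fast path concrete, here is a minimal sketch using pypdf that keeps the page number alongside each piece of text. The names `PageText`, `to_page_records`, and `extract_pdf_pages` are hypothetical and not part of this project; skipping pages with no extractable text is also an assumption (such pages would need OCR, which this sketch does not do).

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class PageText:
    """Text of one page plus the metadata needed to cite it later (hypothetical type)."""
    source: str  # file path or other document identifier
    page: int    # 1-based page number
    text: str


def to_page_records(source: str, page_texts: Iterable[Optional[str]]) -> List[PageText]:
    """Turn per-page strings into records, keeping the original page numbers."""
    records = []
    for number, text in enumerate(page_texts, start=1):
        cleaned = (text or "").strip()
        if cleaned:  # assumption: pages without extractable text are skipped
            records.append(PageText(source, number, cleaned))
    return records


def extract_pdf_pages(path: str) -> List[PageText]:
    """Extract plain text page by page with pypdf (no OCR)."""
    # Imported here so the helpers above work without pypdf installed.
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(path)
    return to_page_records(path, (page.extract_text() for page in reader.pages))
```

Each record can then be indexed with `source` and `page` stored as metadata, so a search hit can point back to the page it came from. The same record shape could be filled from another extractor's output.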
Additional context
No response
Metadata
Assignees
Labels
No labels
Type
Projects
Status
📋 Backlog