Extract text from Moodle files #83

@dantetemplar

Describe the feature

Before indexing, we must extract text from PDF and other file formats. It would be good to preserve some structure, such as which page each piece of text came from.

Suggested solution

We can do this quickly with pymupdf4llm or pypdf. By default, they extract only the embedded text, without OCR for text inside images.

There are also more accurate (but slower) solutions:
https://github.com/dantetemplar/pdf-extraction-agenda/
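
A minimal sketch of the fast, text-only path with pypdf, keeping the page number next to each chunk of text so it can be stored in the index. The `PageText` record and the helper names `pages_to_records` and `extract_pdf` are hypothetical and only illustrate the idea; they are not part of the project or of pypdf. Pages with no embedded text (for example scanned images) are skipped here, which is an assumption about the desired behavior.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class PageText:
    """One page worth of extracted text (hypothetical record shape)."""
    source: str  # path or identifier of the file the text came from
    page: int    # 1-based page number
    text: str    # extracted text with surrounding whitespace removed


def pages_to_records(source: str, page_texts: Iterable[Optional[str]]) -> Iterator[PageText]:
    """Attach 1-based page numbers to per-page text, skipping empty pages."""
    for number, text in enumerate(page_texts, start=1):
        cleaned = (text or "").strip()
        if cleaned:
            yield PageText(source=source, page=number, text=cleaned)


def extract_pdf(path: str) -> list[PageText]:
    """Extract embedded text page by page with pypdf (no OCR)."""
    # Imported lazily so the helper above works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return list(pages_to_records(path, (page.extract_text() for page in reader.pages)))
```

Calling `extract_pdf("lecture.pdf")` (a placeholder file name) would return one `PageText` per non-empty page. Keeping the numbering logic in `pages_to_records` means another extractor, such as pymupdf4llm, could feed the same record shape.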

Additional context

No response
