lazzyms/pdf-markdown-embed


PDF -> Markdown -> Embedding


PDF -> Markdown -> Embedding is a robust, open-source document and image processing pipeline. It leverages state-of-the-art tools like docling, easyocr, and langchain to extract, process, and convert PDF documents into Markdown. It integrates with pgvector for vector storage and MinIO for object storage, making it an ideal backend for Retrieval-Augmented Generation (RAG) and LLM-powered applications.


Features

  • OCR Capabilities: Built-in Optical Character Recognition using easyocr and torch.
  • Vector Storage: Store and query document embeddings efficiently using PostgreSQL with the pgvector extension.
  • Object Storage: Reliable file and asset storage using self-hosted MinIO.
  • LLM Integration: Ready-to-use LangChain and Ollama integrations for advanced text processing and embedding generation.

Vectorless (Tree-based) Approach

The project supports a "vectorless" processing mode that builds a hierarchical, tree-based representation of documents from their Markdown instead of producing embeddings for every chunk. Key points:

  • How it works: pages are combined and split by Markdown headers (#, ##, ###, ####) to produce a tree in which headings become nested nodes. Leaf nodes are summarized using an LLM, and the rolled-up tree is stored for later retrieval.
  • When to use: useful when you want structured, section-level summaries or to reduce embedding costs by avoiding full vectorization of every chunk.
  • Enable it: set PROCESS_TYPE=vectorless in your .env (or update settings.process_type).
  • Implementation: the tree-based pipeline is implemented in src/storage/vectorless.py and orchestrated from src/main.py.
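The header-based split described above can be sketched in a few lines. This is a minimal standalone illustration, not the project's actual code (which lives in src/storage/vectorless.py); the node shape and function name are assumptions for the example:

```python
import re

def build_tree(markdown: str) -> dict:
    """Nest Markdown sections under their headings (# through ####).

    Hypothetical sketch of the vectorless tree build; the real pipeline
    is in src/storage/vectorless.py and also summarizes leaves with an LLM.
    """
    root = {"title": "ROOT", "level": 0, "text": [], "children": []}
    stack = [root]  # path from root to the current (deepest) open section
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,4})\s+(.*)", line)
        if m:
            level = len(m.group(1))  # number of '#' characters
            node = {"title": m.group(2).strip(), "level": level,
                    "text": [], "children": []}
            # Close sections at the same or deeper level before attaching.
            while stack[-1]["level"] >= level:
                stack.pop()
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            # Non-heading lines belong to the most recently opened section.
            stack[-1]["text"].append(line)
    return root
```

A summarization pass would then walk this tree bottom-up, replacing each leaf's text with an LLM summary and rolling summaries up toward the root.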

Prerequisites

Before you begin, ensure you have Docker (with the Compose plugin) and uv installed.

To install uv, run:

curl -LsSf https://astral.sh/uv/install.sh | sh

Getting Started

Follow these steps to set up the project locally.

1. Environment Setup

Create a .env file in the root directory to configure your database, storage, and application settings. The essential variables are:

# PostgreSQL / pgvector settings
POSTGRES_USER=postgres
POSTGRES_PASSWORD=your_secure_password
POSTGRES_DB=vector_db
POSTGRES_HOST=localhost
POSTGRES_PORT=5433

# MinIO settings
MINIO_ROOT_USER=admin
MINIO_ROOT_PASSWORD=your_secure_password
MINIO_ENDPOINT=localhost:9000
MINIO_ACCESS_KEY=admin
MINIO_SECRET_KEY=your_secure_password
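As a quick sanity check of these values, the PostgreSQL settings above combine into a standard connection URL. This is a generic sketch (the variable names come from this README; the function is not part of the codebase):

```python
import os

def postgres_url() -> str:
    """Assemble a pgvector-ready PostgreSQL URL from the .env values above.

    Hypothetical helper for illustration; the project's own configuration
    layer lives in src/config/.
    """
    return (
        f"postgresql://{os.environ['POSTGRES_USER']}:"
        f"{os.environ['POSTGRES_PASSWORD']}@"
        f"{os.environ['POSTGRES_HOST']}:{os.environ['POSTGRES_PORT']}/"
        f"{os.environ['POSTGRES_DB']}"
    )
```

With the defaults shown above, this yields postgresql://postgres:...@localhost:5433/vector_db.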

2. Run Docker Containers

The project relies on PostgreSQL (with pgvector) and MinIO. Spin up both containers in the background with Docker Compose:

docker compose up -d

Note: This will build the custom pgvector image and start the MinIO server. MinIO console will be available at http://localhost:9001.

3. Install Dependencies

We use uv for fast and deterministic dependency management. Create a virtual environment and install the dependencies defined in pyproject.toml.

# Create a virtual environment and sync dependencies
uv sync

Alternatively, if you prefer to manage the virtual environment manually:

uv venv
source .venv/bin/activate  # On Windows use: .venv\Scripts\activate
uv pip install -e .

4. Run the Application

Once the containers are healthy and dependencies are installed, you can run the main processing pipeline:

# If using uv sync
uv run src/main.py

# If virtual environment is activated
python src/main.py

Project Structure

md-converter/
├── docker-compose.yml      # Infrastructure orchestration
├── Dockerfile.pgvector     # Custom PostgreSQL + pgvector image
├── pyproject.toml          # Python dependencies and project metadata
├── src/
│   ├── main.py             # Application entry point
│   ├── config/             # Configuration and environment settings
│   ├── models/             # LLM factories and model definitions
│   ├── processing/         # Document and image processing logic
│   ├── storage/            # MinIO client and Vector Store integrations
│   └── utils/              # Logging and helper utilities
└── temp_files/             # Temporary directory for processing artifacts

Contributing

Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request
