lazzyms/pdf-markdown-embed


PDF -> Markdown -> Embedding


PDF -> Markdown -> Embedding is a robust, open-source document and image processing pipeline. It leverages state-of-the-art tools like docling, easyocr, and langchain to extract, process, and convert PDF documents into Markdown. It integrates with pgvector for vector storage and MinIO for object storage, making it an ideal backend for Retrieval-Augmented Generation (RAG) and LLM-powered applications.


Features

  • OCR Capabilities: Built-in Optical Character Recognition using easyocr and torch.
  • Vector Storage: Store and query document embeddings efficiently using PostgreSQL with the pgvector extension.
  • Object Storage: Reliable file and asset storage using self-hosted MinIO.
  • LLM Integration: Ready-to-use LangChain and Ollama integrations for advanced text processing and embedding generation.

Vectorless (Tree-based) Approach

The project supports a "vectorless" processing mode that builds a hierarchical, tree-based representation of documents from their Markdown instead of producing embeddings for every chunk. Key points:

  • How it works: pages are combined and split by Markdown headers (#, ##, ###, ####) to produce a tree in which headings become nested nodes. Leaf nodes are summarized using an LLM, and the rolled-up tree is stored for later retrieval.
  • When to use: useful when you want structured, section-level summaries or to reduce embedding costs by avoiding full vectorization of every chunk.
  • Enable it: set PROCESS_TYPE=vectorless in your .env (or update settings.process_type).
  • Implementation: the tree-based pipeline is implemented in src/storage/vectorless.py and orchestrated from src/main.py.
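The header-based split described above can be sketched in a few lines. This is a minimal standalone illustration, not the project's actual code (which lives in src/storage/vectorless.py); the node shape and function name are assumptions for the example:

```python
import re

def build_tree(markdown: str) -> dict:
    """Nest Markdown sections under their headings (# through ####).

    Hypothetical sketch of the vectorless tree build; the real pipeline
    is in src/storage/vectorless.py and also summarizes leaves with an LLM.
    """
    root = {"title": "ROOT", "level": 0, "text": [], "children": []}
    stack = [root]  # path from root to the current (deepest) open section
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,4})\s+(.*)", line)
        if m:
            level = len(m.group(1))  # number of '#' characters
            node = {"title": m.group(2).strip(), "level": level,
                    "text": [], "children": []}
            # Close sections at the same or deeper level before attaching.
            while stack[-1]["level"] >= level:
                stack.pop()
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            # Non-heading lines belong to the most recently opened section.
            stack[-1]["text"].append(line)
    return root
```

A summarization pass would then walk this tree bottom-up, replacing each leaf's text with an LLM summary and rolling summaries up toward the root.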

Prerequisites

Before you begin, ensure you have Docker (with the Compose plugin) and uv installed.

To install uv, run:

curl -LsSf https://astral.sh/uv/install.sh | sh

Getting Started

Follow these steps to set up the project locally.

1. Environment Setup

Create a .env file in the root directory to configure your database, storage, and application settings. The essential variables are:

# PostgreSQL / pgvector settings
POSTGRES_USER=postgres
POSTGRES_PASSWORD=your_secure_password
POSTGRES_DB=vector_db
POSTGRES_HOST=localhost
POSTGRES_PORT=5433

# MinIO settings
MINIO_ROOT_USER=admin
MINIO_ROOT_PASSWORD=your_secure_password
MINIO_ENDPOINT=localhost:9000
MINIO_ACCESS_KEY=admin
MINIO_SECRET_KEY=your_secure_password
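As a quick sanity check of these values, the PostgreSQL settings above combine into a standard connection URL. This is a generic sketch (the variable names come from this README; the function is not part of the codebase):

```python
import os

def postgres_url() -> str:
    """Assemble a pgvector-ready PostgreSQL URL from the .env values above.

    Hypothetical helper for illustration; the project's own configuration
    layer lives in src/config/.
    """
    return (
        f"postgresql://{os.environ['POSTGRES_USER']}:"
        f"{os.environ['POSTGRES_PASSWORD']}@"
        f"{os.environ['POSTGRES_HOST']}:{os.environ['POSTGRES_PORT']}/"
        f"{os.environ['POSTGRES_DB']}"
    )
```

With the defaults shown above, this yields postgresql://postgres:...@localhost:5433/vector_db.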

2. Run Docker Containers

The project relies on PostgreSQL (with pgvector) and MinIO. Spin up both containers in the background with Docker Compose:

docker compose up -d

Note: This will build the custom pgvector image and start the MinIO server. MinIO console will be available at http://localhost:9001.

3. Install Dependencies

We use uv for fast and deterministic dependency management. Create a virtual environment and install the dependencies defined in pyproject.toml.

# Create a virtual environment and sync dependencies
uv sync

Alternatively, if you prefer to manage the virtual environment manually:

uv venv
source .venv/bin/activate  # On Windows use: .venv\Scripts\activate
uv pip install -e .

4. Run the Application

Once the containers are healthy and dependencies are installed, you can run the main processing pipeline:

# If using uv sync
uv run src/main.py

# If virtual environment is activated
python src/main.py

Project Structure

md-converter/
├── docker-compose.yml      # Infrastructure orchestration
├── Dockerfile.pgvector     # Custom PostgreSQL + pgvector image
├── pyproject.toml          # Python dependencies and project metadata
├── src/
│   ├── main.py             # Application entry point
│   ├── config/             # Configuration and environment settings
│   ├── models/             # LLM factories and model definitions
│   ├── processing/         # Document and image processing logic
│   ├── storage/            # MinIO client and Vector Store integrations
│   └── utils/              # Logging and helper utilities
└── temp_files/             # Temporary directory for processing artifacts

Contributing

Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request
