PDF -> Markdown -> Embedding is an open-source document and image processing pipeline. It uses docling, easyocr, and langchain to extract, process, and convert PDF documents into Markdown. It integrates with pgvector for vector storage and MinIO for object storage, making it a suitable backend for Retrieval-Augmented Generation (RAG) and LLM-powered applications.
- OCR Capabilities: Built-in Optical Character Recognition using `easyocr` and `torch`.
- Vector Storage: Store and query document embeddings efficiently using PostgreSQL with the `pgvector` extension.
- Object Storage: Reliable file and asset storage using self-hosted MinIO.
- LLM Integration: Ready-to-use LangChain and Ollama integrations for advanced text processing and embedding generation.
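To make the vector-storage feature above more concrete, here is a minimal sketch of the ranking a cosine-distance search performs. pgvector exposes cosine distance through its `<=>` operator; the plain-Python version below only illustrates the arithmetic and is not how this project queries the database. The names `cosine_distance` and `nearest` are hypothetical.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity for two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def nearest(query: list[float], stored: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the keys of the k stored vectors closest to the query."""
    ranked = sorted(stored.items(), key=lambda item: cosine_distance(query, item[1]))
    return [name for name, _ in ranked[:k]]
```

In a real deployment the database does this ranking, typically with an index, so the application never loads every embedding into memory.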
The project supports a "vectorless" processing mode that builds a hierarchical, tree-based representation of documents from their Markdown instead of producing embeddings for every chunk. Key points:
- How it works: pages are combined and split by Markdown headers (#, ##, ###, ####) to produce a `Tree` where headings become nested nodes. Leaf nodes are summarized using an LLM and the rolled-up tree is stored for later retrieval.
- When to use: useful when you want structured, section-level summaries or to reduce embedding costs by avoiding full vectorization of every chunk.
- Enable it: set `PROCESS_TYPE=vectorless` in your `.env` (or update `settings.process_type`).
- Implementation: the tree-based pipeline is implemented in `src/storage/vectorless.py` and orchestrated from `src/main.py`.
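The header-splitting step described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `src/storage/vectorless.py`: the `Node` class and the `build_tree` and `leaves` helpers are hypothetical names, and the LLM summarization step is left out.

```python
import re
from dataclasses import dataclass, field

# Matches headings of level 1 to 4, the levels named in the text above.
HEADING_RE = re.compile(r"^(#{1,4})\s+(.*\S)\s*$")
FENCE = "`" * 3  # code-fence marker; '#' lines inside fences are not headings


@dataclass
class Node:
    """Hypothetical tree node; stands in for the project's own Tree type."""
    title: str
    level: int
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def build_tree(markdown: str) -> Node:
    """Nest sections by heading depth using a stack of open ancestors."""
    root = Node(title="root", level=0)
    stack = [root]
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            level = len(match.group(1))
            # Close every open section at the same or a deeper level.
            while stack[-1].level >= level:
                stack.pop()
            node = Node(title=match.group(2), level=level)
            stack[-1].children.append(node)
            stack.append(node)
        else:
            # Body text belongs to the most recently opened section.
            stack[-1].text += line + "\n"
    return root


def leaves(node: Node) -> list[Node]:
    """Return the nodes without children, i.e. the ones to summarize."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]
```

Calling `leaves(build_tree(markdown))` yields the sections that would be passed to a summarizer. Parent summaries could then be built by combining the summaries of their children.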
Before you begin, ensure you have the following installed:
- Docker and Docker Compose
- Python 3.13+
- uv (Astral's blazing-fast Python package manager)
To install uv, run:
curl -LsSf https://astral.sh/uv/install.sh | sh
Follow these steps to set up the project locally.
Create a .env file in the root directory with the following essential variables to configure your database, storage, and application settings:
# PostgreSQL / pgvector settings
POSTGRES_USER=postgres
POSTGRES_PASSWORD=your_secure_password
POSTGRES_DB=vector_db
POSTGRES_HOST=localhost
POSTGRES_PORT=5433
# MinIO settings
MINIO_ROOT_USER=admin
MINIO_ROOT_PASSWORD=your_secure_password
MINIO_ENDPOINT=localhost:9000
MINIO_ACCESS_KEY=admin
MINIO_SECRET_KEY=your_secure_password
The project relies on PostgreSQL (with pgvector) and MinIO. Start the required infrastructure in the background using Docker Compose:
docker compose up -d
Note: This builds the custom pgvector image and starts the MinIO server. The MinIO console will be available at http://localhost:9001.
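Once the containers are up, the values in `.env` are what the application needs to reach them. The sketch below shows one way to read those variables and assemble a PostgreSQL connection URL. It is an assumption for illustration only: the project's real settings loader lives under `src/config/` and is not shown here, and `parse_env` and `postgres_dsn` are hypothetical helpers.

```python
from urllib.parse import quote


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def postgres_dsn(env: dict[str, str]) -> str:
    """Build a postgresql:// URL from the POSTGRES_* variables."""
    # Percent-encode credentials so characters such as '@' or ':' stay valid.
    user = quote(env["POSTGRES_USER"], safe="")
    password = quote(env["POSTGRES_PASSWORD"], safe="")
    host = env["POSTGRES_HOST"]
    port = env["POSTGRES_PORT"]
    database = env["POSTGRES_DB"]
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"
```

A missing variable raises `KeyError`, which surfaces configuration mistakes before any connection is attempted.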
We use uv for fast and deterministic dependency management. Create a virtual environment and install the dependencies defined in pyproject.toml.
# Create a virtual environment and sync dependencies
uv sync
Alternatively, if you prefer to manage the virtual environment manually:
uv venv
source .venv/bin/activate # On Windows use: .venv\Scripts\activate
uv pip install -e .
Once the containers are healthy and dependencies are installed, you can run the main processing pipeline:
# If using uv sync
uv run src/main.py
# If virtual environment is activated
python src/main.py
md-converter/
├── docker-compose.yml # Infrastructure orchestration
├── Dockerfile.pgvector # Custom PostgreSQL + pgvector image
├── pyproject.toml # Python dependencies and project metadata
├── src/
│ ├── main.py # Application entry point
│ ├── config/ # Configuration and environment settings
│ ├── models/ # LLM factories and model definitions
│ ├── processing/ # Document and image processing logic
│ ├── storage/ # MinIO client and Vector Store integrations
│ └── utils/ # Logging and helper utilities
└── temp_files/ # Temporary directory for processing artifacts
Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request