Duckling extracts textual content and image descriptions from common
document formats (PDF, images, CSV/XLSX) and returns them as
standardized LangChain Document objects for downstream indexing or
retrieval-augmented generation.
- Detects the input file format and dispatches to a dedicated converter.
- PDF converter uses Docling for parsing and OCR, extracting text and
- Drawing PDFs are handled by a drawing-focused pipeline that prompts
- Images are encoded and described via an LLM prompt; tables are
- Docling / docling_core: PDF parsing, OCR and image artifact extraction.
- LangChain / langchain_core: Standard Document model used as output.
- LangGraph: Small state graph to route files by detected format.
- OpenAI-compatible LLM (via langchain_openai.ChatOpenAI): image and drawing description prompts and refinement.
- PyMuPDF (fitz) and OpenCV (cv2): page rendering and image handling.
- Transformers (HuggingFace tokenizer): token-aware chunking for text.
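
The routing by detected format mentioned above can be illustrated with a minimal, standard-library-only sketch. It does not use LangGraph and is not Duckling's actual implementation: the converter stubs, the `CONVERTERS` mapping, and the `route` function are hypothetical names, and detecting the format from the file extension is an assumption.

```python
from pathlib import Path
from typing import Callable, Dict, List


# Hypothetical converter stubs; the real converters are not shown in this README.
def convert_pdf(path: Path) -> List[str]:
    return [f"pdf:{path.name}"]


def convert_image(path: Path) -> List[str]:
    return [f"image:{path.name}"]


def convert_table(path: Path) -> List[str]:
    return [f"table:{path.name}"]


# Assumed mapping from lowercase file extension to converter.
CONVERTERS: Dict[str, Callable[[Path], List[str]]] = {
    ".pdf": convert_pdf,
    ".png": convert_image,
    ".jpg": convert_image,
    ".jpeg": convert_image,
    ".csv": convert_table,
    ".xlsx": convert_table,
}


def detect_format(path: Path) -> str:
    """Detect the format from the extension (an assumption for this sketch)."""
    return path.suffix.lower()


def route(path_str: str) -> List[str]:
    """Dispatch a file to the converter registered for its detected format."""
    path = Path(path_str)
    fmt = detect_format(path)
    converter = CONVERTERS.get(fmt)
    if converter is None:
        raise ValueError(f"Unsupported file format: {fmt or '(none)'}")
    return converter(path)
```

A lookup table keeps the dispatch rule in one place, so supporting another extension in this sketch means adding a single entry.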
- Create and activate a virtual environment and install dependencies:
python -m venv .venv
& .\.venv\Scripts\Activate.ps1
poetry install
- Ensure LLM credentials and any environment settings are available (for example, place keys in a .env file read by the app).
- Example usage (Python):
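from duckling.graph import DucklingGraph

# Create the graph, then run it on a single input file.
graph = DucklingGraph()
state = graph.run(r"C:\path\to\file.pdf", namespace="my-namespace")

# Read the extracted Document objects from the returned state.
documents = state.get("documents", [])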
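
To look at what came back, a small helper along the following lines may be useful. It is a sketch, not part of Duckling: `summarize_documents` is a hypothetical name, and the `"source"` metadata key is an assumption about what the converters record. It relies only on the `page_content` and `metadata` attributes that LangChain Document objects expose, so stand-in objects are used here to keep it runnable without Duckling installed.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def summarize_documents(documents: Iterable[Any], preview_chars: int = 80) -> List[str]:
    """Return one short line per document: its source (if any) and a text preview."""
    lines = []
    for index, doc in enumerate(documents):
        # Collapse runs of whitespace so the preview stays on one line.
        text = " ".join(doc.page_content.split())
        # The "source" key is an assumption; fall back to "unknown" when absent.
        source = doc.metadata.get("source", "unknown")
        lines.append(f"[{index}] {source}: {text[:preview_chars]}")
    return lines


# Stand-in objects that mimic the two Document attributes used above.
sample = [
    SimpleNamespace(page_content="Hello   world\nfrom page one", metadata={"source": "file.pdf"}),
    SimpleNamespace(page_content="No metadata here", metadata={}),
]
for line in summarize_documents(sample):
    print(line)
```

With a real run, the list obtained from `state.get("documents", [])` can be passed in place of `sample`.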