- SmolDocling is an efficient vision-language model for document conversion, with only 256 million parameters.
- It handles varied document types, including business documents, academic papers, and technical reports, and recognizes elements such as tables, charts, and code.
- It can be used from Python or via a command-line interface (CLI), with documentation and examples available on the official website.
- It processes documents at roughly 0.35 seconds per page on consumer-grade GPUs, offering high efficiency and low resource demands.
SmolDocling is a compact vision-language model developed by IBM Research and Hugging Face, designed for end-to-end document conversion. It excels at processing complex digital documents—such as business documents, academic papers, technical reports, patents, and tables—transforming them into structured, machine-readable formats. Unlike traditional approaches that rely on large foundation models or manual pipelines, SmolDocling is a single, efficient model with just 256 million parameters, making it 5-10 times smaller than comparable vision-language models (VLMs) while maintaining competitive performance.
It achieves this by generating DocTags, a novel universal markup format that captures all page elements, including content, structure, and spatial positioning. SmolDocling supports the recognition and reproduction of diverse elements like code listings, tables, equations, charts, lists, headings, and footnotes, making it particularly adept at handling complex layouts and nested structures.
SmolDocling offers multiple usage options to suit different user needs:
- Python Usage: Process documents with the `convert()` function from the Docling library, passing a document URL or local file path. Required libraries include `torch`, `docling_core`, and `transformers`.
- CLI Usage: Use the `docling` command-line tool, e.g., `docling https://arxiv.org/pdf/2206.01062` to convert a PDF. To select SmolDocling, add the `--pipeline vlm --vlm-model smoldocling` options.
- Inference Methods: Supports several frameworks, including Transformers for single-page image inference, vLLM for fast batch processing, ONNX inference, and MLX local inference on Apple Silicon.
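The two invocation styles can be sketched in Python. The CLI flags come straight from the example above, while the `DocumentConverter` call follows Docling's published Python API; the helper names here are ours, so treat this as a minimal sketch rather than the canonical usage:

```python
def build_smoldocling_cli(source: str) -> list[str]:
    """Assemble the `docling` CLI invocation that selects SmolDocling."""
    return ["docling", "--pipeline", "vlm", "--vlm-model", "smoldocling", source]

def convert_with_docling(source: str) -> str:
    """Convert a URL or local path to Markdown via Docling's Python API."""
    # Deferred import so the helper can be defined without docling installed.
    from docling.document_converter import DocumentConverter
    result = DocumentConverter().convert(source)
    return result.document.export_to_markdown()
```

For example, `convert_with_docling("https://arxiv.org/pdf/2206.01062")` returns the converted document as Markdown, while `subprocess.run(build_smoldocling_cli(path))` shells out to the CLI instead.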
Further details are available at SmolDocling-256M-preview and Docling Documentation.
Official resources for SmolDocling include:
- Official Documentation: Covers installation, usage, concepts, recipes, and extensions, accessible at Docling Documentation.
- Examples: Hands-on examples for various use cases are available at Docling Examples. Specific SmolDocling examples can be found at SmolDocling-256M-preview.
- Demo: A live demo on Hugging Face Spaces allows users to upload documents and view outputs, accessible at SmolDocling-256M-Demo.
SmolDocling is a compact vision-language model developed collaboratively by IBM Research and Hugging Face for end-to-end multimodal document conversion. Released on March 13, 2025, it stands out for its efficiency and low resource requirements, making it well suited for structured content extraction from complex documents. Below is a comprehensive analysis of SmolDocling’s introduction, usage, documentation, capabilities, and use cases.
SmolDocling addresses challenges in digital document conversion, particularly semantic parsing of complex layouts and PDFs, which are typically optimized for printing rather than machine readability. By generating DocTags—a new universal markup format—it captures all page elements, including content, structure, and spatial positioning. Unlike traditional methods that rely on large foundation models or manual pipelines, SmolDocling’s 256 million parameters make it 5-10 times smaller than other VLMs, significantly reducing computational complexity.
It handles diverse document types, including business documents, academic papers, technical reports, patents, and tables, expanding beyond the typical focus on scientific papers. It excels at complex layouts such as tables, charts, code blocks, and nested lists, delivering accurate content extraction and layout preservation.
| Feature | Description |
|---|---|
| Parameter Size | 256 million parameters, 5-10x smaller than large VLMs |
| Supported Document Types | Business documents, academic papers, technical reports, patents, tables |
| Key Advantages | High efficiency, low resource demands, end-to-end conversion |
A key reference is SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion, which details its design and performance.
Built on SmolVLM-256M, SmolDocling uses SigLIP base patch-16/512 (93 million parameters) as its vision backbone and the SmolLM-2 series (135 million parameters) as its language backbone. Its training data is rebalanced, with 41% focused on document understanding and 14% on image captioning, utilizing datasets like The Cauldron, Docmatix, and MathWriting.
The workflow involves:
- Encoding input page images via the vision encoder.
- Projecting and pooling images to generate embeddings, concatenated with text embeddings.
- Using a language model (LLM) to autoregressively predict DocTags.
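The three steps above can be sketched schematically in plain Python. The dimensions, the mean-pooling operation, and the pooling factor here are illustrative stand-ins, not the model's actual configuration (the projection layer is omitted for brevity):

```python
def mean_pool(embeddings, factor):
    """Pool consecutive patch embeddings by averaging groups of `factor`."""
    pooled = []
    for i in range(0, len(embeddings), factor):
        group = embeddings[i:i + factor]
        dim = len(group[0])
        pooled.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return pooled

def build_input_sequence(patch_embeddings, text_embeddings, pool_factor=2):
    """Pool vision features, then concatenate with text embeddings.

    The language backbone would then autoregressively predict DocTags
    from this combined sequence.
    """
    visual = mean_pool(patch_embeddings, pool_factor)
    return visual + text_embeddings
```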
DocTags, an XML-style markup format, represents block types (e.g., text, headings, equations) with nested positional tags (e.g., `<loc_x1><loc_y1><loc_x2><loc_y2>`) to preserve spatial information. Tables are represented via OTSL (Optimized Table Structure Language), and code listings and images are classified by type.
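A small parser illustrates the structure just described. The sample string below is illustrative, based only on the description of block tags wrapping four location tags; the exact tag vocabulary is defined by the SmolDocling model card:

```python
import re

# Illustrative DocTags-style output: a text block with its bounding box.
DOCTAGS = "<text><loc_10><loc_20><loc_110><loc_45>Introduction</text>"

def parse_doctags(s):
    """Extract (element_type, bounding_box, content) triples from a DocTags string."""
    pattern = re.compile(
        r"<(\w+)>"             # element type, e.g. text, picture
        r"((?:<loc_\d+>){4})"  # four location tags: x1, y1, x2, y2
        r"(.*?)"               # element content
        r"</\1>",              # matching closing tag
        re.S,
    )
    elements = []
    for tag, locs, content in pattern.findall(s):
        bbox = [int(n) for n in re.findall(r"\d+", locs)]
        elements.append((tag, bbox, content))
    return elements
```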
Training follows a curriculum learning approach: initially freezing the vision encoder to adapt the LLM to DocTags, then unfreezing it for further training on pretraining datasets (e.g., DocLayNet-PT with 1.4 million pages, Docmatix with 1.3 million) and task-specific datasets (e.g., 76,000 tables, 2.5 million charts, 9.3 million code snippets, 5.5 million equations).
| Training Dataset | Scale |
|---|---|
| DocLayNet-PT | 1.4 million pages |
| Docmatix | 1.3 million documents |
| Tables | 76,000 entries |
| Charts | 2.5 million entries |
| Code | 9.3 million snippets |
| Equations | 5.5 million entries |
SmolDocling enables end-to-end document conversion, capturing content, structure, and spatial positioning. It recognizes and reproduces elements like code listings, tables, equations, charts, lists, headings, footnotes, headers/footers, and section titles. The model supports isolated predictions (e.g., cropped elements) and nested structure grouping, such as nested lists.
Performance-wise, it processes pages in 0.35 seconds on consumer-grade GPUs, uses just 0.489 GB of VRAM, supports up to 8,192 tokens, and handles up to 3 pages simultaneously. It competes with larger VLMs (e.g., Qwen2.5 VL 7B, with 27x its parameters) while drastically reducing computational needs.
Additionally, SmolDocling contributes new public datasets for chart, table, equation, and code recognition, expected to be released soon, enhancing its generalization for document understanding tasks.
| Performance Metric | Value |
|---|---|
| Processing Time per Page | 0.35 seconds |
| VRAM Usage | 0.489 GB |
| Max Token Count | 8,192 |
| Max Pages | 3 |
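The figures in the table translate directly into capacity estimates. The sketch below assumes throughput scales linearly across independent model instances packed into one GPU's VRAM, which is an idealization real hardware only approximates:

```python
# Figures from the performance table above.
SECONDS_PER_PAGE = 0.35
VRAM_PER_INSTANCE_GB = 0.489

def estimate(pages: int, gpu_vram_gb: float) -> dict:
    """Rough wall-clock estimate for converting `pages` pages on one GPU."""
    instances = int(gpu_vram_gb // VRAM_PER_INSTANCE_GB)
    return {
        "instances": instances,
        "pages_per_second": round(instances / SECONDS_PER_PAGE, 2),
        "seconds_total": round(pages * SECONDS_PER_PAGE / instances, 2),
    }
```

On an 8 GB consumer GPU this suggests room for around 16 concurrent instances, so a 1,000-page corpus would take well under a minute of compute in the ideal case.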
SmolDocling supports various usage methods:
- Python Usage: Use the `convert()` function from the Docling library, providing a URL or local file path. Requires libraries such as `torch`, `docling_core`, and `transformers`.
- CLI Usage: Run `docling https://arxiv.org/pdf/2206.01062` with the `--pipeline vlm --vlm-model smoldocling` flags to use SmolDocling.
- Inference Methods: Compatible with Transformers (single-page inference), vLLM (fast batch processing), ONNX, and MLX (Apple Silicon).
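For batch back ends like vLLM, work must be grouped to respect the limit of 3 pages per request noted in the performance section; a minimal batching helper (the constant comes from that limit, the function name is ours):

```python
MAX_PAGES_PER_REQUEST = 3  # SmolDocling handles up to 3 pages simultaneously

def batch_pages(page_paths):
    """Split a list of page images into request-sized batches."""
    return [
        page_paths[i:i + MAX_PAGES_PER_REQUEST]
        for i in range(0, len(page_paths), MAX_PAGES_PER_REQUEST)
    ]
```

Each resulting batch can then be submitted as one inference request, keeping the back end fed without exceeding the page limit.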
Detailed guides are available at Docling Documentation and SmolDocling-256M-preview.
Official resources include:
- Official Documentation: Installation, usage, concepts, recipes, and extensions at Docling Documentation.
- Examples: Practical examples for different scenarios at Docling Examples, with SmolDocling-specific examples at SmolDocling-256M-preview.
- Demo: Interactive demo at SmolDocling-256M-Demo on Hugging Face Spaces.
These resources support users from beginners to advanced practitioners, ensuring quick adoption and application.
SmolDocling is ideal for:
- Document Digitization: Converting scanned documents or images into structured formats.
- Data Extraction: Extracting data from complex layouts like tables, patents, and technical reports.
- Automated Processing: Automating document tasks in business, academic, and research settings.
It supports full document conversion, chart-to-table conversion, equation-to-LaTeX, code-to-text, table-to-OTSL, position-specific OCR, element recognition, and header/footer detection, making it a robust tool for document understanding.
SmolDocling’s development includes contributions to new datasets (charts, tables, equations, code recognition), expected to be publicly available soon, advancing research and applications in document understanding. Its high efficiency (0.35 seconds per page) makes it particularly appealing for resource-constrained environments.
- SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion
- ds4sd/SmolDocling-256M-preview · Hugging Face
- Paper page - SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion
- SmolDocling: 256M OCR Model Processes Documents in 0.35s on Consumer GPUs • Tech Explorer
- IBM and Hugging Face Researchers Release SmolDocling: A 256M Open-Source Vision Language Model for Complete Document OCR - MarkTechPost
- Papers Explained 333: SmolDocling | by Ritvik Rastogi | Mar, 2025 | Medium
- Docling's new “SmolDocling-256M” Rocks - DEV Community
- r/MachineLearning on Reddit: [R] SmolDocling: A Compact Vision-Language Model for Complete Document Element Recognition and Markup Generation
- SmolVLM Grows Smaller – Introducing the 256M & 500M Models! - Hugging Face Blog