Key Points

  • Research indicates that SmolDocling is an efficient vision-language model designed for document conversion, with only 256 million parameters.
  • It appears capable of handling various document types, such as business documents, academic papers, and technical reports, recognizing elements like tables, charts, and code.
  • Usage methods include Python and command-line interface (CLI), with documentation and examples available on the official website.
  • Evidence shows it processes documents at 0.35 seconds per page on consumer-grade GPUs, offering high efficiency and low resource demands.

Introduction

SmolDocling is a compact vision-language model developed by IBM Research and Hugging Face, designed for end-to-end document conversion. It excels at processing complex digital documents—such as business documents, academic papers, technical reports, patents, and tables—transforming them into structured, machine-readable formats. Unlike traditional approaches that rely on large foundation models or manual pipelines, SmolDocling is a single, efficient model with just 256 million parameters, making it 5-10 times smaller than comparable vision-language models (VLMs) while maintaining competitive performance.

It achieves this by generating DocTags, a novel universal markup format that captures all page elements, including content, structure, and spatial positioning. SmolDocling supports the recognition and reproduction of diverse elements like code listings, tables, equations, charts, lists, headings, and footnotes, making it particularly adept at handling complex layouts and nested structures.
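To make the format concrete, here is a short, constructed DocTags fragment for a page with one heading and one paragraph. The tag names and the four-part `<loc_*>` bounding-box convention follow the pattern described for DocTags, but this specific snippet is an illustration, not actual model output:

```xml
<doctag>
  <section_header_level_1><loc_40><loc_30><loc_460><loc_60>Introduction</section_header_level_1>
  <text><loc_40><loc_70><loc_460><loc_180>SmolDocling converts document pages into structured markup.</text>
</doctag>
```

Each element carries both its content and its position on the page, which is what lets a single autoregressive output stream encode layout as well as text.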


How to Use

SmolDocling offers multiple usage options to suit different user needs:

  • Python Usage: Process documents using the convert() function from the Docling library. For example, you can provide a document URL or local file path for conversion. Required libraries include torch, docling_core, and transformers.
  • CLI Usage: Use the docling command-line tool, e.g., docling https://arxiv.org/pdf/2206.01062 to convert a PDF. To specify SmolDocling, add the --pipeline vlm --vlm-model smoldocling options.
  • Inference Methods: Supports various frameworks, including Transformers for single-page image inference, vLLM for fast batch processing, ONNX inference, and MLX for local inference on Apple Silicon.
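For scripting, the CLI invocation above can be driven from Python. A minimal sketch, assuming the `docling` CLI is installed and on PATH; the `docling_cmd` helper is a hypothetical convenience for this example, not part of Docling itself:

```python
import shlex
import subprocess


def docling_cmd(source: str, use_smoldocling: bool = False) -> list[str]:
    """Build the docling CLI argument list described above."""
    cmd = ["docling"]
    if use_smoldocling:
        # Flags quoted in the text: select the VLM pipeline with SmolDocling.
        cmd += ["--pipeline", "vlm", "--vlm-model", "smoldocling"]
    cmd.append(source)
    return cmd


cmd = docling_cmd("https://arxiv.org/pdf/2206.01062", use_smoldocling=True)
print(shlex.join(cmd))
# To actually run the conversion (requires `pip install docling`):
# subprocess.run(cmd, check=True)
```

Keeping the argument list in one place makes it easy to toggle between the default pipeline and the SmolDocling VLM pipeline from batch scripts.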

Further details are available at SmolDocling-256M-preview and Docling Documentation.


Documentation and Examples

Official resources for SmolDocling include:

  • Official Documentation: Covers installation, usage, concepts, recipes, and extensions, accessible at Docling Documentation.
  • Examples: Hands-on examples for various use cases are available at Docling Examples. Specific SmolDocling examples can be found at SmolDocling-256M-preview.
  • Demo: A live demo on Hugging Face Spaces allows users to upload documents and view outputs, accessible at SmolDocling-256M-Demo.

Report

SmolDocling is a compact vision-language model developed collaboratively by IBM Research and Hugging Face for end-to-end multimodal document conversion. Released on March 13, 2025, it stands out for its efficiency and low resource requirements, making it ideal for structured content extraction from complex documents. Below is a comprehensive analysis of SmolDocling’s introduction, usage, documentation, capabilities, and use cases.

Introduction and Background

SmolDocling addresses challenges in digital document conversion, particularly semantic parsing of complex layouts and PDFs, which are typically optimized for printing rather than machine readability. By generating DocTags—a new universal markup format—it captures all page elements, including content, structure, and spatial positioning. Unlike traditional methods that rely on large foundation models or manual pipelines, SmolDocling’s 256 million parameters make it 5-10 times smaller than other VLMs, significantly reducing computational complexity.

Research shows it can handle diverse document types, including business documents, academic papers, technical reports, patents, and tables, expanding beyond the typical focus on scientific papers. It excels at managing complex layouts like tables, charts, code blocks, and nested lists, delivering accurate content extraction and layout preservation.

Feature summary:

  • Parameter size: 256 million, 5-10x smaller than large VLMs
  • Supported document types: business documents, academic papers, technical reports, patents, tables
  • Key advantages: high efficiency, low resource demands, end-to-end conversion

A key reference is SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion, which details its design and performance.

Architecture and Workflow

Built on SmolVLM-256M, SmolDocling uses SigLIP base patch-16/512 (93 million parameters) as its vision backbone and the SmolLM-2 series (135 million parameters) as its language backbone. Its training data is rebalanced, with 41% focused on document understanding and 14% on image captioning, utilizing datasets like The Cauldron, Docmatix, and MathWriting.

The workflow involves:

  1. The vision encoder encodes the input page image.
  2. The visual features are projected and pooled into embeddings, which are concatenated with the text embeddings.
  3. The language model autoregressively predicts the DocTags sequence.

DocTags, an XML-style markup format, represents block types (e.g., text, headings, equations) together with nested positional tags (e.g., <loc_x1><loc_y1><loc_x2><loc_y2>) that preserve spatial information. It represents tables via OTSL (Optimized Table Structure Language) and classifies code blocks and images.
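Because the positional tags are plain inline markers, downstream code can recover bounding boxes with simple string parsing. A minimal sketch in Python, assuming the `<loc_N>` quadruple convention described above (the coordinate scale and exact tag vocabulary are assumptions for illustration, not a documented Docling API):

```python
import re

# One bounding box is four consecutive loc tags: x1, y1, x2, y2.
LOC = re.compile(r"<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>")


def extract_boxes(doctags: str) -> list[tuple[int, int, int, int]]:
    """Return (x1, y1, x2, y2) for every loc quadruple in a DocTags string."""
    return [tuple(map(int, m.groups())) for m in LOC.finditer(doctags)]


sample = "<text><loc_40><loc_70><loc_460><loc_180>Some paragraph.</text>"
print(extract_boxes(sample))  # → [(40, 70, 460, 180)]
```

This kind of lightweight post-processing is what makes a single markup stream usable for layout-aware tasks such as region cropping or reading-order analysis.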

Training follows a curriculum learning approach: initially freezing the vision encoder to adapt the LLM to DocTags, then unfreezing it for further training on pretraining datasets (e.g., DocLayNet-PT with 1.4 million pages, Docmatix with 1.3 million) and task-specific datasets (e.g., 76,000 tables, 2.5 million charts, 9.3 million code snippets, 5.5 million equations).

Training dataset scale:

  • DocLayNet-PT: 1.4 million pages
  • Docmatix: 1.3 million documents
  • Tables: 76,000 entries
  • Charts: 2.5 million
  • Code snippets: 9.3 million
  • Equations: 5.5 million

Capabilities and Performance

SmolDocling enables end-to-end document conversion, capturing content, structure, and spatial positioning. It recognizes and reproduces elements like code listings, tables, equations, charts, lists, headings, footnotes, headers/footers, and section titles. The model supports isolated predictions (e.g., cropped elements) and nested structure grouping, such as nested lists.

Performance-wise, it processes pages in 0.35 seconds on consumer-grade GPUs, uses just 0.489 GB of VRAM, supports up to 8,192 tokens, and handles up to 3 pages simultaneously. It competes with larger VLMs (e.g., Qwen2.5 VL 7B, with 27x its parameters) while drastically reducing computational needs.

Additionally, SmolDocling contributes new public datasets for chart, table, equation, and code recognition, expected to be released soon, enhancing its generalization for document understanding tasks.

Performance metrics:

  • Processing time per page: 0.35 seconds
  • VRAM usage: 0.489 GB
  • Max token count: 8,192
  • Max pages processed simultaneously: 3
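The quoted figures imply substantial batch throughput on modest hardware. A quick back-of-envelope check using only the numbers above; real throughput will vary with page complexity and batching:

```python
seconds_per_page = 0.35  # per-page latency quoted above

# Sustained single-GPU throughput at that latency.
pages_per_hour = 3600 / seconds_per_page
print(round(pages_per_hour))  # → 10286 pages/hour

# Example: 50-page documents processed in an 8-hour batch window.
docs_per_window = (8 * 3600) / (50 * seconds_per_page)
print(int(docs_per_window))  # → 1645 documents
```

At roughly ten thousand pages per hour on a consumer-grade GPU, the model is practical for large archival conversion jobs without datacenter hardware.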

Usage Methods

SmolDocling supports various usage methods:

  • Python Usage: Use the convert() function from the Docling library, providing a URL or local file path. Requires libraries like torch, docling_core, and transformers.
  • CLI Usage: Run docling https://arxiv.org/pdf/2206.01062 with the --pipeline vlm --vlm-model smoldocling flags to use SmolDocling.
  • Inference Methods: Compatible with Transformers (single-page inference), vLLM (fast batch processing), ONNX, and MLX (Apple Silicon).

Detailed guides are available at Docling Documentation and SmolDocling-256M-preview.

Documentation and Examples

Official resources include:

  • Official Documentation: installation, usage, concepts, recipes, and extensions, at Docling Documentation.
  • Examples: hands-on examples at Docling Examples; SmolDocling-specific examples at SmolDocling-256M-preview.
  • Demo: a live demo on Hugging Face Spaces at SmolDocling-256M-Demo.

These resources support users from beginners to advanced practitioners, ensuring quick adoption and application.

Use Cases and Applications

SmolDocling is ideal for:

  • Document Digitization: Converting scanned documents or images into structured formats.
  • Data Extraction: Extracting data from complex layouts like tables, patents, and technical reports.
  • Automated Processing: Automating document tasks in business, academic, and research settings.

It supports full document conversion, chart-to-table conversion, equation-to-LaTeX, code-to-text, table-to-OTSL, position-specific OCR, element recognition, and header/footer detection, making it a robust tool for document understanding.

Additional Information

SmolDocling’s development includes contributions to new datasets (charts, tables, equations, code recognition), expected to be publicly available soon, advancing research and applications in document understanding. Its high efficiency (0.35 seconds per page) makes it particularly appealing for resource-constrained environments.
