Benchmark OCR and Vision-Language Models for document understanding tasks.
Research Question: When should practitioners use OCR pipelines versus end-to-end VLMs for document intelligence tasks?
git clone https://github.com/kenza-ily/disco.git
cd disco
uv sync
cp .env.example .env.local
# Edit .env.local with your API keys
You only need credentials for the providers you want to use:
- Azure OpenAI (GPT-5, Document Intelligence OCR)
- Anthropic Direct API (Claude models - recommended)
- AWS Bedrock (alternative for Claude models)
- Mistral API (Mistral OCR)
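As a quick sanity check before running a benchmark, the sketch below reports which providers appear to be fully configured. The variable names are the ones in this README's `.env` example; the provider grouping and the function name `configured_providers` are assumptions for illustration, not part of the repository. It only inspects variables already present in the environment and does not parse `.env.local` itself.

```python
import os

# Variable names come from the .env example in this README.
# The grouping into providers is an assumption made for illustration.
PROVIDER_ENV_VARS = {
    "azure_openai": ["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY"],
    "azure_document_intelligence": [
        "AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT",
        "AZURE_DOCUMENT_INTELLIGENCE_KEY",
    ],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "aws_bedrock": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION"],
    "mistral": ["MISTRAL_API_KEY"],
}


def configured_providers(env=None):
    """Return the providers whose variables are all set and non-empty."""
    env = os.environ if env is None else env
    return [
        name
        for name, keys in PROVIDER_ENV_VARS.items()
        if all(env.get(key) for key in keys)
    ]


if __name__ == "__main__":
    print("Configured providers:", configured_providers())
```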
Datasets are loaded automatically from HuggingFace Hub:
huggingface-cli login
# Option 1: Use unified runner (recommended)
uv run python scripts/run_benchmark.py \
--dataset docvqa \
--models claude_sonnet \
--phases P-B \
--sample-limit 10
# Option 2: Call benchmark directly
uv run python -m benchmarks.dataset_specific.benchmark_docvqa \
--models claude_sonnet \
--phases P-B \
--sample-limit 10
Vision-Language Models (VLMs):
- GPT-5 mini (gpt-5-mini) - Azure OpenAI vision model (primary)
- GPT-5 nano (gpt-5-nano) - Lightweight Azure OpenAI vision model
- Claude 3.5 Sonnet (claude-3-5-sonnet) - Anthropic vision-language model
OCR Systems:
- Azure Document Intelligence (azure-ai-documentintelligence) - Enterprise-grade OCR with layout analysis
- Mistral OCR 2 (mistral-ocr-2505) - Document parsing with markdown output
- Mistral OCR 3 (mistral-ocr-2512) - Newer Mistral OCR version
- Claude Sonnet / Haiku - Via Anthropic Direct API or AWS Bedrock
- Qwen-VL - Open source vision-language model
- Donut - Open source document understanding
- DeepSeek-OCR - Multilingual OCR model
All datasets load automatically from HuggingFace (no local storage needed). Full collection: kenza-ily/disco
| Dataset | Task | Samples | HuggingFace URL |
|---|---|---|---|
| IAM | Handwriting recognition | 500 | kenza-ily/iam_disco |
| DocVQA | Document question answering | 500 | kenza-ily/docvqa_disco |
| InfographicVQA | Infographic QA | 500 | kenza-ily/infographicvqa_disco |
| DUDE | Diverse documents QA | 404 | kenza-ily/dude_disco |
| ChartQA Pro | Chart question answering | 494 | kenza-ily/chartqapro_disco |
| PubLayNet | Document layout parsing | 500 | kenza-ily/publaynet_disco |
| VisRBench | Visual reasoning (multi-page) | 498 | kenza-ily/visrbench_disco |
| ICDAR | Multilingual OCR (10 languages) | 500 | kenza-ily/icdar_disco |
| RxPad | Medical prescription parsing (French) | 200 | kenza-ily/rxpad_disco |
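For working with the data outside the benchmark scripts, a minimal sketch of loading one of these datasets with the Hugging Face `datasets` library is shown below. The repository IDs are taken from the table above; the short keys, the helper name `load_disco_dataset`, and the availability of any particular split or config are assumptions, so inspect the returned object before relying on it.

```python
# Repository IDs copied from the dataset table in this README.
# The short keys are hypothetical names chosen for this example.
DISCO_DATASETS = {
    "iam": "kenza-ily/iam_disco",
    "docvqa": "kenza-ily/docvqa_disco",
    "infographicvqa": "kenza-ily/infographicvqa_disco",
    "dude": "kenza-ily/dude_disco",
    "chartqapro": "kenza-ily/chartqapro_disco",
    "publaynet": "kenza-ily/publaynet_disco",
    "visrbench": "kenza-ily/visrbench_disco",
    "icdar": "kenza-ily/icdar_disco",
    "rxpad": "kenza-ily/rxpad_disco",
}


def load_disco_dataset(name, **kwargs):
    """Load a DISCO dataset from the Hugging Face Hub by its short name.

    Extra keyword arguments are passed through to datasets.load_dataset.
    """
    # Imported lazily so the mapping above is usable without the library.
    from datasets import load_dataset

    return load_dataset(DISCO_DATASETS[name], **kwargs)
```

Calling `load_disco_dataset("docvqa")` would download the data on first use; printing the result shows which splits and columns are actually present.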
The evaluation separates text parsing from downstream question answering across three pipeline architectures:
| Paper Name | Code Phase | Description |
|---|---|---|
| P_OCR | P-A | Pure OCR baseline — specialized OCR extracts text |
| P_VLM-base | P-B | VLM with generic text extraction prompt |
| P_VLM-task | P-C | VLM with task-aware, domain-specific prompt |
Question answering phases:
| Paper Name | Code Phase | Description |
|---|---|---|
| QA_OCR | QA1a/QA1b/QA1c | Specialized OCR → LLM reasoning (with simple, detailed, or CoT prompt) |
| QA_VLM-2stage | QA2a/QA2b | VLM extracts text → VLM performs QA (two-stage) |
| QA_VLM-direct | QA3a | Single-step VLM answers directly from the image |
| Multi-page retrieval | QA4a | Multi-page with retrieval (VisRBench only) |
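Because the paper and the code use different names for the same phases, a small lookup can help when moving between them. The mapping below is transcribed from the two tables above; the helper `code_phases` is hypothetical, and whether the runner's `--phases` flag accepts the QA identifiers in the same way as `P-A`/`P-B`/`P-C` is an assumption that should be checked against the scripts.

```python
# Paper name -> code phase identifiers, transcribed from the tables above.
PAPER_TO_CODE_PHASES = {
    "P_OCR": ["P-A"],
    "P_VLM-base": ["P-B"],
    "P_VLM-task": ["P-C"],
    "QA_OCR": ["QA1a", "QA1b", "QA1c"],
    "QA_VLM-2stage": ["QA2a", "QA2b"],
    "QA_VLM-direct": ["QA3a"],
    "Multi-page retrieval": ["QA4a"],
}


def code_phases(paper_names):
    """Flatten a list of paper names into code phase identifiers, in order."""
    phases = []
    for name in paper_names:
        phases.extend(PAPER_TO_CODE_PHASES[name])
    return phases


if __name__ == "__main__":
    # Space-separated, matching the style of the --phases examples in this README.
    print(" ".join(code_phases(["P_OCR", "P_VLM-base", "P_VLM-task"])))
```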
# Run all phases with Claude Sonnet
uv run python -m benchmarks.benchmark_docvqa \
--models claude_sonnet \
--phases P-A P-B P-C \
--sample-limit 50
uv run python -m benchmarks.benchmark_publaynet \
--models azure_intelligence gpt5_mini claude_sonnet \
--phases P-B \
--sample-limit 100
Empirical guidance from our evaluation across 9 datasets:
| Document Type | Recommended Approach | Notes |
|---|---|---|
| Handwritten text (IAM) | OCR pipeline | VLMs lag by 5–9% CER even with task-aware prompting |
| Multilingual documents (ICDAR) | VLM (generic prompt) | 87% CER reduction vs OCR; OCR fails on non-Latin scripts |
| Single-page visual QA (DocVQA, InfographicVQA) | Direct VQA | Highest GT-in-Pred (~0.91); fewer error propagation stages |
| Multi-page documents (DUDE) | OCR pipeline | More reliable text grounding; VLMs struggle with long context |
| Medical prescriptions (RxPad) | Either | Similar accuracy; VLMs produce structured key-value output |
Additional insights:
- Task-aware prompting has heterogeneous effects: it substantially improves multilingual parsing but can degrade performance on diverse inputs
- OCR system selection matters: Azure Document Intelligence consistently outperforms Mistral OCR on structured documents
- Mistral OCR 3 (2512) shows a 23-point regression vs Mistral OCR 2 (2505) on DocVQA, so a newer version number does not guarantee improvement
- Direct VQA offers the best speed-accuracy trade-off (0.87–0.91 GT-in-Pred, 4–10s latency vs 17–35s for two-stage pipelines)
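The figures above mix two kinds of comparison: a percentage reduction (as in "87% CER reduction") and a difference in points (as in "23-point regression"). The sketch below shows the arithmetic under the assumption that the former is a relative change against a baseline and the latter an absolute difference between two scores expressed in percent. The function names and the input values are hypothetical and are not results from the evaluation.

```python
def relative_reduction(baseline, new):
    """Fraction by which `new` is lower than `baseline` (e.g. for error rates)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (baseline - new) / baseline


def point_difference(old_score, new_score):
    """Absolute change between two scores in [0, 1], in percentage points."""
    return (new_score - old_score) * 100


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to illustrate the two calculations.
    print(f"relative reduction: {relative_reduction(0.40, 0.10):.0%}")
    print(f"point difference: {point_difference(0.60, 0.80):+.0f} points")
```

The distinction matters when baselines differ: the same absolute change is a large relative change on a small baseline and a small one on a large baseline.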
Results are saved in results/ with a structured pipeline:
results/
├── 0_info/ # Pipeline scripts and documentation
├── 1_raw/ # Raw experimental outputs (CSV per model)
├── 2_clean/ # Consolidated results (CSV per phase)
├── 3_notebooks/ # Analysis notebooks
└── 4_postprocessing/ # Aggregated statistics
Run full analysis pipeline:
cd results
python 0_info/run_pipeline.py --full
Open analysis notebooks:
cd results/3_notebooks
jupyter notebook 00_master_evaluation.ipynb
- benchmarks/ - Benchmark scripts for all datasets
  - benchmark.py - Base framework
  - dataset_specific/ - Dataset implementations
- datasets/ - HuggingFace dataset loaders
- metrics/ - Evaluation metrics (CER, WER, ANLS)
- models/ - Unified API for OCR + VLM models
- prompts/ - Task-specific prompts
- results/ - Benchmark results and analysis
  - 0_info/ - Pipeline documentation
  - 1_raw/ - Raw benchmark outputs
  - 2_clean/ - Consolidated results
  - 3_embeddings/ - Pre-computed embeddings
- scripts/ - Utility scripts
  - run_benchmark.py - Unified benchmark runner
- utils/ - Shared utilities
- archive/ - Archived code and old experiments
- GT-in-Pred (Ground-Truth-in-Prediction) - Primary QA metric; binary indicator whether the ground-truth answer appears in the model's prediction
- CER (Character Error Rate) - Character-level edit distance; primary parsing metric
- WER (Word Error Rate) - Word-level edit distance
- ANLS (Average Normalized Levenshtein Similarity) - String similarity for format compliance
- Cosine Similarity - Semantic similarity using embeddings
- EM (Exact Match) - Binary exact match
- Substring Match - Fuzzy matching for VQA
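To make the definitions above concrete, here is a minimal reference sketch of the string-based metrics. It is not the code in `metrics/`: the normalization (lowercasing, whitespace stripping), the ANLS threshold of 0.5 (the value commonly used for DocVQA-style evaluation), and all function names are assumptions, so scores may differ from the benchmark's own implementation.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference, prediction):
    """Character Error Rate: character edits divided by reference length."""
    return levenshtein(reference, prediction) / max(len(reference), 1)


def wer(reference, prediction):
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return levenshtein(ref_words, prediction.split()) / max(len(ref_words), 1)


def gt_in_pred(ground_truth, prediction):
    """1 if the normalized ground truth occurs inside the prediction, else 0."""
    gt = ground_truth.strip().lower()
    return int(bool(gt) and gt in prediction.strip().lower())


def exact_match(ground_truth, prediction):
    """1 if the normalized strings are identical, else 0."""
    return int(ground_truth.strip().lower() == prediction.strip().lower())


def anls(answers_per_question, predictions, threshold=0.5):
    """Average Normalized Levenshtein Similarity over a set of questions.

    Each question may have several accepted answers; the best one counts.
    Similarities whose normalized distance reaches the threshold score 0.
    """
    scores = []
    for answers, pred in zip(answers_per_question, predictions):
        p = pred.strip().lower()
        best = 0.0
        for ans in answers:
            a = ans.strip().lower()
            nl = levenshtein(a, p) / max(len(a), len(p), 1)
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

GT-in-Pred is deliberately more lenient than exact match: a verbose answer such as "The answer is Paris." still counts as correct for the ground truth "Paris", which suits free-form VLM output.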
# Run tests
make test
# Run linter
make lint
# Type checking
make typecheck
# Run all quality checks
make all
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your-key-here
AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT=https://your-resource.cognitiveservices.azure.com/
AZURE_DOCUMENT_INTELLIGENCE_KEY=your-key-here
ANTHROPIC_API_KEY=sk-ant-your-key-here
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key
AWS_REGION=us-east-1
MISTRAL_API_KEY=your-key-here
If you use this benchmark suite in your research, please cite:
@misc{benkirane2026disco,
title={DISCO: Document Intelligence Suite for COmparative Evaluation},
author={Benkirane, Kenza and Goldwater, Dan and Asenov, Martin and Ghodsi, Aneiss},
year={2026},
note={ICLR 2026 submission},
url={https://github.com/kenza-ily/disco}
}
MIT License - See LICENSE file for details.