A comprehensive evaluation framework for comparing OCR model outputs against Turkish real-world data. The tool reports detailed metrics for text, equations, tables, and images extracted from documents.
- Text Metrics: Normalized Edit Distance (NED) and Turkish character similarity
- Equation Metrics: BLEU-4, Character Dice Metric (CDM), and NED for LaTeX equations
- Table Metrics: NED and TEDS-like similarity for extracted tables
- Image Metrics: MSE and DreamSim
Install dependencies:

```
pip install -r requirements.txt
```

Basic usage:

```
python eval.py <ground_truth_path> <model_output_path> [results_path]
```

With image metrics:

```
python eval.py <ground_truth_path> <model_output_path> [results_path] --images
```

Example:

```
python eval.py ./data/ground_truth ./data/model_outputs ./results --images
```

Expected directory layout:

```
ground_truth/
├── data_1/
│   ├── document.md
│   └── figures/
│       ├── figure_1.png
│       └── figure_2.png
├── data_2/
│   ├── document.md
│   └── figures/
│       └── figure_1.png
└── ...

model_outputs/
├── data_1/
│   ├── result.md (or document.md)
│   └── images/ (or fig/, imgs/)
│       ├── figure_1.png
│       └── figure_2.png
└── ...

results/
├── per_doc_metrics.csv     # Metrics for each document
├── per_image_metrics.csv   # Metrics for each image pair
└── summary_metrics.csv     # Aggregated summary statistics
```
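To illustrate how model outputs might be paired with ground-truth documents under the layout above, here is a minimal sketch. The candidate file names (`result.md` / `document.md`) and figure-folder names mirror the conventions shown; the actual discovery logic in `eval.py` may differ.

```python
from pathlib import Path

# Candidate names follow the layout shown above; illustrative only.
MD_CANDIDATES = ("result.md", "document.md")

def find_md(doc_dir: Path):
    """Return the first markdown file found in doc_dir, or None."""
    for name in MD_CANDIDATES:
        p = doc_dir / name
        if p.is_file():
            return p
    return None

def pair_documents(gt_root: Path, out_root: Path):
    """Pair each ground_truth/data_N/*.md with its model-output counterpart."""
    pairs = []
    for gt_dir in sorted(p for p in gt_root.iterdir() if p.is_dir()):
        out_dir = out_root / gt_dir.name
        gt_md = find_md(gt_dir)
        out_md = find_md(out_dir) if out_dir.is_dir() else None
        if gt_md and out_md:
            pairs.append((gt_md, out_md))
    return pairs
```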
- NED (Normalized Edit Distance): Levenshtein distance normalized by length (lower is better, 0 = perfect match)
- Turkish Character Similarity: Specialized metric for Turkish diacritics (higher is better, 1 = perfect)
- BLEU-4: Standard BLEU score for LaTeX equations (higher is better, 1 = perfect)
- CDM (Character Dice Metric): F1-like metric for character overlap (higher is better, 1 = perfect)
- Equation NED: Edit distance for LaTeX strings (lower is better)
- Table NED: Edit distance on CSV-serialized tables (lower is better)
- TEDS-like: Tree Edit Distance-based similarity for table structure (higher is better, 1 = perfect)
- MSE: Mean Squared Error (lower is better)
- DreamSim: Perceptual similarity metric (lower is better)
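To make the text metrics concrete, here is a minimal sketch of NED and a Dice-style character overlap. This is an illustration of the definitions above, not the tool's implementation; in particular, the real CDM computation may differ in detail (e.g. tokenization or normalization of LaTeX).

```python
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def ned(a: str, b: str) -> float:
    """Normalized Edit Distance: 0.0 = perfect match, 1.0 = no overlap."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))

def char_dice(a: str, b: str) -> float:
    """F1-like character-overlap score over character multisets (1.0 = perfect)."""
    ca, cb = Counter(a), Counter(b)
    overlap = sum((ca & cb).values())
    total = len(a) + len(b)
    return 2 * overlap / total if total else 1.0
```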
Per-document metrics including:
- Text NED and Turkish character similarity
- Equation metrics (NED, BLEU, CDM)
- Table metrics (NED, TEDS)
- Image metrics
- Counts of extracted elements
Per-image-pair metrics:
- MSE and DreamSim for each image pair
- Source file paths
Aggregated statistics across all documents and images.
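As a rough illustration of how summary statistics could be derived from the per-document CSV, here is a stdlib-only sketch. The column names (`text_ned`, `turkish_sim`) and the sample values are hypothetical, not the actual `per_doc_metrics.csv` schema.

```python
import csv
import io
import statistics

# Hypothetical per-document metrics; column names are illustrative only.
PER_DOC_CSV = """doc,text_ned,turkish_sim
data_1,0.12,0.91
data_2,0.08,0.95
"""

rows = list(csv.DictReader(io.StringIO(PER_DOC_CSV)))
summary = {}
for col in ("text_ned", "turkish_sim"):
    vals = [float(r[col]) for r in rows]
    # Aggregate each metric column into simple summary statistics.
    summary[col] = {"mean": statistics.mean(vals),
                    "min": min(vals),
                    "max": max(vals)}
```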
If you use this tool in your research, please cite:
```bibtex
@misc{yılmaz2026ocrturkcomprehensiveocrbenchmark,
  title={OCRTurk: A Comprehensive OCR Benchmark for Turkish},
  author={Deniz Yılmaz and Evren Ayberk Munis and Çağrı Toraman and Süha Kağan Köse and Burak Aktaş and Mehmet Can Baytekin and Bilge Kaan Görür},
  year={2026},
  eprint={2602.03693},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.03693},
}
```