This repository contains a comprehensive benchmarking framework that evaluates and compares OCR systems on scientific literature. The project benchmarks three OCR systems (Marker, Docling, PyMuPDF), with Marker achieving 82.6% character accuracy as the best performer.
Deliverables Completed:
- ✅ 3 OCR systems benchmarked (Marker + 2 others as requested)
- ✅ Google Colab Notebook with findings and analysis
- ✅ Comprehensive metrics for scientific document processing
- ✅ Processing time analysis and performance comparison
- ✅ Real scientific literature used as test dataset
| OCR System | Character Accuracy | Processing Time | Best For |
|---|---|---|---|
| Marker ⭐ | 82.6% | 39s avg | Best Overall Performance |
| Docling | 81.3% | 91s avg | Scientific Content Analysis |
| PyMuPDF | Baseline | <1s | Speed Baseline |
The benchmarking system evaluates OCR systems across multiple dimensions:
- Text Extraction Accuracy: Character, word, and line-level accuracy
- Processing Performance: Speed, efficiency, resource usage
- Scientific Content Handling: Formulas, notation, references, citations
- Document Elements: Title, authors, abstract detection
- Section Organization: Headers, paragraphs, reading order
- Scientific Elements: Equations, tables, figures, references
- Layout Preservation: Document hierarchy and structure
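Character-level accuracy, for example, can be approximated by aligning the OCR output against a reference text. A minimal sketch using Python's `difflib` (the benchmark scripts may compute this metric differently):

```python
import difflib

def char_accuracy(reference: str, ocr_output: str) -> float:
    """Approximate character-level accuracy as the similarity ratio
    between the reference text and the OCR output (0.0 to 1.0)."""
    if not reference:
        return 0.0
    matcher = difflib.SequenceMatcher(None, reference, ocr_output)
    return matcher.ratio()

print(f"{char_accuracy('scientific literature', 'scientifc literatvre'):.2f}")
```

Word- and line-level accuracy follow the same pattern with the inputs split on whitespace or newlines before alignment.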
- Marker ⭐ - Priority system (82.6% accuracy, 39s avg processing)
- Docling - IBM's document AI (81.3% accuracy, 91s avg processing)
- PyMuPDF - Direct text extraction baseline (<1s processing)
Result: Marker demonstrates the best overall performance for scientific literature OCR, balancing high accuracy with reasonable processing speed.
```
papers-ocr-benchmarks/
├── 📓 OCR_Benchmark_Scientific_Literature.ipynb  # 🎯 Google Colab Notebook (MAIN DELIVERABLE)
├── 📄 README.md                                  # Project documentation
├── 📋 requirements.txt                           # Python dependencies
│
├── 📁 scripts/                                   # Benchmark scripts
│   ├── ocr_benchmark_gpu_optimized.py            # Main benchmark script
│   └── setup_gpu_environment.py                  # Environment setup
│
├── 📁 pdfs/                                      # Test dataset (3 scientific papers)
│   ├── 2014-Combining_organophosphate_treated_wall_linings...pdf
│   ├── Allossogbe_et_al_2017_Mal_J.pdf
│   └── Somboon_et_al_1995_Trans_RSTMH.pdf
│
├── 📁 results/                                   # Benchmark results
│   ├── benchmark_results.csv                     # Detailed metrics
│   ├── benchmark_summary.csv                     # Summary statistics
│   ├── benchmark_visualization.png               # Performance charts
│   └── [individual_ocr_outputs].txt              # Raw OCR extractions
│
├── 📁 docs/                                      # Project documentation
│   └── Enhanced AI OCR Extraction Pipeline for Scientific Literature.md
│
└── 📁 output/                                    # Generated results directory
```
- View Results: Open `OCR_Benchmark_Scientific_Literature.ipynb` in Google Colab
- Check Metrics: Review `results/benchmark_summary.csv`
- Inspect Outputs: Examine individual OCR outputs in `results/`
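The summary CSV can also be inspected programmatically. A hypothetical sketch (the real column names in `results/benchmark_summary.csv` may differ) that ranks systems by character accuracy:

```python
import csv
import io

# Hypothetical rows mirroring the comparison table above; the actual
# results/benchmark_summary.csv may use different column names.
SUMMARY_CSV = """\
system,char_accuracy,avg_time_s
Marker,0.826,39
Docling,0.813,91
PyMuPDF,,1
"""

def rank_by_accuracy(csv_text: str) -> list:
    """Return system names sorted by character accuracy, best first;
    systems without an accuracy score (the baseline) sort last."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda r: float(r["char_accuracy"] or 0), reverse=True)
    return [r["system"] for r in rows]

print(rank_by_accuracy(SUMMARY_CSV))  # ['Marker', 'Docling', 'PyMuPDF']
```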
```bash
# 1. Setup environment
python scripts/setup_gpu_environment.py

# 2. Run benchmark
python scripts/ocr_benchmark_gpu_optimized.py

# 3. Analyze document structure (NEW)
python scripts/structure_parser.py

# 4. Check results in results/ and examples/outputs/ directories
```
- `OCR_Benchmark_Scientific_Literature.ipynb` - Google Colab notebook with complete analysis
- `results/latest_benchmark_results.csv` - Latest benchmark results summary
- `examples/outputs/structure_comparison.csv` - Structure parsing analysis
- `STRUCTURE_ANALYSIS_REPORT.md` - Detailed structure parsing evaluation
- `scripts/ocr_benchmark_gpu_optimized.py` - Main benchmark script with GPU optimization
- `scripts/structure_parser.py` - Document structure analysis tool
- `scripts/setup_gpu_environment.py` - Environment setup and dependency checking
- `pdfs/` - Test dataset of 3 scientific papers
- `results/` - Complete benchmark results and individual OCR outputs
- `examples/outputs/` - Structured JSON outputs for each OCR system
- Best Accuracy: Marker (82.6% character accuracy)
- Fastest Processing: PyMuPDF (<1s per document)
- Best Balance: Marker (high accuracy + reasonable speed)
- Citations Detected: All systems successfully identify reference citations
- Figures/Tables: Good preservation of figure and table references
- Mathematical Content: Basic formula detection implemented
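Detection of these elements can be done with simple pattern matching. A rough sketch (not the benchmark's actual detector) that counts bracketed citations and figure/table references in extracted text:

```python
import re

CITATION_RE = re.compile(r"\[\d+(?:[,-]\s*\d+)*\]")            # e.g. [3], [1, 2], [4-6]
FIG_TABLE_RE = re.compile(r"\b(?:Figure|Fig\.|Table)\s+\d+", re.IGNORECASE)

def count_scientific_elements(text: str) -> dict:
    """Count citation markers and figure/table references in OCR output."""
    return {
        "citations": len(CITATION_RE.findall(text)),
        "fig_table_refs": len(FIG_TABLE_RE.findall(text)),
    }

sample = "As shown in Figure 2 and Table 1, mortality rose [3], [4-6]."
print(count_scientific_elements(sample))  # {'citations': 2, 'fig_table_refs': 2}
```

Author-year citation styles (e.g. "Somboon et al., 1995") would need an additional pattern.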
- Marker: 39s average (best AI-based performance)
- Docling: 91s average (thorough but slower)
- PyMuPDF: <1s (direct text extraction baseline)
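Per-document averages like these can be reproduced with a simple timing harness. A sketch using `time.perf_counter` (the benchmark script may measure timing differently):

```python
import time

def time_extraction(extract_fn, pdf_paths, runs_per_doc: int = 1) -> float:
    """Return average wall-clock seconds per document for an extractor."""
    total = 0.0
    for path in pdf_paths:
        for _ in range(runs_per_doc):
            start = time.perf_counter()
            extract_fn(path)           # swap in a Marker/Docling/PyMuPDF call
            total += time.perf_counter() - start
    return total / (len(pdf_paths) * runs_per_doc)

# Usage with a stand-in extractor on hypothetical paths:
avg = time_extraction(lambda p: p.upper(), ["a.pdf", "b.pdf", "c.pdf"])
print(f"{avg:.6f}s per document")
```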
```bash
pip install -r requirements.txt
```
Key packages:
- `marker-pdf` - Marker OCR system
- `docling` - IBM Docling system
- `PyMuPDF` - PDF text extraction
- `pandas`, `numpy` - Data analysis
- `torch` - GPU acceleration (optional)
- Python 3.8+
- 8GB+ RAM recommended
- GPU optional (automatic CPU fallback)
- 5GB+ storage for results