OCR Benchmarking System for Scientific Literature

This repository contains a comprehensive benchmarking framework for evaluating and comparing OCR systems on scientific literature. The project benchmarks three OCR systems (Marker, Docling, and PyMuPDF), with Marker emerging as the best performer at 82.6% character accuracy.

🎯 Project Status: ✅ COMPLETE

Deliverables Completed:

  • Three OCR systems benchmarked (Marker plus two others, as requested)
  • Google Colab Notebook with findings and analysis
  • Comprehensive metrics for scientific document processing
  • Processing time analysis and performance comparison
  • Real scientific literature used as test dataset

📊 Key Results

| OCR System | Character Accuracy | Processing Time | Best For |
|------------|--------------------|-----------------|----------|
| Marker     | 82.6%              | 39s avg         | Best Overall Performance |
| Docling    | 81.3%              | 91s avg         | Scientific Content Analysis |
| PyMuPDF    | Baseline           | <1s             | Speed Baseline |

🔍 Evaluation Metrics

The benchmarking system evaluates OCR systems across multiple dimensions:

Content Accuracy Metrics

  • Text Extraction Accuracy: Character, word, and line-level accuracy (see the sketch after this list)
  • Processing Performance: Speed, efficiency, resource usage
  • Scientific Content Handling: Formulas, notation, references, citations
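
The scoring itself lives in scripts/ocr_benchmark_gpu_optimized.py. As a rough illustration only, character- and word-level accuracy can be approximated with Python's standard difflib by comparing each OCR output against a reference text; the helper names below are illustrative, not the benchmark's actual API.

```python
import difflib

def character_accuracy(reference: str, ocr_text: str) -> float:
    """Approximate character-level accuracy as a similarity ratio in [0, 1]."""
    return difflib.SequenceMatcher(None, reference, ocr_text).ratio()

def word_accuracy(reference: str, ocr_text: str) -> float:
    """Approximate word-level accuracy by comparing token sequences."""
    return difflib.SequenceMatcher(None, reference.split(), ocr_text.split()).ratio()

if __name__ == "__main__":
    reference = "Insecticide-treated wall linings reduce malaria transmission."
    ocr_output = "Insecticide-treated wali linings reduce rnalaria transmission."
    print(f"char accuracy: {character_accuracy(reference, ocr_output):.3f}")
    print(f"word accuracy: {word_accuracy(reference, ocr_output):.3f}")
```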

Structure Parsing Metrics 🆕

  • Document Elements: Title, authors, abstract detection
  • Section Organization: Headers, paragraphs, reading order
  • Scientific Elements: Equations, tables, figures, references (counted in the sketch after this list)
  • Layout Preservation: Document hierarchy and structure
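
The full analysis is implemented in scripts/structure_parser.py. The sketch below only illustrates the counting idea on Markdown-style OCR output using simple regular expressions; the patterns and element names are assumptions, not the script's actual rules.

```python
import re

# Illustrative patterns for structural elements in Markdown-style OCR output.
PATTERNS = {
    "section_headers": r"(?m)^#{1,6}\s+\S",             # Markdown headings
    "equations":       r"(?s)\$\$.+?\$\$|\$[^$\n]+\$",  # display or inline math
    "tables":          r"(?m)^\|.+\|\s*$",              # pipe-delimited table rows
    "references":      r"\[\d{1,3}\]",                  # numeric citation markers
}

def count_structure_elements(markdown_text: str) -> dict:
    """Count occurrences of each structural element type in one OCR output."""
    return {name: len(re.findall(pattern, markdown_text))
            for name, pattern in PATTERNS.items()}

if __name__ == "__main__":
    sample = "# Abstract\n\nIncidence fell by 40% [12].\n\n| Arm | n |\n|-----|---|\n"
    print(count_structure_elements(sample))
```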

🔍 OCR Systems Benchmarked

Successfully Tested ✅

  • Marker ⭐ - Priority system (82.6% accuracy, 39s avg processing)
  • Docling - IBM's document AI (81.3% accuracy, 91s avg processing)
  • PyMuPDF - Direct text extraction baseline (<1s processing)

Result: Marker demonstrates the best overall performance for scientific literature OCR, balancing high accuracy with reasonable processing speed.
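
For context, the PyMuPDF baseline amounts to plain text extraction with no layout or OCR model involved, roughly as in this minimal sketch (not the benchmark script itself):

```python
import time
import fitz  # PyMuPDF

def extract_text_baseline(pdf_path: str):
    """Extract raw text from a PDF with PyMuPDF and report elapsed seconds."""
    start = time.perf_counter()
    with fitz.open(pdf_path) as doc:
        text = "\n".join(page.get_text() for page in doc)
    return text, time.perf_counter() - start

if __name__ == "__main__":
    text, elapsed = extract_text_baseline("pdfs/Allossogbe_et_al_2017_Mal_J.pdf")
    print(f"extracted {len(text)} characters in {elapsed:.2f}s")
```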

📂 Repository Structure

papers-ocr-benchmarks/
├── 📓 OCR_Benchmark_Scientific_Literature.ipynb  # 🎯 Google Colab Notebook (MAIN DELIVERABLE)
├── 📄 README.md                                  # Project documentation
├── 📋 requirements.txt                           # Python dependencies
│
├── 📁 scripts/                                   # Benchmark scripts
│   ├── ocr_benchmark_gpu_optimized.py           # Main benchmark script
│   └── setup_gpu_environment.py                 # Environment setup
│
├── 📁 pdfs/                                      # Test dataset (3 scientific papers)
│   ├── 2014-Combining_organophosphate_treated_wall_linings...pdf
│   ├── Allossogbe_et_al_2017_Mal_J.pdf
│   └── Somboon_et_al_1995_Trans_RSTMH.pdf
│
├── 📁 results/                                   # Benchmark results
│   ├── benchmark_results.csv                    # Detailed metrics
│   ├── benchmark_summary.csv                    # Summary statistics
│   ├── benchmark_visualization.png              # Performance charts
│   └── [individual_ocr_outputs].txt             # Raw OCR extractions
│
├── 📁 docs/                                      # Project documentation
│   └── Enhanced AI OCR Extraction Pipeline for Scientific Literature.md
│
└── 📁 output/                                    # Generated results directory

🚀 Quick Start

For Review (Recommended)

  1. View Results: Open OCR_Benchmark_Scientific_Literature.ipynb in Google Colab
  2. Check Metrics: Review results/benchmark_summary.csv (a loading sketch follows this list)
  3. Inspect Outputs: Examine individual OCR outputs in results/
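
To inspect the numbers programmatically rather than open the CSV by hand, a few lines of pandas are enough. The column names used below (system, char_accuracy, processing_time_s) are assumptions and may not match the actual file header.

```python
import pandas as pd

# NOTE: the column names below are assumptions; check the CSV header first.
df = pd.read_csv("results/benchmark_summary.csv")
print(df.head())

expected = {"system", "char_accuracy", "processing_time_s"}
if expected <= set(df.columns):
    summary = df.groupby("system")[["char_accuracy", "processing_time_s"]].mean()
    print(summary.sort_values("char_accuracy", ascending=False))
```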

To Run Benchmark Locally

# 1. Setup environment
python scripts/setup_gpu_environment.py

# 2. Run benchmark
python scripts/ocr_benchmark_gpu_optimized.py

# 3. Analyze document structure (NEW)
python scripts/structure_parser.py

# 4. Check results in results/ and examples/outputs/ directories

📊 Key Files

Main Deliverables

  • OCR_Benchmark_Scientific_Literature.ipynb - Google Colab notebook with complete analysis
  • results/latest_benchmark_results.csv - Latest benchmark results summary
  • examples/outputs/structure_comparison.csv - Structure parsing analysis
  • STRUCTURE_ANALYSIS_REPORT.md - Detailed structure parsing evaluation

Scripts

  • scripts/ocr_benchmark_gpu_optimized.py - Main benchmark script with GPU optimization
  • scripts/structure_parser.py - Document structure analysis tool
  • scripts/setup_gpu_environment.py - Environment setup and dependency checking

Data & Results

  • pdfs/ - Test dataset of 3 scientific papers
  • results/ - Complete benchmark results and individual OCR outputs
  • examples/outputs/ - Structured JSON outputs for each OCR system

📈 Benchmark Results Summary

Performance Metrics

  • Best Accuracy: Marker (82.6% character accuracy)
  • Fastest Processing: PyMuPDF (<1s per document)
  • Best Balance: Marker (high accuracy + reasonable speed)

Scientific Content Analysis

  • Citations Detected: All systems successfully identify reference citations
  • Figures/Tables: Good preservation of figure and table references
  • Mathematical Content: Basic formula detection implemented

Processing Time Analysis

  • Marker: 39s average (best AI-based performance)
  • Docling: 91s average (thorough but slower)
  • PyMuPDF: <1s (direct text extraction baseline)

🔧 Technical Details

Dependencies

pip install -r requirements.txt

Key packages:

  • marker-pdf - Marker OCR system
  • docling - IBM Docling system
  • PyMuPDF - PDF text extraction
  • pandas, numpy - Data analysis
  • torch - GPU acceleration (optional)

System Requirements

  • Python 3.8+
  • 8GB+ RAM recommended
  • GPU optional (automatic CPU fallback; see the check below)
  • 5GB+ storage for results
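
Environment checks are handled by scripts/setup_gpu_environment.py. At its core, an automatic CPU fallback is just a device check like the one below (a minimal sketch, assuming PyTorch is installed).

```python
import torch

# Choose the compute device; fall back to CPU when no GPU is present.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running OCR models on: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```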
