OCR Benchmarking System for Scientific Literature

This repository contains a comprehensive benchmarking framework for evaluating and comparing OCR systems on scientific literature. The project benchmarks three OCR systems (Marker, Docling, and PyMuPDF), with Marker emerging as the best performer at 82.6% character accuracy.

🎯 Project Status: ✅ COMPLETE

Deliverables Completed:

  • Three OCR systems benchmarked (Marker plus two others, as requested)
  • Google Colab Notebook with findings and analysis
  • Comprehensive metrics for scientific document processing
  • Processing time analysis and performance comparison
  • Real scientific literature used as test dataset

📊 Key Results

| OCR System | Character Accuracy | Processing Time | Best For |
|------------|--------------------|-----------------|----------|
| Marker     | 82.6%              | 39s avg         | Best Overall Performance |
| Docling    | 81.3%              | 91s avg         | Scientific Content Analysis |
| PyMuPDF    | Baseline           | <1s             | Speed Baseline |

🔍 Evaluation Metrics

The benchmarking system evaluates OCR systems across multiple dimensions:

Content Accuracy Metrics

  • Text Extraction Accuracy: Character, word, and line-level accuracy (see the sketch after this list)
  • Processing Performance: Speed, efficiency, resource usage
  • Scientific Content Handling: Formulas, notation, references, citations
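
The scoring itself lives in scripts/ocr_benchmark_gpu_optimized.py. As a rough illustration only, character- and word-level accuracy can be approximated with Python's standard difflib by comparing each OCR output against a reference text; the helper names below are illustrative, not the benchmark's actual API.

```python
import difflib

def character_accuracy(reference: str, ocr_text: str) -> float:
    """Approximate character-level accuracy as a similarity ratio in [0, 1]."""
    return difflib.SequenceMatcher(None, reference, ocr_text).ratio()

def word_accuracy(reference: str, ocr_text: str) -> float:
    """Approximate word-level accuracy by comparing token sequences."""
    return difflib.SequenceMatcher(None, reference.split(), ocr_text.split()).ratio()

if __name__ == "__main__":
    reference = "Insecticide-treated wall linings reduce malaria transmission."
    ocr_output = "Insecticide-treated wali linings reduce rnalaria transmission."
    print(f"char accuracy: {character_accuracy(reference, ocr_output):.3f}")
    print(f"word accuracy: {word_accuracy(reference, ocr_output):.3f}")
```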

Structure Parsing Metrics 🆕

  • Document Elements: Title, authors, abstract detection
  • Section Organization: Headers, paragraphs, reading order
  • Scientific Elements: Equations, tables, figures, references (counted in the sketch after this list)
  • Layout Preservation: Document hierarchy and structure
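
The full analysis is implemented in scripts/structure_parser.py. The sketch below only illustrates the counting idea on Markdown-style OCR output using simple regular expressions; the patterns and element names are assumptions, not the script's actual rules.

```python
import re

# Illustrative patterns for structural elements in Markdown-style OCR output.
PATTERNS = {
    "section_headers": r"(?m)^#{1,6}\s+\S",             # Markdown headings
    "equations":       r"(?s)\$\$.+?\$\$|\$[^$\n]+\$",  # display or inline math
    "tables":          r"(?m)^\|.+\|\s*$",              # pipe-delimited table rows
    "references":      r"\[\d{1,3}\]",                  # numeric citation markers
}

def count_structure_elements(markdown_text: str) -> dict:
    """Count occurrences of each structural element type in one OCR output."""
    return {name: len(re.findall(pattern, markdown_text))
            for name, pattern in PATTERNS.items()}

if __name__ == "__main__":
    sample = "# Abstract\n\nIncidence fell by 40% [12].\n\n| Arm | n |\n|-----|---|\n"
    print(count_structure_elements(sample))
```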

🔍 OCR Systems Benchmarked

Successfully Tested ✅

  • Marker ⭐ - Priority system (82.6% accuracy, 39s avg processing)
  • Docling - IBM's document AI (81.3% accuracy, 91s avg processing)
  • PyMuPDF - Direct text extraction baseline (<1s processing)

Result: Marker demonstrates the best overall performance for scientific literature OCR, balancing high accuracy with reasonable processing speed.
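
For context, the PyMuPDF baseline amounts to plain text extraction with no layout or OCR model involved, roughly as in this minimal sketch (not the benchmark script itself):

```python
import time
import fitz  # PyMuPDF

def extract_text_baseline(pdf_path: str):
    """Extract raw text from a PDF with PyMuPDF and report elapsed seconds."""
    start = time.perf_counter()
    with fitz.open(pdf_path) as doc:
        text = "\n".join(page.get_text() for page in doc)
    return text, time.perf_counter() - start

if __name__ == "__main__":
    text, elapsed = extract_text_baseline("pdfs/Allossogbe_et_al_2017_Mal_J.pdf")
    print(f"extracted {len(text)} characters in {elapsed:.2f}s")
```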

📂 Repository Structure

papers-ocr-benchmarks/
├── 📓 OCR_Benchmark_Scientific_Literature.ipynb  # 🎯 Google Colab Notebook (MAIN DELIVERABLE)
├── 📄 README.md                                  # Project documentation
├── 📋 requirements.txt                           # Python dependencies
│
├── 📁 scripts/                                   # Benchmark scripts
│   ├── ocr_benchmark_gpu_optimized.py           # Main benchmark script
│   └── setup_gpu_environment.py                 # Environment setup
│
├── 📁 pdfs/                                      # Test dataset (3 scientific papers)
│   ├── 2014-Combining_organophosphate_treated_wall_linings...pdf
│   ├── Allossogbe_et_al_2017_Mal_J.pdf
│   └── Somboon_et_al_1995_Trans_RSTMH.pdf
│
├── 📁 results/                                   # Benchmark results
│   ├── benchmark_results.csv                    # Detailed metrics
│   ├── benchmark_summary.csv                    # Summary statistics
│   ├── benchmark_visualization.png              # Performance charts
│   └── [individual_ocr_outputs].txt             # Raw OCR extractions
│
├── 📁 docs/                                      # Project documentation
│   └── Enhanced AI OCR Extraction Pipeline for Scientific Literature.md
│
└── 📁 output/                                    # Generated results directory

🚀 Quick Start

For Review (Recommended)

  1. View Results: Open OCR_Benchmark_Scientific_Literature.ipynb in Google Colab
  2. Check Metrics: Review results/benchmark_summary.csv (a loading sketch follows this list)
  3. Inspect Outputs: Examine individual OCR outputs in results/
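
To inspect the numbers programmatically rather than open the CSV by hand, a few lines of pandas are enough. The column names used below (system, char_accuracy, processing_time_s) are assumptions and may not match the actual file header.

```python
import pandas as pd

# NOTE: the column names below are assumptions; check the CSV header first.
df = pd.read_csv("results/benchmark_summary.csv")
print(df.head())

expected = {"system", "char_accuracy", "processing_time_s"}
if expected <= set(df.columns):
    summary = df.groupby("system")[["char_accuracy", "processing_time_s"]].mean()
    print(summary.sort_values("char_accuracy", ascending=False))
```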

To Run Benchmark Locally

# 1. Setup environment
python scripts/setup_gpu_environment.py

# 2. Run benchmark
python scripts/ocr_benchmark_gpu_optimized.py

# 3. Analyze document structure (NEW)
python scripts/structure_parser.py

# 4. Check results in results/ and examples/outputs/ directories

📊 Key Files

Main Deliverables

  • OCR_Benchmark_Scientific_Literature.ipynb - Google Colab notebook with complete analysis
  • results/latest_benchmark_results.csv - Latest benchmark results summary
  • examples/outputs/structure_comparison.csv - Structure parsing analysis
  • STRUCTURE_ANALYSIS_REPORT.md - Detailed structure parsing evaluation

Scripts

  • scripts/ocr_benchmark_gpu_optimized.py - Main benchmark script with GPU optimization
  • scripts/structure_parser.py - Document structure analysis tool
  • scripts/setup_gpu_environment.py - Environment setup and dependency checking

Data & Results

  • pdfs/ - Test dataset of 3 scientific papers
  • results/ - Complete benchmark results and individual OCR outputs
  • examples/outputs/ - Structured JSON outputs for each OCR system

📈 Benchmark Results Summary

Performance Metrics

  • Best Accuracy: Marker (82.6% character accuracy)
  • Fastest Processing: PyMuPDF (<1s per document)
  • Best Balance: Marker (high accuracy + reasonable speed)

Scientific Content Analysis

  • Citations Detected: All systems successfully identify reference citations
  • Figures/Tables: Good preservation of figure and table references
  • Mathematical Content: Basic formula detection implemented

Processing Time Analysis

  • Marker: 39s average (best AI-based performance)
  • Docling: 91s average (thorough but slower)
  • PyMuPDF: <1s (direct text extraction baseline)

🔧 Technical Details

Dependencies

pip install -r requirements.txt

Key packages:

  • marker-pdf - Marker OCR system
  • docling - IBM Docling system
  • PyMuPDF - PDF text extraction
  • pandas, numpy - Data analysis
  • torch - GPU acceleration (optional)

System Requirements

  • Python 3.8+
  • 8GB+ RAM recommended
  • GPU optional (automatic CPU fallback; see the check below)
  • 5GB+ storage for results
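
Environment checks are handled by scripts/setup_gpu_environment.py. At its core, an automatic CPU fallback is just a device check like the one below (a minimal sketch, assuming PyTorch is installed).

```python
import torch

# Choose the compute device; fall back to CPU when no GPU is present.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running OCR models on: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```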
