Historical OCR Pipeline

Overview

This project implements an end-to-end Optical Character Recognition (OCR) system for transcribing 17th-century Spanish printed texts using a CRNN (Convolutional Recurrent Neural Network) architecture with CTC loss.

The system is trained on the RODRIGO corpus (~15,000 line images) and is designed to handle historical typography, ligatures, and diacritics found in early modern manuscripts.

Key Features

End-to-end OCR pipeline using CNN + BiLSTM + CTC Loss
No character-level annotations required (sequence learning via CTC)
Handles historical Spanish text with diacritics and ligatures
Beam search decoding with optional lexicon constraints
LLM-based post-processing for error correction (Gemini / GPT-4o-mini)
Evaluation using CER (Character Error Rate) and WER (Word Error Rate)

Architecture

CRNN (CNN + RNN + CTC)

CNN (VGG-style): Extracts spatial features from text-line images
BiLSTM: Models sequential dependencies in both directions
CTC Loss: Enables alignment-free training for variable-length sequences

Design Choices

Architecture Choice: CRNN with CTC

A CRNN architecture is used because it enables sequence recognition from images without requiring character-level segmentation:

CNN extracts column-wise visual features (strokes, shapes, typography)
BiLSTM captures contextual dependencies and ligatures
CTC loss aligns predictions with text sequences without explicit alignment

LLM Post-Correction

A lightweight LLM (Gemini / GPT-4o-mini) is used as a post-processing step:

Corrects common OCR errors (e.g., rn → m, 0 → o, I → l)
Improves linguistic coherence and spelling
Does not replace the OCR model — only refines outputs

Evaluation Metrics

CER (Character Error Rate):
Measures character-level edit distance
→ Primary metric for OCR performance
WER (Word Error Rate):
Measures word-level errors
→ Sensitive to spacing and segmentation

Evaluation is performed:

Before LLM correction
After LLM correction

Text Region Detection

A morphology + contour-based approach is used to:

Detect main text regions
Remove marginalia, headers, and noise
Ensure model processes only relevant content

Dataset

RODRIGO OCR Dataset (Zenodo)
~15,000 line images from 16th–17th century Spanish texts

▶️ How to Run

pip install -r requirements.txt

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
src		src
CRNN_OCR.ipynb		CRNN_OCR.ipynb
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Historical OCR Pipeline

Overview

Key Features

Architecture

CRNN (CNN + RNN + CTC)

Design Choices

Architecture Choice: CRNN with CTC

LLM Post-Correction

Evaluation Metrics

Text Region Detection

Dataset

▶️ How to Run

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Historical OCR Pipeline

Overview

Key Features

Architecture

CRNN (CNN + RNN + CTC)

Design Choices

Architecture Choice: CRNN with CTC

LLM Post-Correction

Evaluation Metrics

Text Region Detection

Dataset

▶️ How to Run

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages