This project provides an end-to-end pipeline that aligns scanned receipts to a canonical template, extracts text from predefined regions of interest (ROIs) using Tesseract OCR, and parses the text into structured JSON.
```
.
├── config/roi_config.json   # Canvas size and ROI definitions
├── data/                    # Place the canonical template used for alignment
├── examples/                # Ground-truth JSON and (locally supplied) sample receipts
├── notebooks/               # Analysis notebooks (e.g., evaluation)
├── src/receipt_ocr/         # Python package implementation
├── tests/                   # Unit and integration tests
└── tools/                   # Utility scripts such as evaluation
```
- Install system dependencies (required for OpenCV and the Tesseract CLI):

  ```shell
  sudo apt-get update
  sudo apt-get install -y libgl1 tesseract-ocr
  ```
- Install Python dependencies:

  ```shell
  python -m venv .venv
  source .venv/bin/activate
  pip install -e .
  ```
- Capture a high-resolution template image (e.g., `data/template.jpg`) by placing a blank sample receipt on a flat surface with even lighting. Store the file in the `data/` directory; this repository does not ship binary assets by default. Align all subsequent receipts in a similar orientation when scanning or photographing. Similarly, supply your own receipt scans for evaluation in `examples/`; only lightweight JSON annotations are tracked in Git to avoid binary-file handling issues when opening pull requests.
- Tune ROIs in `config/roi_config.json` by opening the template in an image editor and measuring the pixel coordinates of each field. Update the JSON file with the `x`, `y`, `w`, and `h` values as well as field-specific Tesseract page segmentation modes (`psm`).
- Run the pipeline on a receipt:

  ```shell
  python -m receipt_ocr.cli --image path/to/scan.jpg --config config/roi_config.json --template data/template.jpg --output receipt.json
  ```
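The ROI configuration passed via `--config` pairs a canvas size with per-field pixel boxes and a Tesseract `psm`. The sketch below illustrates one plausible shape; the field names, coordinates, and exact key layout are illustrative assumptions, and the schema actually used by `config/roi_config.json` governs:

```json
{
  "canvas": {"w": 1688, "h": 3000},
  "fields": {
    "vendor": {"x": 160, "y": 120, "w": 1360, "h": 140, "psm": 7},
    "total": {"x": 980, "y": 2520, "w": 540, "h": 120, "psm": 7}
  }
}
```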
The CLI prints or saves the extracted JSON containing `vendor`, `date`, `total`, and an array of `items` with `name`, `qty`, and `price` fields.
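For a coffee-shop receipt, the output might be shaped like this (all values are illustrative, not produced by an actual run):

```json
{
  "vendor": "Contoso Cafe",
  "date": "2019-06-10",
  "total": "14.50",
  "items": [
    {"name": "Cappuccino", "qty": "1", "price": "3.50"},
    {"name": "Bagel", "qty": "2", "price": "11.00"}
  ]
}
```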
A helper script downloads an openly published sample receipt from the Azure Form Recognizer SDK test corpus. This keeps the Git history binary-free while still offering a reproducible demo.
```shell
python tools/fetch_sample_receipt.py \
    --template data/sample_template.jpg \
    --receipt examples/sample_receipt.jpg

python -m receipt_ocr.cli \
    --image examples/sample_receipt.jpg \
    --config config/roi_config.json \
    --template data/sample_template.jpg \
    --output sample_receipt.json
```

The helper downloads the same image for both the template and receipt inputs so the default ORB-based alignment succeeds out of the box. Replace these files with your own scans when calibrating ROIs for production use.
Note: the default `config/roi_config.json` is calibrated for the downloaded Contoso Cafe sample (native resolution 1688×3000). If you swap in a differently sized template, update the canvas dimensions and ROI coordinates accordingly.
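Rescaling the coordinates by hand is error-prone, so a small helper can do it proportionally. The snippet below assumes a hypothetical config shape with top-level `canvas` and `fields` keys; adapt the key names to the actual schema in `config/roi_config.json`:

```python
def scale_roi_config(config, new_w, new_h):
    """Rescale the canvas size and every ROI box to a new template resolution.

    Assumed (hypothetical) shape: {"canvas": {"w", "h"},
    "fields": {name: {"x", "y", "w", "h", ...}}}.
    """
    sx = new_w / config["canvas"]["w"]
    sy = new_h / config["canvas"]["h"]
    scaled = {"canvas": {"w": new_w, "h": new_h}, "fields": {}}
    for name, roi in config["fields"].items():
        scaled["fields"][name] = {
            **roi,  # keep non-geometric keys such as psm unchanged
            "x": round(roi["x"] * sx),
            "y": round(roi["y"] * sy),
            "w": round(roi["w"] * sx),
            "h": round(roi["h"] * sy),
        }
    return scaled
```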
Use the evaluation script to score OCR accuracy on a held-out dataset (after placing your template image in `data/` and local receipt scans in `examples/`):

```shell
python tools/evaluate.py --dataset examples/ --config config/roi_config.json --template data/template.jpg
```

The script aggregates field-wise accuracy, highlights misreads, and outputs a summary report. For exploratory analysis or visualization, create notebooks in `notebooks/` (e.g., `notebooks/evaluate.ipynb`) to inspect failure cases and tune ROIs.
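The field-wise aggregation can be pictured as exact-match accuracy per field over parallel lists of predicted and annotated receipts. This is a minimal sketch of the idea, not the actual implementation in `tools/evaluate.py`:

```python
def field_accuracy(predictions, ground_truth, fields=("vendor", "date", "total")):
    """Fraction of receipts whose OCR output exactly matches the annotation, per field.

    Normalization (strip + lowercase) is an illustrative choice; a real
    evaluator might also normalize dates and currency formats.
    """
    scores = {}
    for field in fields:
        hits = sum(
            1
            for pred, truth in zip(predictions, ground_truth)
            if str(pred.get(field, "")).strip().lower()
            == str(truth.get(field, "")).strip().lower()
        )
        scores[field] = hits / len(ground_truth) if ground_truth else 0.0
    return scores
```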
Tests mock OCR calls to keep results deterministic. Run them with:

```shell
pytest
```

Contributions are welcome! Please include updates to documentation, configuration, and tests when introducing new ROIs or changing parsing logic.