OCR_LOCAL

Hook & Vision

OCR_LOCAL is a forensic-grade OCR platform for high-volume, mixed-format document processing. It turns scans, PDFs, and images into searchable, auditable outputs with deterministic fallback behavior, language-aware OCR routing, and optional enrichment layers.

The project exists to solve three hard problems at once:

Preserve evidentiary integrity when OCR quality is inconsistent.
Process large document sets without losing recoverability after failures.
Serve both local batch workflows and distributed queue-based workloads.

Note

This documentation suite reflects the canonical docs refresh on the current branch.

Architecture Overview

flowchart TB
    subgraph Ingestion["Ingestion"]
        FS["ocr_source/"] --> S1["Scheduler"]
        API["FastAPI /api"] --> S1
    end

    subgraph Pipeline["Async Pipeline"]
        S1 --> S2["CPU Extractors"]
        S2 --> S3["GPU OCR Workers"]
        S3 --> S4["Assembler"]
        S4 --> S5["Ghostscript Compression"]
        S5 --> OUT["ocr_output/EXPORT"]
    end

    subgraph Sidecars["Feature Sidecars"]
        S3 --> NER["NER / Extraction"]
        S3 --> BAR["Barcode / OMR"]
        S3 --> ML["LayoutLMv3 / Embeddings"]
    end

    subgraph Distributed["Distributed Coordinator"]
        D1["Django + Celery"] --> D2["RabbitMQ"]
        D1 --> D3["Redis"]
        D1 --> D4["PostgreSQL"]
    end

Tech Stack DNA

Layer	Stack
Core OCR	PaddleOCR, PaddlePaddle, Tesseract fallback
Processing	PyMuPDF, pdf2image, Pillow, OpenCV, NumPy
Language Routing	FastText (`lid.176.bin`)
API	FastAPI, SQLAlchemy, SlowAPI, WebSockets
Distributed Mode	Django, Celery, RabbitMQ, Redis, PostgreSQL
Observability	OpenTelemetry, Prometheus, Grafana
Runtime	Docker, Kubernetes, NVIDIA Container Toolkit, Ghostscript

Quickstart

1) Prerequisites

Requirement	Why
Docker + Compose	Primary runtime path
NVIDIA drivers + Toolkit	GPU acceleration for PaddleOCR
20+ GB free disk	Model cache and output artifacts

2) Build and start

docker compose up -d --build
docker ps --filter "name=ocr_gpu_processor"

3) Drop input files

Place PDFs/images into ocr_source/ (subfolders are supported).
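
To check what is waiting in ocr_source/ before starting a run, a recursive listing is usually enough. The sketch below is illustrative only: the helper name `find_inputs` and the extension set are assumptions, not the list of formats the pipeline actually accepts.

```python
from pathlib import Path

# Assumed extension set for illustration; the formats the pipeline
# really picks up may differ.
ASSUMED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def find_inputs(source_dir):
    """Return candidate input files under source_dir, including subfolders."""
    root = Path(source_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in ASSUMED_EXTENSIONS
    )


if __name__ == "__main__":
    for path in find_inputs("ocr_source"):
        print(path)
```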

4) Monitor processing

docker logs -f ocr_gpu_processor

5) Collect results

Artifacts are written under ocr_output/EXPORT/:

PDF/: searchable PDFs
TEXT/: plain text
STRUCTURE/, NER/, EXTRACTION/, CLASSIFICATION/, HANDWRITING/, VALIDATION/: one folder per enrichment or validation output
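
A quick way to confirm that a run produced output is to count the files in each export subfolder. This is a minimal sketch; `summarize_export` is a hypothetical helper, and it makes no assumption about file names or formats inside each folder.

```python
from pathlib import Path


def summarize_export(export_dir):
    """Count files in each immediate subfolder of the export directory."""
    root = Path(export_dir)
    counts = {}
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        # Count files at any depth below the subfolder.
        counts[sub.name] = sum(1 for f in sub.rglob("*") if f.is_file())
    return counts


if __name__ == "__main__":
    export = Path("ocr_output") / "EXPORT"
    if export.is_dir():
        for folder, count in summarize_export(export).items():
            print(f"{folder}: {count} file(s)")
    else:
        print(f"{export} does not exist yet")
```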

Tip

See docs/02-QUICKSTART-5-MINUTE-SUCCESS.md for exact environment variables and distributed startup.

Information Flow

flowchart LR
    A[ocr_source] --> B[Scheduler]
    B --> C[CPU Extractors]
    C --> D[GPU OCR Workers]
    D --> E[Assembler]
    E --> F[Ghostscript Compressors]
    F --> G[ocr_output/EXPORT]
    D --> H[Feature Sidecars]
    H --> G
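
The linear part of this flow can be pictured as a chain of workers connected by bounded queues. The sketch below is a toy model under stated assumptions: the stage names are taken from the diagram, each stage only records that it ran, the sidecar branch is omitted, and none of the function names correspond to the project's actual modules.

```python
import asyncio

_DONE = object()  # sentinel that tells each stage to shut down


async def stage(name, inbox, outbox):
    """Pull items from inbox, tag them with this stage's name, pass them on."""
    while True:
        item = await inbox.get()
        if item is _DONE:
            await outbox.put(_DONE)
            return
        await outbox.put(item + [name])


async def run_pipeline(documents):
    names = ["scheduler", "cpu_extract", "gpu_ocr", "assemble", "compress"]
    # Bounded queues give backpressure: a slow stage stalls the ones before it.
    queues = [asyncio.Queue(maxsize=4) for _ in range(len(names) + 1)]
    workers = [
        asyncio.create_task(stage(n, queues[i], queues[i + 1]))
        for i, n in enumerate(names)
    ]

    async def feed():
        for doc in documents:
            await queues[0].put([doc])
        await queues[0].put(_DONE)

    feeder = asyncio.create_task(feed())
    results = []
    while True:
        item = await queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    await asyncio.gather(feeder, *workers)
    return results


if __name__ == "__main__":
    for trace in asyncio.run(run_pipeline(["a.pdf", "b.png"])):
        print(" -> ".join(trace))
```

With one worker per stage, documents leave the pipeline in the order they entered; a real deployment with several workers per stage would need explicit ordering or per-document tracking.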

Use Cases

User Type	Primary Outcome	Typical Path
Developer	Integrate OCR into apps	FastAPI endpoints + SDKs
Operations Engineer	Scale throughput	Django/Celery distributed coordinator
Forensic Analyst	Preserve legal defensibility	Chain-of-custody + image-only fallback
Data Team	Extract structured signals	NER + classification + extraction sidecars

Transform and Stamping Support

OCR_LOCAL can modify documents after OCR to support forensic workflows:

Transforms: PDF page operations, format conversion, and preprocessing
Stamps: Bates numbering and confidentiality designation overlays
Forensic Safeguards: Custody logging, validation gates, hash-linked chains
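
The idea behind a hash-linked custody chain is that each log entry stores the hash of the previous entry, so any later edit breaks every link after it. The following is a minimal sketch of that general technique, not the project's custody implementation; the field names (`event`, `document_sha256`, `prev_hash`, `entry_hash`) and the all-zero genesis value are assumptions made for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def _digest(body):
    """Hash an entry body using a stable JSON serialization."""
    payload = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(chain, event, document_sha256):
    """Append a custody entry linked to the previous entry's hash."""
    prev_hash = chain[-1]["entry_hash"] if chain else GENESIS
    body = {
        "event": event,
        "document_sha256": document_sha256,
        "prev_hash": prev_hash,
    }
    chain.append({**body, "entry_hash": _digest(body)})
    return chain


def verify_chain(chain):
    """Return True only if every entry hashes correctly and links to its predecessor."""
    prev_hash = GENESIS
    for entry in chain:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if entry["prev_hash"] != prev_hash or entry["entry_hash"] != _digest(body):
            return False
        prev_hash = entry["entry_hash"]
    return True


if __name__ == "__main__":
    doc_hash = hashlib.sha256(b"example document bytes").hexdigest()
    log = []
    append_entry(log, "ingested", doc_hash)
    append_entry(log, "stamped", doc_hash)
    print("chain valid:", verify_chain(log))
```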

Enable via feature flags:

environment:
  - ENABLE_TRANSFORMS=true
  - ENABLE_STAMPING=true
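
For scripts that need to respect the same flags, the values can be read from the environment. How the service itself parses these variables is not stated here, so the accepted spellings below (`1`, `true`, `yes`, `on`) and the helper `flag_enabled` are assumptions for illustration.

```python
import os

# Assumed set of truthy spellings; the service's own parsing may differ.
_TRUTHY = {"1", "true", "yes", "on"}


def flag_enabled(name, environ=None):
    """Return True if the named environment flag is set to a truthy value."""
    env = os.environ if environ is None else environ
    return env.get(name, "").strip().lower() in _TRUTHY


if __name__ == "__main__":
    for flag in ("ENABLE_TRANSFORMS", "ENABLE_STAMPING"):
        print(f"{flag}: {flag_enabled(flag)}")
```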

See docs/07-TRANSFORMS-STAMPING.md for API endpoints and operator workflows.

Documentation Suite

Document	Purpose
docs/README.md	Navigation hub
docs/EXECUTIVE-SUMMARY.md	Stakeholder-level product summary
docs/00-SYSTEM-BLUEPRINT.md	Architecture baseline and boundaries
docs/01-TECH-STACK-DNA.md	Technologies, dependencies, and roles
docs/02-QUICKSTART-5-MINUTE-SUCCESS.md	Step-by-step setup
docs/03-INFORMATION-FLOWS.md	End-to-end data movement and API flow
docs/04-USE-CASES.md	Role-based scenarios
docs/05-INTERACTIVE-WALKTHROUGH.md	Guided codebase tour and entry points
docs/06-CONFIGURATION-REFERENCE.md	Environment variables and feature flags
docs/07-TRANSFORMS-STAMPING.md	Transform and stamp operations guide
docs/08-SDK-REFERENCE.md	SDK clients
docs/09-TROUBLESHOOTING.md	Common issues and logs
docs/10-MONITORING-OPERATIONS.md	Observability and operations
docs/11-ML-TRAINING-GUIDE.md	Model training and customization

License

Internal / Proprietary. Built on top of open-source dependencies including PaddleOCR (Apache 2.0).
