mattmre/OCR_LOCAL

OCR_LOCAL

Python PaddleOCR FastAPI Django Docker Kubernetes

Hook & Vision

OCR_LOCAL is a forensic-grade OCR platform for high-volume, mixed-format document processing. It turns scans, PDFs, and images into searchable, auditable outputs with deterministic fallback behavior, language-aware OCR routing, and optional enrichment layers.

The project exists to solve three hard problems at once:

  • Preserve evidentiary integrity when OCR quality is inconsistent.
  • Process large document sets without losing recoverability after failures.
  • Serve both local batch workflows and distributed queue-based workloads.
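The first problem, inconsistent OCR quality, is what the deterministic fallback behavior addresses. A minimal sketch of such a chain follows, assuming each engine is a callable returning text plus a confidence score. The `Engine` signature, the `PageResult` type, and the 0.80 threshold are hypothetical illustrations, not the project's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical engine signature: page bytes in, (text, confidence in [0, 1]) out.
Engine = Callable[[bytes], Tuple[str, float]]


@dataclass(frozen=True)
class PageResult:
    text: Optional[str]  # None means the page is kept image-only
    engine: str          # name of the engine that produced the text
    confidence: float


def ocr_with_fallback(
    page: bytes,
    engines: Sequence[Tuple[str, Engine]],
    min_confidence: float = 0.80,  # assumed threshold, for illustration only
) -> PageResult:
    """Try engines in a fixed order and accept the first confident result."""
    for name, engine in engines:
        try:
            text, confidence = engine(page)
        except Exception:
            # A crashing engine is treated like a low-confidence one.
            continue
        if confidence >= min_confidence:
            return PageResult(text, name, confidence)
    # No engine was confident: keep the page as an image, emit no text layer.
    return PageResult(None, "image-only", 0.0)
```

Because the engine order is fixed and the acceptance rule is a single threshold, the same input always takes the same path, which is the property that makes the outcome auditable.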

Note

This documentation suite reflects the canonical docs refresh on the current branch.

Architecture Overview

flowchart TB
    subgraph Ingestion["Ingestion"]
        FS["ocr_source/"] --> S1["Scheduler"]
        API["FastAPI /api"] --> S1
    end

    subgraph Pipeline["Async Pipeline"]
        S1 --> S2["CPU Extractors"]
        S2 --> S3["GPU OCR Workers"]
        S3 --> S4["Assembler"]
        S4 --> S5["Ghostscript Compression"]
        S5 --> OUT["ocr_output/EXPORT"]
    end

    subgraph Sidecars["Feature Sidecars"]
        S3 --> NER["NER / Extraction"]
        S3 --> BAR["Barcode / OMR"]
        S3 --> ML["LayoutLMv3 / Embeddings"]
    end

    subgraph Distributed["Distributed Coordinator"]
        D1["Django + Celery"] --> D2["RabbitMQ"]
        D1 --> D3["Redis"]
        D1 --> D4["PostgreSQL"]
    end
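The staged hand-off in the "Async Pipeline" box can be pictured as a chain of bounded queues. The sketch below is only an illustration using `asyncio`: the stage names mirror the diagram, but the functions, queue sizes, and the fact that each stage merely records its name are assumptions, not the project's implementation.

```python
import asyncio

# Stage names taken from the diagram; the work each stage does is simulated.
STAGES = ["extract", "ocr", "assemble", "compress"]


async def stage(name, inbox, outbox):
    """Consume items from inbox, tag them with this stage, pass them on."""
    while True:
        item = await inbox.get()
        if item is None:            # sentinel: propagate shutdown downstream
            await outbox.put(None)
            return
        doc, history = item
        await outbox.put((doc, history + [name]))


async def run_pipeline(docs):
    # Bounded queues give backpressure: a slow stage stalls the ones before it.
    queues = [asyncio.Queue(maxsize=4) for _ in range(len(STAGES) + 1)]

    async def scheduler():
        for doc in docs:
            await queues[0].put((doc, []))
        await queues[0].put(None)

    async def collector():
        done = []
        while (item := await queues[-1].get()) is not None:
            done.append(item)
        return done

    results = await asyncio.gather(
        scheduler(),
        *(stage(n, queues[i], queues[i + 1]) for i, n in enumerate(STAGES)),
        collector(),
    )
    return results[-1]  # the collector's list


if __name__ == "__main__":
    for doc, history in asyncio.run(run_pipeline(["a.pdf", "b.png"])):
        print(doc, "->", " > ".join(history))
```

With one worker per stage, documents leave the pipeline in the order they entered. A real deployment would likely run several workers per stage, at which point ordering has to be restored by the assembler.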

Tech Stack DNA

| Layer | Stack |
| --- | --- |
| Core OCR | PaddleOCR, PaddlePaddle, Tesseract fallback |
| Processing | PyMuPDF, pdf2image, Pillow, OpenCV, NumPy |
| Language Routing | FastText (lid.176.bin) |
| API | FastAPI, SQLAlchemy, SlowAPI, WebSockets |
| Distributed Mode | Django, Celery, RabbitMQ, Redis, PostgreSQL |
| Observability | OpenTelemetry, Prometheus, Grafana |
| Runtime | Docker, Kubernetes, NVIDIA Container Toolkit, Ghostscript |
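To make the "Language Routing" row concrete, here is a hedged sketch of how a language-ID prediction could select an OCR model. The routing table, default, and probability threshold are hypothetical. The predictor is passed in as a callable so the example runs without the model file; it is assumed to return labels and probabilities in the shape fastText's `predict` uses (`("__label__en",)`, `[0.98]`).

```python
from typing import Callable, Sequence, Tuple

# Shape of a fastText-style prediction: (labels, probabilities), best first.
Predictor = Callable[[str], Tuple[Sequence[str], Sequence[float]]]

# Hypothetical routing table: language code -> OCR model family.
OCR_MODEL_BY_LANG = {
    "en": "latin",
    "de": "latin",
    "ru": "cyrillic",
    "zh": "chinese",
}
DEFAULT_MODEL = "latin"  # assumed default when detection is unsure


def route_language(sample_text: str, predict: Predictor,
                   min_prob: float = 0.5) -> str:
    """Pick an OCR model family from a language-ID prediction."""
    # fastText predicts one line at a time, so newlines are flattened first.
    labels, probs = predict(sample_text.replace("\n", " "))
    if len(labels) == 0 or probs[0] < min_prob:
        return DEFAULT_MODEL
    lang = labels[0].replace("__label__", "", 1)
    return OCR_MODEL_BY_LANG.get(lang, DEFAULT_MODEL)


# With the real model this would look roughly like (not run here):
#   import fasttext
#   model = fasttext.load_model("lid.176.bin")
#   route_language(text, lambda t: model.predict(t, k=1))
```

Falling back to a default model on low probability keeps routing deterministic for short or noisy text samples, where language identification is least reliable.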

Quickstart

1) Prerequisites

| Requirement | Why |
| --- | --- |
| Docker + Compose | Primary runtime path |
| NVIDIA drivers + Toolkit | GPU acceleration for PaddleOCR |
| 20+ GB free disk | Model cache and output artifacts |

2) Build and start

docker compose up -d --build
docker ps --filter "name=ocr_gpu_processor"

3) Drop input files

Place PDFs/images into ocr_source/ (subfolders are supported).
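Before starting a large run, it can help to preview what would be picked up. The following is a small sketch, not the scheduler's actual discovery logic: the `discover_inputs` helper and the extension list are assumptions chosen for illustration.

```python
from pathlib import Path

# Assumed set of input extensions; the real scheduler may accept others.
SUPPORTED = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def discover_inputs(root):
    """Return supported files under root, including subfolders, sorted."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )


if __name__ == "__main__":
    for path in discover_inputs("ocr_source"):
        print(path)
```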

4) Monitor processing

docker logs -f ocr_gpu_processor

5) Collect results

Artifacts are written under ocr_output/EXPORT/:

  • PDF/: searchable PDFs
  • TEXT/: plain text
  • STRUCTURE/, NER/, EXTRACTION/, CLASSIFICATION/, HANDWRITING/, VALIDATION/
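A quick way to check that a run produced output is to count files per artifact folder. This is a hypothetical helper that assumes only the `ocr_output/EXPORT/` layout listed above.

```python
from collections import Counter
from pathlib import Path


def summarize_export(export_root):
    """Count files under each top-level folder of the export directory."""
    root = Path(export_root)
    counts = Counter()
    for path in root.rglob("*"):
        parts = path.relative_to(root).parts
        # Skip directories and any stray files sitting directly in the root.
        if path.is_file() and len(parts) > 1:
            counts[parts[0]] += 1
    return dict(counts)


if __name__ == "__main__":
    for folder, n in sorted(summarize_export("ocr_output/EXPORT").items()):
        print(f"{folder}/: {n} file(s)")
```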

Tip

See docs/02-QUICKSTART-5-MINUTE-SUCCESS.md for exact environment variables and distributed startup.

Information Flow

flowchart LR
    A[ocr_source] --> B[Scheduler]
    B --> C[CPU Extractors]
    C --> D[GPU OCR Workers]
    D --> E[Assembler]
    E --> F[Ghostscript Compressors]
    F --> G[ocr_output/EXPORT]
    D --> H[Feature Sidecars]
    H --> G

Use Cases

| User Type | Primary Outcome | Typical Path |
| --- | --- | --- |
| Developer | Integrate OCR into apps | FastAPI endpoints + SDKs |
| Operations Engineer | Scale throughput | Django/Celery distributed coordinator |
| Forensic Analyst | Preserve legal defensibility | Chain-of-custody + image-only fallback |
| Data Team | Extract structured signals | NER + classification + extraction sidecars |

Transform and Stamping Support

Post-OCR document modification capabilities for forensic workflows:

  • Transforms: PDF page operations, format conversion, and preprocessing
  • Stamps: Bates numbering and confidentiality designation overlays
  • Forensic Safeguards: Custody logging, validation gates, hash-linked chains
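The idea behind a hash-linked custody chain is that every log entry commits to the hash of the entry before it, so editing any earlier record invalidates everything after it. The sketch below illustrates the general technique with SHA-256. The field names and helper functions are hypothetical and do not describe the project's actual log format.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body):
    """Hash a canonical JSON encoding so the result is reproducible."""
    payload = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_event(chain, action, doc_sha256, timestamp):
    """Append a custody event that commits to the previous entry's hash."""
    prev_hash = chain[-1]["entry_hash"] if chain else GENESIS
    body = {
        "action": action,          # e.g. "stamp" or "transform" (illustrative)
        "doc_sha256": doc_sha256,  # hash of the document after the action
        "timestamp": timestamp,
        "prev_hash": prev_hash,
    }
    chain.append({**body, "entry_hash": _digest(body)})


def verify_chain(chain):
    """Return True only if every link and every entry hash checks out."""
    prev_hash = GENESIS
    for entry in chain:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if body["prev_hash"] != prev_hash:
            return False
        if _digest(body) != entry["entry_hash"]:
            return False
        prev_hash = entry["entry_hash"]
    return True
```

A verifier only needs the log itself: recomputing each digest in order detects both edited entries and entries that were removed or reordered.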

Enable via feature flags:

environment:
  - ENABLE_TRANSFORMS=true
  - ENABLE_STAMPING=true
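On the application side, flags like these are typically read from the process environment. The helper below is a sketch under assumptions: the source only shows the value `true`, so the other accepted spellings and the default of "disabled" are illustrative choices.

```python
import os

# Assumed truthy spellings; only "true" appears in the compose snippet.
_TRUTHY = {"1", "true", "yes", "on"}


def flag_enabled(name, environ=None):
    """Return True if the named feature flag is set to a truthy value."""
    env = os.environ if environ is None else environ
    return env.get(name, "false").strip().lower() in _TRUTHY


if __name__ == "__main__":
    for flag in ("ENABLE_TRANSFORMS", "ENABLE_STAMPING"):
        print(flag, "=", flag_enabled(flag))
```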

See docs/07-TRANSFORMS-STAMPING.md for API endpoints and operator workflows.

Documentation Suite

| Document | Purpose |
| --- | --- |
| docs/README.md | Navigation hub |
| docs/EXECUTIVE-SUMMARY.md | Stakeholder-level product summary |
| docs/00-SYSTEM-BLUEPRINT.md | Architecture baseline and boundaries |
| docs/01-TECH-STACK-DNA.md | Technologies, dependencies, and roles |
| docs/02-QUICKSTART-5-MINUTE-SUCCESS.md | Step-by-step setup |
| docs/03-INFORMATION-FLOWS.md | End-to-end data movement and API flow |
| docs/04-USE-CASES.md | Role-based scenarios |
| docs/05-INTERACTIVE-WALKTHROUGH.md | Guided codebase tour and entry points |
| docs/06-CONFIGURATION-REFERENCE.md | Environment variables and feature flags |
| docs/07-TRANSFORMS-STAMPING.md | Transform and stamp operations guide |
| docs/08-SDK-REFERENCE.md | SDK clients |
| docs/09-TROUBLESHOOTING.md | Common issues and logs |
| docs/10-MONITORING-OPERATIONS.md | Observability and operations |
| docs/11-ML-TRAINING-GUIDE.md | Model training and customization |

License

Internal / Proprietary. Built on top of open-source dependencies including PaddleOCR (Apache 2.0).

About

Forensic-grade OCR platform for high-volume document processing. PaddleOCR with Tesseract fallback, language-aware routing, distributed queue processing, and evidentiary integrity preservation.
