OCR_LOCAL is a forensic-grade OCR platform for high-volume, mixed-format document processing. It turns scans, PDFs, and images into searchable, auditable outputs with deterministic fallback behavior, language-aware OCR routing, and optional enrichment layers.
The project exists to solve three hard problems at once:
- Preserve evidentiary integrity when OCR quality is inconsistent.
- Process large document sets without losing recoverability after failures.
- Serve both local batch workflows and distributed queue-based workloads.
Note: This documentation suite reflects the broader canonical docs refresh for the current branch.
```mermaid
flowchart TB
    subgraph Ingestion["Ingestion"]
        FS["ocr_source/"] --> S1["Scheduler"]
        API["FastAPI /api"] --> S1
    end
    subgraph Pipeline["Async Pipeline"]
        S1 --> S2["CPU Extractors"]
        S2 --> S3["GPU OCR Workers"]
        S3 --> S4["Assembler"]
        S4 --> S5["Ghostscript Compression"]
        S5 --> OUT["ocr_output/EXPORT"]
    end
    subgraph Sidecars["Feature Sidecars"]
        S3 --> NER["NER / Extraction"]
        S3 --> BAR["Barcode / OMR"]
        S3 --> ML["LayoutLMv3 / Embeddings"]
    end
    subgraph Distributed["Distributed Coordinator"]
        D1["Django + Celery"] --> D2["RabbitMQ"]
        D1 --> D3["Redis"]
        D1 --> D4["PostgreSQL"]
    end
```

| Layer | Stack |
|---|---|
| Core OCR | PaddleOCR, PaddlePaddle, Tesseract fallback |
| Processing | PyMuPDF, pdf2image, Pillow, OpenCV, NumPy |
| Language Routing | FastText (lid.176.bin) |
| API | FastAPI, SQLAlchemy, SlowAPI, WebSockets |
| Distributed Mode | Django, Celery, RabbitMQ, Redis, PostgreSQL |
| Observability | OpenTelemetry, Prometheus, Grafana |
| Runtime | Docker, Kubernetes, NVIDIA Container Toolkit, Ghostscript |
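
The Language Routing row can be illustrated with a small sketch. FastText's `lid.176.bin` model returns labels of the form `__label__<iso-code>` together with a confidence score. Everything else below is an assumption made for illustration: the `ROUTES` table, the `MIN_CONFIDENCE` threshold, the default language, and the `route_language` helper are hypothetical and not taken from the OCR_LOCAL codebase. The detector is injected so the example runs without the model file.

```python
from typing import Callable, Tuple

# Hypothetical mapping from FastText ISO codes to OCR language packs.
# The project's real routing table may differ.
ROUTES = {
    "en": "en",
    "zh": "ch",
    "fr": "french",
    "de": "german",
    "ja": "japan",
    "ko": "korean",
}
DEFAULT_LANG = "en"    # assumed fallback language
MIN_CONFIDENCE = 0.5   # assumed threshold, not a documented setting

# A detector takes one line of text and returns (label, confidence).
Detector = Callable[[str], Tuple[str, float]]


def route_language(text: str, detect: Detector) -> str:
    """Pick an OCR language pack for a text sample."""
    # FastText's predict() rejects newlines, so collapse whitespace first.
    sample = " ".join(text.split())
    if not sample:
        return DEFAULT_LANG
    label, confidence = detect(sample)
    if confidence < MIN_CONFIDENCE:
        return DEFAULT_LANG
    iso = label.replace("__label__", "")
    return ROUTES.get(iso, DEFAULT_LANG)


def fake_detect(sample: str) -> Tuple[str, float]:
    """Stand-in detector; a real one would wrap fasttext's model.predict()."""
    if "bonjour" in sample.lower():
        return ("__label__fr", 0.93)
    return ("__label__en", 0.40)


if __name__ == "__main__":
    print(route_language("Bonjour le monde", fake_detect))
```

Falling back to a default language on low confidence keeps routing predictable for short or noisy text samples.
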
| Requirement | Why |
|---|---|
| Docker + Compose | Primary runtime path |
| NVIDIA drivers + Toolkit | GPU acceleration for PaddleOCR |
| 20+ GB free disk | Model cache and output artifacts |
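
The requirements above can be checked before starting the stack. This is a minimal preflight sketch, not a script shipped with the project: the `preflight` function and its messages are hypothetical, and looking for `nvidia-smi` on `PATH` is only a rough proxy for a working driver and Toolkit install.

```python
import shutil
from typing import List

MIN_FREE_GB = 20  # from the requirements table


def free_gb(path: str = ".") -> float:
    """Free disk space at `path`, in GiB."""
    return shutil.disk_usage(path).free / 1024 ** 3


def preflight(path: str = ".", min_free_gb: float = MIN_FREE_GB) -> List[str]:
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    if shutil.which("docker") is None:
        problems.append("docker not found on PATH")
    if shutil.which("nvidia-smi") is None:
        # Heuristic only: the driver may be present without this tool on PATH.
        problems.append("nvidia-smi not found; GPU acceleration may be unavailable")
    available = free_gb(path)
    if available < min_free_gb:
        problems.append(
            f"only {available:.1f} GB free disk at {path!r}; need {min_free_gb}+ GB"
        )
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARN:", problem)
```
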
```bash
docker compose up -d --build
docker ps --filter "name=ocr_gpu_processor"
```

Place PDFs/images into ocr_source/ (subfolders are supported).

```bash
docker logs -f ocr_gpu_processor
```

Artifacts are written under ocr_output/EXPORT/:

- PDF/: searchable PDFs
- TEXT/: plain text
- STRUCTURE/, NER/, EXTRACTION/, CLASSIFICATION/, HANDWRITING/, VALIDATION/
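
To confirm that a run produced output, the export folders listed above can be tallied with a short script. This is a sketch under assumptions: `count_artifacts` is a hypothetical helper, and it only counts files per folder because the file naming scheme inside each folder is not described here.

```python
from pathlib import Path
from typing import Dict

# Folder names taken from the artifact list above.
ARTIFACT_DIRS = [
    "PDF", "TEXT", "STRUCTURE", "NER",
    "EXTRACTION", "CLASSIFICATION", "HANDWRITING", "VALIDATION",
]


def count_artifacts(export_root: str = "ocr_output/EXPORT") -> Dict[str, int]:
    """Count files (recursively) in each known artifact folder."""
    root = Path(export_root)
    counts = {}
    for name in ARTIFACT_DIRS:
        folder = root / name
        if folder.is_dir():
            counts[name] = sum(1 for p in folder.rglob("*") if p.is_file())
        else:
            counts[name] = 0  # folder missing, e.g. the feature is disabled
    return counts


if __name__ == "__main__":
    for name, count in count_artifacts().items():
        print(f"{name:15} {count}")
```
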
Tip: See docs/02-QUICKSTART-5-MINUTE-SUCCESS.md for exact environment variables and distributed startup.
```mermaid
flowchart LR
    A[ocr_source] --> B[Scheduler]
    B --> C[CPU Extractors]
    C --> D[GPU OCR Workers]
    D --> E[Assembler]
    E --> F[Ghostscript Compressors]
    F --> G[ocr_output/EXPORT]
    D --> H[Feature Sidecars]
    H --> G
```

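
The main path of this flow (extract, OCR, assemble) can be sketched as a chain of `asyncio` queues. This is an illustrative toy, not the project's scheduler: the stage functions are stubs, the names `stage` and `run_pipeline` are hypothetical, and the sidecar branch and the Ghostscript step are left out. Bounded queues give simple backpressure between stages.

```python
import asyncio
from typing import List

STOP = None  # sentinel that tells each stage to shut down


async def stage(fn, inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    """Pull items from inbox, apply fn, push results to outbox."""
    while True:
        item = await inbox.get()
        if item is STOP:
            await outbox.put(STOP)  # propagate shutdown downstream
            return
        await outbox.put(await fn(item))


async def extract(doc: str) -> dict:
    # Stub for the CPU extractor: pretend every document has one page.
    return {"doc": doc, "pages": [f"{doc}#p1"]}


async def ocr(job: dict) -> dict:
    # Stub for the GPU OCR worker.
    await asyncio.sleep(0)
    return {**job, "text": [p.upper() for p in job["pages"]]}


async def assemble(job: dict) -> str:
    # Stub for the assembler.
    return f"{job['doc']}: {len(job['text'])} page(s)"


async def run_pipeline(docs: List[str]) -> List[str]:
    fns = (extract, ocr, assemble)
    queues = [asyncio.Queue(maxsize=4) for _ in range(len(fns) + 1)]
    workers = [
        asyncio.create_task(stage(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(fns)
    ]

    async def feed() -> None:
        for doc in docs:
            await queues[0].put(doc)
        await queues[0].put(STOP)

    # Feed concurrently so full queues cannot deadlock the producer.
    feeder = asyncio.create_task(feed())
    results = []
    while True:
        item = await queues[-1].get()
        if item is STOP:
            break
        results.append(item)
    await asyncio.gather(feeder, *workers)
    return results


if __name__ == "__main__":
    print(asyncio.run(run_pipeline(["a.pdf", "b.png"])))
```

With one worker per stage, results come out in input order; running several workers per stage would need explicit ordering or per-document identifiers.
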
| User Type | Primary Outcome | Typical Path |
|---|---|---|
| Developer | Integrate OCR into apps | FastAPI endpoints + SDKs |
| Operations Engineer | Scale throughput | Django/Celery distributed coordinator |
| Forensic Analyst | Preserve legal defensibility | Chain-of-custody + image-only fallback |
| Data Team | Extract structured signals | NER + classification + extraction sidecars |
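
The image-only fallback in the Forensic Analyst row follows a simple idea: try OCR engines in a fixed order and, if none yields text, keep the page as an image and record every attempt. The sketch below shows that pattern with stub engines. The `ocr_with_fallback` function, the `OcrResult` type, and the stubs are hypothetical; the real engine order and failure criteria are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class OcrResult:
    engine: str            # engine that produced the text, or "image-only"
    text: Optional[str]    # None when the page is kept as an image only
    attempts: List[str]    # audit trail of every engine tried, in order


Engine = Tuple[str, Callable[[bytes], str]]


def ocr_with_fallback(image: bytes, engines: List[Engine]) -> OcrResult:
    """Try engines in the given order; fall back to image-only if all fail."""
    attempts = []
    for name, run in engines:
        try:
            text = run(image)
        except Exception as exc:  # any engine failure triggers the next engine
            attempts.append(f"{name}: failed ({exc})")
            continue
        if text.strip():
            attempts.append(f"{name}: ok")
            return OcrResult(name, text, attempts)
        attempts.append(f"{name}: empty output")
    return OcrResult("image-only", None, attempts)


def paddle_stub(image: bytes) -> str:
    """Stand-in for the primary engine; always fails in this demo."""
    raise RuntimeError("GPU unavailable")


def tesseract_stub(image: bytes) -> str:
    """Stand-in for the fallback engine."""
    return "hello"


if __name__ == "__main__":
    engines = [("paddleocr", paddle_stub), ("tesseract", tesseract_stub)]
    print(ocr_with_fallback(b"...", engines))
```

Because the order is fixed and each attempt is logged, the same input and engine state always lead to the same recorded outcome.
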
OCR_LOCAL also provides post-OCR document modification capabilities for forensic workflows:
- Transforms: PDF page operations, format conversion, and preprocessing
- Stamps: Bates numbering and confidentiality designation overlays
- Forensic Safeguards: Custody logging, validation gates, hash-linked chains
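
A hash-linked custody chain, as named in the last bullet, can be sketched in a few lines: each log entry stores the hash of the previous entry, so editing any earlier record invalidates every later one. The field names (`seq`, `action`, `document_sha256`, `prev_hash`, `entry_hash`) and the helpers are hypothetical, not the project's actual schema; a real log would likely also carry timestamps and actor identity.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def _digest(body: Dict) -> str:
    """SHA-256 over a canonical JSON encoding of the entry body."""
    payload = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_event(chain: List[Dict], action: str, document_sha256: str) -> List[Dict]:
    """Append a custody event linked to the previous entry's hash."""
    prev_hash = chain[-1]["entry_hash"] if chain else GENESIS
    body = {
        "seq": len(chain),
        "action": action,
        "document_sha256": document_sha256,
        "prev_hash": prev_hash,
    }
    chain.append({**body, "entry_hash": _digest(body)})
    return chain


def verify_chain(chain: List[Dict]) -> bool:
    """Recompute every hash and check each link to its predecessor."""
    prev_hash = GENESIS
    for index, entry in enumerate(chain):
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if entry["seq"] != index or entry["prev_hash"] != prev_hash:
            return False
        if _digest(body) != entry["entry_hash"]:
            return False
        prev_hash = entry["entry_hash"]
    return True


if __name__ == "__main__":
    doc_hash = hashlib.sha256(b"example document bytes").hexdigest()
    log: List[Dict] = []
    append_event(log, "ingested", doc_hash)
    append_event(log, "bates_stamped", doc_hash)
    print(verify_chain(log))
```
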
Enable via feature flags:
```yaml
environment:
  - ENABLE_TRANSFORMS=true
  - ENABLE_STAMPING=true
```

See docs/07-TRANSFORMS-STAMPING.md for API endpoints and operator workflows.
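
For readers wiring these flags into their own tooling, the sketch below shows one common way to read boolean flags from the environment. The `flag_enabled` helper and the set of accepted truthy values are assumptions; only `true` appears in the configuration above, and the project's own parsing may be stricter.

```python
import os
from typing import Mapping

# Assumed truthy spellings; only "true" is confirmed by the config above.
TRUTHY = {"1", "true", "yes", "on"}


def flag_enabled(name: str, env: Mapping[str, str] = os.environ) -> bool:
    """Return True when the named flag is set to a truthy value."""
    return env.get(name, "").strip().lower() in TRUTHY


if __name__ == "__main__":
    for flag in ("ENABLE_TRANSFORMS", "ENABLE_STAMPING"):
        print(flag, flag_enabled(flag))
```

Treating an unset flag as disabled keeps document modification features off unless they are enabled explicitly.
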
| Document | Purpose |
|---|---|
| docs/README.md | Navigation hub |
| docs/EXECUTIVE-SUMMARY.md | Stakeholder-level product summary |
| docs/00-SYSTEM-BLUEPRINT.md | Architecture baseline and boundaries |
| docs/01-TECH-STACK-DNA.md | Technologies, dependencies, and roles |
| docs/02-QUICKSTART-5-MINUTE-SUCCESS.md | Step-by-step setup |
| docs/03-INFORMATION-FLOWS.md | End-to-end data movement and API flow |
| docs/04-USE-CASES.md | Role-based scenarios |
| docs/05-INTERACTIVE-WALKTHROUGH.md | Guided codebase tour and entry points |
| docs/06-CONFIGURATION-REFERENCE.md | Environment variables and feature flags |
| docs/07-TRANSFORMS-STAMPING.md | Transform and stamp operations guide |
| docs/08-SDK-REFERENCE.md | SDK clients |
| docs/09-TROUBLESHOOTING.md | Common issues and logs |
| docs/10-MONITORING-OPERATIONS.md | Observability and operations |
| docs/11-ML-TRAINING-GUIDE.md | Model training and customization |
Internal / Proprietary. Built on top of open-source dependencies including PaddleOCR (Apache 2.0).