GLM-OCR

A Streamlit web app for Optical Character Recognition powered by ZAI GLM-OCR. Upload images or PDFs and extract text, formulas, tables, or structured JSON fields — all running locally on your machine, no API key required.



Screenshots

  • App overview — main interface with upload panel (left) and recognition panel (right)
  • Sidebar monitor — hardware notice and live Activity Monitor
  • Information extraction — Information Extraction mode with JSON schema editor and preset selector


Features

  • Four extraction modes — Text, Formula, Table, and Information Extraction (JSON schema)
  • PDF support — renders every page at 2× DPI; navigate pages with a visual preview
  • Information Extraction presets — Personal ID, Invoice, Receipt, Business Card, or define your own JSON schema
  • Multi-page processing — run OCR on the current page, a custom range, or all pages at once
  • Live streaming output — results appear line-by-line as the model generates
  • Cancellable runs — a Stop button aborts before the next model.generate() call
  • Live Activity Monitor — sidebar shows CPU %, RAM, Swap, and app memory, refreshing every 2 seconds
  • Hardware-aware device selection — automatically picks CUDA, Apple MPS, or CPU based on your machine
  • Download results — per-page .txt or a combined all-pages file
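The Activity Monitor numbers come from psutil. A minimal snapshot function could look like the sketch below (the field names are illustrative, not the app's exact code):

```python
import os
import psutil

def resource_snapshot() -> dict:
    """One sample of the numbers shown in the sidebar monitor."""
    vm = psutil.virtual_memory()
    sw = psutil.swap_memory()
    app = psutil.Process(os.getpid())
    return {
        "cpu_percent": psutil.cpu_percent(interval=None),  # CPU % since last call
        "ram_used_gb": vm.used / 1024**3,                  # system RAM in use
        "swap_used_gb": sw.used / 1024**3,                 # swap in use
        "app_rss_mb": app.memory_info().rss / 1024**2,     # this process's memory
    }
```

In the app this runs inside an @st.fragment that reruns every 2 seconds, so only the monitor widget refreshes, not the whole page.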

Requirements

  • Python 3.10+
  • PyTorch 2.1+ (with CUDA or MPS support as applicable)
  • Streamlit 1.33+ (required for @st.fragment)

Installation

# 1. Clone or copy this project
git clone <your-repo-url>
cd glm-ocr-app

# 2. Create a virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate       # Windows: .venv\Scripts\activate

# 3. Install dependencies
pip install streamlit transformers torch torchvision pillow pymupdf psutil

macOS note: If you are on Apple Silicon, install the MPS-enabled build of PyTorch from pytorch.org.


Running the app

streamlit run app.py

The app opens in your browser at http://localhost:8501. The model (~4 GB) is downloaded from Hugging Face on first run and cached in ./models.


Hardware & Performance

The sidebar shows a live hardware notice explaining exactly what your machine will use and how fast to expect results:

Hardware                     Mode              Speed estimate
NVIDIA / AMD GPU (CUDA)      float16 on GPU    ~5–15 sec / page
Apple Silicon ≥ 16 GB RAM    bfloat16 on MPS   ~10–30 sec / page
Apple Silicon < 16 GB RAM    float32 on CPU    ~2–5 min / page
Any CPU (no GPU)             float32 on CPU    ~2–8 min / page

Why CPU on 8 GB Apple Silicon? GLM-OCR's KV cache during model.generate() needs roughly 6 GB of one contiguous Metal memory buffer. On an 8 GB M1/M2 Mac, after the OS kernel (~2 GB) and model weights (~2–3 GB) are loaded, Metal no longer has enough space for that allocation and will abort with an OOM crash. The app detects this at startup and falls back to CPU automatically — slow but stable.
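The fallback order in the table can be sketched as a pure function (hypothetical helper; the real app would probe torch.cuda.is_available(), torch.backends.mps.is_available(), and psutil for RAM before calling something like this):

```python
def pick_device(has_cuda: bool, has_mps: bool, total_ram_bytes: int,
                mps_min_ram_bytes: int = 16 * 1024**3) -> tuple[str, str]:
    """Return (device, dtype) following the selection order described above."""
    if has_cuda:
        return "cuda", "float16"    # discrete GPU: fastest path
    if has_mps and total_ram_bytes >= mps_min_ram_bytes:
        return "mps", "bfloat16"    # Apple Silicon with enough unified memory
    return "cpu", "float32"         # safe fallback: slow but stable
```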


Project structure

app.py                          # Streamlit entry point (~60 lines)
glm_ocr/
├── config.py                   # All tuneable constants
├── hardware.py                 # RAM detection + device/dtype selection
├── model_loader.py             # HuggingFace download check + model loading
├── pdf_utils.py                # PDF bytes → list of PIL images (PyMuPDF)
├── ocr_result.py               # OcrResult dataclass (no torch dependency)
├── input_builder.py            # Tokeniser + GPU memory cleanup
├── inference.py                # run_ocr_stream() generator + run_ocr() wrapper
└── ui/
    ├── styles.py               # CSS injection + branded header
    ├── sidebar.py              # Settings inputs — orchestrates device_notice + monitor
    ├── device_notice.py        # Per-hardware capability description
    ├── resource_monitor.py     # Live Activity Monitor (@st.fragment, 2 s refresh)
    ├── upload_panel.py         # Left column: upload, preview, page navigation
    ├── result_panel.py         # Right column: orchestrator
    ├── ocr_controls.py         # Extraction mode selector, prompt editor, page range
    └── ocr_runner.py           # Cancellable multi-page execution + live timer

Every file is under 175 lines. No file uses relative imports (from .x); all imports are absolute, as required by Streamlit's flat run context.


Extraction modes

Text / Formula / Table

Sends a fixed prompt to the model and streams the result back as plain text.

Mode      Prompt sent to model
Text      Text Recognition:
Formula   Formula Recognition:
Table     Table Recognition:
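Since the three fixed modes map to prompts one-to-one, the dispatch is presumably just a lookup (sketch):

```python
# Prompt per extraction mode, as listed in the table above.
MODE_PROMPTS = {
    "Text": "Text Recognition:",
    "Formula": "Formula Recognition:",
    "Table": "Table Recognition:",
}

def prompt_for(mode: str) -> str:
    """Return the fixed prompt for one of the three simple modes."""
    return MODE_PROMPTS[mode]
```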

Information Extraction

Lets you define a JSON schema; the model fills in the empty string values. Built-in presets:

  • Personal ID — name, date of birth, address, issue/expiry dates
  • Invoice — vendor, customer, line items, totals, tax
  • Receipt — store, items, subtotal, payment method
  • Business Card — name, title, company, contact details
  • Custom — free-edit text area with live JSON validation

The schema is compacted to a single line before being sent to the model.
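Compacting the schema is a one-liner with the json stdlib. A sketch (the helper name is illustrative; parsing also doubles as validation, since json.loads raises on invalid input):

```python
import json

def compact_schema(schema_text: str) -> str:
    """Validate the user's JSON schema and collapse it to a single line."""
    parsed = json.loads(schema_text)  # raises on invalid JSON -> live validation
    # separators without spaces removes all inter-token whitespace
    return json.dumps(parsed, separators=(",", ":"))
```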


Cancellation

model.generate() is a long blocking call that cannot be interrupted mid-run. Cancellation is therefore checked at two deterministic checkpoints:

  1. Before generate() — the _log callback raises OcrCancelledError when fired for "Running model.generate()…", aborting the page before the expensive call starts.
  2. Between pages — the multi-page loop checks the cancel flag before each subsequent page.

Pressing ⏹ Stop sets ocr_cancel_requested = True in session state. The current page will complete its generation (this cannot be avoided), but no further pages will start.
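The two checkpoints can be sketched as follows (hypothetical names; the real logic lives in ocr_runner.py and inference.py):

```python
class OcrCancelledError(Exception):
    """Raised at a checkpoint when the user pressed Stop."""

def run_pages(pages, run_one_page, cancel_requested):
    """Run OCR over pages, honouring both cancellation checkpoints."""
    results = []
    for i, page in enumerate(pages):
        # Checkpoint 2: between pages, skip remaining pages after Stop.
        if i > 0 and cancel_requested():
            break
        def _log(msg):
            # Checkpoint 1: abort just before the expensive generate() call.
            if msg.startswith("Running model.generate()") and cancel_requested():
                raise OcrCancelledError(msg)
        try:
            results.append(run_one_page(page, _log))
        except OcrCancelledError:
            break
    return results
```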


Configuration

All constants are in glm_ocr/config.py:

Constant                 Default           Description
DEFAULT_MODEL_ID         zai-org/GLM-OCR   HuggingFace model repo
DEFAULT_CACHE_DIR        ./models          Local model cache path
DEFAULT_MAX_NEW_TOKENS   2048              Generation budget per page
MAX_MAX_NEW_TOKENS       4096              Upper limit of sidebar slider
MPS_MIN_RAM_BYTES        16 GB             Threshold below which MPS is skipped
PDF_DPI_SCALE            2.0               PDF render scale (2× = ~144 dpi)

To reduce inference time on CPU, lower PDF_DPI_SCALE to 1.5; this renders (1.5/2.0)² ≈ 56% of the pixels (~44% fewer) with minimal quality loss for most documents.
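As a config fragment, the table above corresponds to something like:

```python
# glm_ocr/config.py (sketch): values from the table above
DEFAULT_MODEL_ID = "zai-org/GLM-OCR"   # HuggingFace model repo
DEFAULT_CACHE_DIR = "./models"         # local model cache path
DEFAULT_MAX_NEW_TOKENS = 2048          # generation budget per page
MAX_MAX_NEW_TOKENS = 4096              # upper limit of the sidebar slider
MPS_MIN_RAM_BYTES = 16 * 1024**3       # below this, MPS is skipped
PDF_DPI_SCALE = 2.0                    # 2x of 72 dpi = ~144 dpi
```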


CLI usage

inference.py doubles as a command-line tool:

# Run OCR on a single image
python glm_ocr/inference.py path/to/image.png

# Run OCR on page 2 of a PDF
python glm_ocr/inference.py document.pdf --page 1

# Use a different model or cache location
python glm_ocr/inference.py image.png --model-id zai-org/GLM-OCR --cache-dir ./models
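The flags above suggest an argparse setup along these lines (a sketch with option names taken from the examples, not the file's exact code):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Argument parser matching the CLI examples above."""
    p = argparse.ArgumentParser(description="Run GLM-OCR on an image or a PDF page.")
    p.add_argument("path", help="Path to an image or PDF file")
    p.add_argument("--page", type=int, default=0, help="0-based PDF page index")
    p.add_argument("--model-id", default="zai-org/GLM-OCR", help="HuggingFace repo")
    p.add_argument("--cache-dir", default="./models", help="Local model cache")
    return p
```

Note that --page is 0-based, which is why page 2 of a PDF is requested with --page 1.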

Troubleshooting

ModuleNotFoundError: No module named 'streamlit'
Run pip install streamlit and ensure you are in the correct virtual environment.

Model download is very slow or fails
The model is ~4 GB. Ensure you have a stable connection. Re-running streamlit run app.py will resume a partial download from the HuggingFace cache.

st.fragment causes an error
You are on Streamlit < 1.33. Run pip install --upgrade streamlit.

OCR output is truncated on dense pages
Increase Max new tokens in the sidebar slider (up to 4096). The default is 2048, which covers most pages; very dense documents may need more.

Timer shows 0.0s and freezes
This is expected behaviour: the timer is driven by the _log callback, which fires before and after model.generate(). During the generation itself (which can take several minutes on CPU) the timer shows the time at the last checkpoint, not a live wall-clock tick.

