A Streamlit web app for Optical Character Recognition powered by ZAI GLM-OCR. Upload images or PDFs and extract text, formulas, tables, or structured JSON fields — all running locally on your machine, no API key required.
Main interface — upload panel (left) and recognition panel (right)
Sidebar showing hardware notice and live Activity Monitor
Information Extraction mode with JSON schema editor and preset selector
- Four extraction modes — Text, Formula, Table, and Information Extraction (JSON schema)
- PDF support — renders every page at 2× DPI; navigate pages with a visual preview
- Information Extraction presets — Personal ID, Invoice, Receipt, Business Card, or define your own JSON schema
- Multi-page processing — run OCR on the current page, a custom range, or all pages at once
- Live streaming output — results appear line-by-line as the model generates
- Cancellable runs — a Stop button aborts before the next `model.generate()` call
- Live Activity Monitor — sidebar shows CPU %, RAM, Swap, and app memory, refreshing every 2 seconds
- Hardware-aware device selection — automatically picks CUDA, Apple MPS, or CPU based on your machine
- Download results — per-page `.txt` or a combined all-pages file
- Python 3.10+
- PyTorch 2.1+ (with CUDA or MPS support as applicable)
- Streamlit 1.33+ (required for `@st.fragment`)
```bash
# 1. Clone or copy this project
git clone <your-repo-url>
cd glm-ocr-app

# 2. Create a virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# 3. Install dependencies
pip install streamlit transformers torch torchvision pillow pymupdf psutil
```

macOS note: If you are on Apple Silicon, install the MPS-enabled build of PyTorch from pytorch.org.

```bash
streamlit run app.py
```

The app opens in your browser at http://localhost:8501.
The model (~4 GB) is downloaded from Hugging Face on first run and cached in `./models`.
The sidebar shows a live hardware notice explaining exactly what your machine will use and how fast to expect results:
| Hardware | Mode | Speed estimate |
|---|---|---|
| NVIDIA / AMD GPU (CUDA) | float16 on GPU | ~5–15 sec / page |
| Apple Silicon ≥ 16 GB RAM | bfloat16 on MPS | ~10–30 sec / page |
| Apple Silicon < 16 GB RAM | float32 on CPU | ~2–5 min / page |
| Any CPU (no GPU) | float32 on CPU | ~2–8 min / page |
Why CPU on 8 GB Apple Silicon? GLM-OCR's KV cache during model.generate() needs roughly 6 GB of one contiguous Metal memory buffer. On an 8 GB M1/M2 Mac, after the OS kernel (~2 GB) and model weights (~2–3 GB) are loaded, Metal no longer has enough space for that allocation and will abort with an OOM crash. The app detects this at startup and falls back to CPU automatically — slow but stable.
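The fallback logic above can be sketched as a pure function. This is illustrative (the real implementation lives in `glm_ocr/hardware.py` and queries `torch.cuda.is_available()`, `torch.backends.mps.is_available()`, and `psutil`); the function name and argument shape here are assumptions:

```python
MPS_MIN_RAM_BYTES = 16 * 1024**3  # threshold from config.py

def select_device(has_cuda: bool, has_mps: bool, total_ram_bytes: int) -> tuple[str, str]:
    """Pick (device, dtype) following the hardware table above."""
    if has_cuda:
        return ("cuda", "float16")
    if has_mps and total_ram_bytes >= MPS_MIN_RAM_BYTES:
        return ("mps", "bfloat16")
    # Low-RAM Apple Silicon or no GPU at all: CPU is slow but avoids the Metal OOM.
    return ("cpu", "float32")
```

Keeping the decision in a side-effect-free function like this makes the startup behaviour easy to unit-test without any GPU present.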
```
app.py                        # Streamlit entry point (~60 lines)
glm_ocr/
├── config.py                 # All tuneable constants
├── hardware.py               # RAM detection + device/dtype selection
├── model_loader.py           # HuggingFace download check + model loading
├── pdf_utils.py              # PDF bytes → list of PIL images (PyMuPDF)
├── ocr_result.py             # OcrResult dataclass (no torch dependency)
├── input_builder.py          # Tokeniser + GPU memory cleanup
├── inference.py              # run_ocr_stream() generator + run_ocr() wrapper
└── ui/
    ├── styles.py             # CSS injection + branded header
    ├── sidebar.py            # Settings inputs — orchestrates device_notice + monitor
    ├── device_notice.py      # Per-hardware capability description
    ├── resource_monitor.py   # Live Activity Monitor (@st.fragment, 2 s refresh)
    ├── upload_panel.py       # Left column: upload, preview, page navigation
    ├── result_panel.py       # Right column: orchestrator
    ├── ocr_controls.py       # Extraction mode selector, prompt editor, page range
    └── ocr_runner.py         # Cancellable multi-page execution + live timer
```
Every file is under 175 lines. No file imports from another with relative dots (`from .x`) — all imports are absolute, which is required for Streamlit's flat run context.
Sends a fixed prompt to the model and streams the result back as plain text.
| Mode | Prompt sent to model |
|---|---|
| Text | Text Recognition: |
| Formula | Formula Recognition: |
| Table | Table Recognition: |
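The table above amounts to a fixed mode→prompt mapping. A minimal sketch (the dict name and location are assumptions; the actual constant may live in `glm_ocr/config.py` or `ocr_controls.py`):

```python
# Fixed prompts for the three non-JSON extraction modes, per the table above.
MODE_PROMPTS = {
    "Text": "Text Recognition:",
    "Formula": "Formula Recognition:",
    "Table": "Table Recognition:",
}

def prompt_for_mode(mode: str) -> str:
    """Look up the prompt string sent to the model for a given mode."""
    return MODE_PROMPTS[mode]
```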
Lets you define a JSON schema; the model fills in the empty string values. Built-in presets:
- Personal ID — name, date of birth, address, issue/expiry dates
- Invoice — vendor, customer, line items, totals, tax
- Receipt — store, items, subtotal, payment method
- Business Card — name, title, company, contact details
- Custom — free-edit text area with live JSON validation
The schema is compacted to a single line before being sent to the model.
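One natural way to do this compaction is JSON serialisation with no whitespace; a minimal sketch, assuming the standard-library `json` module (the helper name is illustrative):

```python
import json

def compact_schema(schema: dict) -> str:
    """Collapse a schema dict to one line: no spaces after ',' or ':'."""
    return json.dumps(schema, separators=(",", ":"))

compact_schema({"name": "", "date_of_birth": ""})
# → '{"name":"","date_of_birth":""}'
```

A single-line schema keeps the prompt short and avoids the model echoing back stray indentation.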
model.generate() is a blocking C++ call that cannot be interrupted mid-run. Cancellation is therefore checked at two deterministic checkpoints:
- Before `generate()` — the `_log` callback raises `OcrCancelledError` when fired for `"Running model.generate()…"`, aborting the page before the expensive call starts.
- Between pages — the multi-page loop checks the cancel flag before each subsequent page.
Pressing ⏹ Stop sets ocr_cancel_requested = True in session state. The current page will complete its generation (this cannot be avoided), but no further pages will start.
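The first checkpoint can be sketched as a log callback that consults the cancel flag just before the expensive call. `OcrCancelledError` and `_log` come from the description above; `make_log` and the callable-flag shape are assumptions for illustration:

```python
class OcrCancelledError(Exception):
    """Raised at a checkpoint when the user has pressed Stop."""

def make_log(cancel_requested):
    """Build a _log callback; cancel_requested is a zero-arg callable
    returning the current value of the session-state flag."""
    def _log(message: str) -> None:
        # Checkpoint: abort only at the known message fired right
        # before model.generate(), never mid-generation.
        if message == "Running model.generate()…" and cancel_requested():
            raise OcrCancelledError(message)
    return _log
```

Checking a flag at known messages (rather than interrupting a thread) keeps cancellation deterministic: the run stops only at points where no C++ call is in flight.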
All constants are in glm_ocr/config.py:
| Constant | Default | Description |
|---|---|---|
| `DEFAULT_MODEL_ID` | `zai-org/GLM-OCR` | HuggingFace model repo |
| `DEFAULT_CACHE_DIR` | `./models` | Local model cache path |
| `DEFAULT_MAX_NEW_TOKENS` | `2048` | Generation budget per page |
| `MAX_MAX_NEW_TOKENS` | `4096` | Upper limit of the sidebar slider |
| `MPS_MIN_RAM_BYTES` | 16 GB | Threshold below which MPS is skipped |
| `PDF_DPI_SCALE` | `2.0` | PDF render scale (2× ≈ 144 dpi) |
To reduce inference time on CPU, lower `PDF_DPI_SCALE` to 1.5 — pixel count scales with the square of the render scale, so this cuts it by ~44% with minimal quality loss for most documents.
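The arithmetic behind that trade-off: dropping the scale from 2.0 to 1.5 keeps (1.5 / 2.0)² ≈ 56% of the pixels. A one-line helper makes it easy to evaluate other scales (the function is illustrative, not part of the app):

```python
def pixel_ratio(new_scale: float, old_scale: float = 2.0) -> float:
    """Fraction of pixels kept when changing the PDF render scale.
    Pixel count grows with the square of the linear scale."""
    return (new_scale / old_scale) ** 2

pixel_ratio(1.5)  # → 0.5625, i.e. ~44% fewer pixels to run through the model
```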
inference.py doubles as a command-line tool:
```bash
# Run OCR on a single image
python glm_ocr/inference.py path/to/image.png

# Run OCR on page 2 of a PDF
python glm_ocr/inference.py document.pdf --page 1

# Use a different model or cache location
python glm_ocr/inference.py image.png --model-id zai-org/GLM-OCR --cache-dir ./models
```

ModuleNotFoundError: No module named 'streamlit'
Run pip install streamlit and ensure you are in the correct virtual environment.
Model download is very slow or fails
The model is ~4 GB. Ensure you have a stable connection. Re-running streamlit run app.py will resume a partial download from the HuggingFace cache.
`st.fragment` causes an error
You are on Streamlit < 1.33. Run pip install --upgrade streamlit.
OCR output is truncated on dense pages
Increase Max new tokens in the sidebar slider (up to 4096). The default is 2048, which covers most pages; very dense documents may need more.
Timer shows 0.0s and freezes
This is expected behaviour — the timer is driven by the _log callback, which fires before and after model.generate(). During the generation itself (which can take several minutes on CPU) the timer shows the time at the last checkpoint, not a live wall-clock tick.
- ZAI GLM-OCR — the underlying vision-language model
- Streamlit — the web framework
- HuggingFace Transformers — model loading and inference
- PyMuPDF (fitz) — fast PDF rendering