feat: add hidden OCR template context for post-processing by evrenesat · Pull Request #770 · cjpais/Handy

evrenesat · 2026-02-10T16:25:57Z

Summary

This PR introduces an experimental OCR context path for post-processing prompts across supported desktop platforms.

When the selected post-processing prompt contains ${OCR} or ${ocr}, Handy captures text from the frontmost app context (OCR) and injects it into the prompt template before sending to the post-processing model.

This is intentionally a hidden/experimental feature for now:

No UI changes
No new settings UI
No public frontend command/API changes

Note: If guided, I can make follow-up changes needed for merge readiness (including UI exposure if desired).

Intent

The goal is to improve post-processing quality by adding optional visual context from the active window/screen, especially for fixing terminology and typo-like dictation misunderstandings.

I’m using the following addition to the default prompt and getting better results than non-OCR post-processing:

Relevant context (OCR’ed from active window, use for fixing typos and terminology):
${OCR}

I intentionally kept scope minimal and hidden to validate usefulness first, without increasing UI complexity or feature surface.

Architecture / Flow

Existing backend/frontend architecture remains unchanged (Rust backend in src-tauri, React/TS frontend in src).
Prompt template expansion in src-tauri/src/actions.rs now supports:
- ${output} (existing transcript variable)
- ${OCR} / ${ocr} (new OCR context variable)
OCR is attempted only when:
1. OCR variable exists in the selected prompt
2. Existing experimental_enabled setting is true
OCR text is truncated to 8000 chars to limit prompt growth.
On OCR failure, missing permission, or unsupported conditions, OCR resolves to an empty string and processing continues safely.
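
The expansion rules above can be sketched as follows. This is a minimal illustration, not the code in src-tauri/src/actions.rs: the function names, the closure-based `capture_ocr` provider, and the single-pass substitution are all assumptions made for the example. Only the variable names, the `experimental_enabled` gate, the 8000-character cap, and the empty-string fallback come from the description.

```rust
/// Cap on OCR characters injected into the prompt (value taken from the PR description).
const MAX_OCR_CHARS: usize = 8000;

/// True when the prompt references the OCR variable in either casing.
fn prompt_uses_ocr(prompt: &str) -> bool {
    prompt.contains("${OCR}") || prompt.contains("${ocr}")
}

/// Truncates by characters, so multi-byte text is never split mid-character.
fn truncate_chars(text: &str, max_chars: usize) -> String {
    text.chars().take(max_chars).collect()
}

/// Single-pass substitution: inserted values are never rescanned, so a
/// transcript that happens to contain "${OCR}" is left as literal text.
fn substitute(prompt: &str, transcript: &str, ocr_text: &str) -> String {
    let mut out = String::with_capacity(prompt.len() + transcript.len() + ocr_text.len());
    let mut rest = prompt;
    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        let after = &rest[start..];
        if let Some(tail) = after.strip_prefix("${output}") {
            out.push_str(transcript);
            rest = tail;
        } else if let Some(tail) = after
            .strip_prefix("${OCR}")
            .or_else(|| after.strip_prefix("${ocr}"))
        {
            out.push_str(ocr_text);
            rest = tail;
        } else {
            // Unknown variable: keep it verbatim and continue scanning.
            out.push_str("${");
            rest = &after[2..];
        }
    }
    out.push_str(rest);
    out
}

/// Hypothetical entry point. `capture_ocr` stands in for the platform provider
/// and runs only if the prompt uses the variable and the experimental flag is on.
fn expand_prompt<F>(prompt: &str, transcript: &str, experimental_enabled: bool, capture_ocr: F) -> String
where
    F: FnOnce() -> Result<String, String>,
{
    let ocr_text = if experimental_enabled && prompt_uses_ocr(prompt) {
        // Any OCR error degrades to an empty string instead of aborting.
        capture_ocr()
            .map(|text| truncate_chars(&text, MAX_OCR_CHARS))
            .unwrap_or_default()
    } else {
        String::new()
    };
    substitute(prompt, transcript, &ocr_text)
}
```

Passing the provider as a closure keeps the capture lazy, so prompts without the variable never trigger a screenshot or a permission prompt.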

Platform Notes

macOS

Added:
- src-tauri/swift/macos_ocr.swift
- src-tauri/swift/macos_ocr_bridge.h
- src-tauri/src/macos_ocr.rs
Updated:
- src-tauri/build.rs (bridge build/link wiring)
- src-tauri/src/lib.rs (module registration, macOS-gated)
- src-tauri/src/actions.rs (template expansion + OCR integration + tests)
Frameworks linked:
- Foundation
- AppKit
- CoreGraphics
- Vision
Permission behavior:
- Screen recording permission requested on demand
- Request-at-most-once per app run
- Graceful fallback to empty OCR context if denied/unavailable
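
The request-at-most-once behavior could look roughly like the sketch below. `PermissionGate` and both closures are hypothetical names; the PR text does not say which system calls it uses (on macOS, CoreGraphics offers `CGPreflightScreenCaptureAccess` and `CGRequestScreenCaptureAccess` for this purpose, but that mapping is an assumption here).

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Remembers whether the permission prompt was already shown in this app run.
struct PermissionGate {
    requested: AtomicBool,
}

impl PermissionGate {
    const fn new() -> Self {
        Self { requested: AtomicBool::new(false) }
    }

    /// Returns true if capture may proceed.
    /// `has_permission` and `request_permission` are hypothetical stand-ins
    /// for the platform's "check" and "prompt the user" calls.
    fn ensure(
        &self,
        has_permission: impl Fn() -> bool,
        request_permission: impl FnOnce() -> bool,
    ) -> bool {
        if has_permission() {
            return true;
        }
        // `swap` returns the previous value, so only the first caller sees `false`.
        if self.requested.swap(true, Ordering::SeqCst) {
            return false; // Already asked once this run: fall back to empty OCR context.
        }
        request_permission()
    }
}

/// One gate for the whole process, matching "once per app run".
static SCREEN_CAPTURE_GATE: PermissionGate = PermissionGate::new();
```

A caller that gets `false` back would simply skip capture and use an empty OCR context.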

Windows

Added Windows OCR provider integration for ${OCR} / ${ocr} in post-processing flow.
Uses src-tauri/src/windows_ocr.rs for frontmost-window OCR capture on Windows.
Wired into shared prompt-template expansion in src-tauri/src/actions.rs.
Behavior is fail-safe: OCR errors resolve to empty context and do not break transcription/post-processing flow.

Linux

Added:
- src-tauri/src/linux_ocr.rs
Linux OCR strategy:
- X11 session: active-window X11 capture path (with root-window fallback), then Tesseract.
- Wayland session: screenshot-tool fallback, then Tesseract.
Wayland fallback tools:
- GNOME preferred: gnome-screenshot -f <temp.png>
- KDE preferred: spectacle -b -n -o <temp.png>
- NOTE: Unlike the other implementations, these CLI tools take full-screen screenshots rather than capturing only the active window.
Backend selection is detected once at app startup and cached.
Packaging/build updates:
- Install tesseract-ocr in Ubuntu CI jobs (.github/workflows/build.yml)
- Add Debian runtime dependency tesseract-ocr (src-tauri/tauri.conf.json)
- Add Linux x11 crate dependency in src-tauri/Cargo.toml
Current limitation:
- Wayland path depends on screenshot tools being installed.
- This is an interim pragmatic implementation; portal/D-Bus integration can improve this later.
- GNOME’s org.gnome.Shell.Screenshot D-Bus path is noted as a potential follow-up for better window-targeted behavior.
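
As a rough sketch of the Linux strategy described above, backend selection can be a pure function whose result is cached in a `OnceLock`. The enum, the function names, and the use of the `XDG_SESSION_TYPE` and `XDG_CURRENT_DESKTOP` environment variables are assumptions for illustration; the PR text only says detection happens once at startup. The screenshot arguments are the ones listed in the description. Treating unknown Wayland desktops as unsupported is a simplification.

```rust
use std::sync::OnceLock;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CaptureBackend {
    X11,
    GnomeScreenshot,
    Spectacle,
    Unsupported,
}

/// Pure selection logic, testable without touching the real environment.
fn select_backend(session_type: Option<&str>, desktop: Option<&str>) -> CaptureBackend {
    match session_type.map(str::to_ascii_lowercase).as_deref() {
        Some("x11") => CaptureBackend::X11,
        Some("wayland") => {
            // Desktop strings can be compound, e.g. "ubuntu:GNOME".
            let desktop = desktop.unwrap_or("").to_ascii_lowercase();
            if desktop.contains("gnome") {
                CaptureBackend::GnomeScreenshot
            } else if desktop.contains("kde") {
                CaptureBackend::Spectacle
            } else {
                CaptureBackend::Unsupported
            }
        }
        _ => CaptureBackend::Unsupported,
    }
}

static BACKEND: OnceLock<CaptureBackend> = OnceLock::new();

/// Reads the environment on first use only; later calls return the cached value.
fn cached_backend() -> CaptureBackend {
    *BACKEND.get_or_init(|| {
        let session = std::env::var("XDG_SESSION_TYPE").ok();
        let desktop = std::env::var("XDG_CURRENT_DESKTOP").ok();
        select_backend(session.as_deref(), desktop.as_deref())
    })
}

/// Builds the Wayland screenshot command for a backend, if it has one.
fn screenshot_command(backend: CaptureBackend, png_path: &str) -> Option<(&'static str, Vec<String>)> {
    match backend {
        CaptureBackend::GnomeScreenshot => {
            Some(("gnome-screenshot", vec!["-f".to_string(), png_path.to_string()]))
        }
        CaptureBackend::Spectacle => Some((
            "spectacle",
            vec!["-b".to_string(), "-n".to_string(), "-o".to_string(), png_path.to_string()],
        )),
        // X11 captures in-process; unsupported sessions yield no command.
        CaptureBackend::X11 | CaptureBackend::Unsupported => None,
    }
}
```

Keeping `select_backend` free of environment access is what makes backend selection unit-testable, in the spirit of the tests mentioned under Testing.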

Cross-platform Safety

OCR providers are OS-gated.
Existing transcription and post-processing flows are preserved.
Unsupported/failing OCR paths degrade gracefully to empty OCR context.
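
One way to express this OS gating in Rust is with `cfg` attributes, so that exactly one provider function is compiled per target. The sketch below uses stub providers that always fail; `platform_ocr` and `ocr_context_or_empty` are hypothetical names, and the real providers live in the platform modules listed above.

```rust
// Exactly one `platform_ocr` is compiled for any given target.
// Each body is a stub standing in for the real provider call.

#[cfg(target_os = "macos")]
fn platform_ocr() -> Result<String, String> {
    Err("macOS provider is not wired into this sketch".to_string())
}

#[cfg(target_os = "windows")]
fn platform_ocr() -> Result<String, String> {
    Err("Windows provider is not wired into this sketch".to_string())
}

#[cfg(target_os = "linux")]
fn platform_ocr() -> Result<String, String> {
    Err("Linux provider is not wired into this sketch".to_string())
}

#[cfg(not(any(target_os = "macos", target_os = "windows", target_os = "linux")))]
fn platform_ocr() -> Result<String, String> {
    Err("OCR is not supported on this platform".to_string())
}

/// Shared call site: any provider error becomes an empty OCR context,
/// so transcription and post-processing continue unchanged.
fn ocr_context_or_empty() -> String {
    platform_ocr().unwrap_or_default()
}
```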

Prior Art Check

I searched upstream issues/PRs for OCR-related work and did not find a matching implementation.

Testing

Added/updated unit tests in src-tauri/src/actions.rs for OCR template resolution and truncation behavior.
Added Linux OCR unit tests in src-tauri/src/linux_ocr.rs for backend selection and parsing logic.
Frontend build and tests pass locally.
Linux build/test and AppImage build were verified in Ubuntu VM.
Note: in CommandLineTools-only macOS environments, Apple Intelligence @Generable macro tooling may fail independently; this PR does not change Apple Intelligence behavior.

AI Assistance Disclosure

AI used: Yes
Tools used: GPT-5 Codex
Usage: Extensive assistance across implementation, debugging, build validation, and documentation

cjpais · 2026-02-11T00:48:31Z

Hey, thanks for this. It's gonna take me a while to get to this and merge it. Right now I'm mainly trying to fix stability issues.

I can't say 100% this will be merged, but I do think that this is a good thing. However, it would be best if we can do this on multiple platforms.

evrenesat · 2026-02-12T06:51:14Z

it would be best if we can do this on multiple platforms.

Sure, but I don't have access to either Linux or Windows. The Mac version relies entirely on the operating system's built-in libraries, so it doesn't bring in any additional dependencies. And since it's a hidden feature, it wouldn't cause any confusion for users on other platforms, so in its current form, waiting for feature parity seems a bit unnecessary?

Still, with this addition Handy has become perfect for me, both feature-wise and reliability-wise, so I'm fine living with my own build :)

Thanks for sharing and maintaining Handy! Let me know if you want me to improve this PR.

cjpais · 2026-02-12T07:00:45Z

Understood, and it will probably be merged without support for the other platforms. Ideally I'd just like to avoid limiting features to one platform where possible.

cjpais · 2026-02-16T13:34:49Z

Just dropping a note: this is going to take me a while to merge, I think. I need to keep the feature set stable right now. I feel I'm drowning under PRs and issues and starting to get burned out.

Introduce a macOS-only experimental OCR template variable path for post-processing prompts.

This is intentionally a hidden feature for now: no user interface changes, no new settings surface, and no changes to public frontend command APIs.

Architecture and runtime flow:
- Handy remains a Tauri desktop app with Rust backend (`src-tauri`) and React/TypeScript frontend (`src`).
- Post-processing prompt template expansion in `src-tauri/src/actions.rs` now supports:
  - `${output}` for transcription text
  - `${OCR}` and `${ocr}` for frontmost-window OCR context
- OCR context is only attempted when:
  - OCR template variable is present in the selected prompt
  - existing `experimental_enabled` is true
- OCR context is truncated to 8000 characters to constrain prompt growth.
- On OCR failure, missing permission, or unsupported path, OCR variable resolves to empty string; transcription and post-processing continue without hard failure.

macOS OCR subsystem:
- Add Swift bridge:
  - `src-tauri/swift/macos_ocr.swift`
  - `src-tauri/swift/macos_ocr_bridge.h`
- Add Rust FFI wrapper:
  - `src-tauri/src/macos_ocr.rs`
- Wire build integration in `src-tauri/build.rs` with macOS-only compilation/linking.
- Link required Apple frameworks: Foundation, AppKit, CoreGraphics, Vision.
- Request screen recording permission on demand once per app run when OCR is first needed.
- Capture frontmost window, OCR via Vision, inject text into prompt template.

Cross-platform safety:
- Entire OCR bridge and runtime usage is gated for macOS.
- Non-macOS behavior is unchanged; OCR placeholders resolve safely to empty context.
- Existing speech-to-text and provider flows are preserved.

Intent:
- Improve post-processing quality by adding optional visual context from active window text, especially for correcting terminology/typos from dictation misunderstandings.
- Keep this as an experimental hidden capability first, to validate impact before any UI exposure.

Implement Linux support for OCR prompt variables using Tesseract.

Architecture:
- Add linux_ocr provider and wire it into fetch_ocr_template_value.
- Keep X11 as the primary path (active window to root fallback via Xlib).
- Add Wayland fallback strategy based on startup-time environment detection.
  - GNOME path: gnome-screenshot -f <temp.png>
  - KDE path: spectacle -b -n -o <temp.png>
- Initialize and cache backend selection at app startup to avoid repeated checks during hot paths.

Why this implementation:
- Keeps existing fast X11 behavior unchanged.
- Provides practical Wayland compatibility without portal or D-Bus integration complexity in this PR.
- Avoids blocking Linux support on compositor-specific APIs and keeps runtime dependency surface simple.

Limitations:
- Wayland capture currently depends on external screenshot tools being installed (gnome-screenshot or spectacle).
- Active-window fidelity on Wayland is tool/compositor-dependent and not equivalent to direct X11 window capture.
- This is an interim compatibility path, not a full portal-based capture backend.

Future improvements noted:
- Move Wayland capture to xdg-desktop-portal for compositor-agnostic behavior.
- Consider GNOME D-Bus screenshot integration (org.gnome.Shell.Screenshot) for better window-targeted capture where appropriate.

Also updates:
- Install tesseract-ocr in Ubuntu CI build jobs.
- Add Debian package dependency for tesseract-ocr.
- Update architecture and devlog documentation.

evrenesat · 2026-02-21T22:20:44Z

@cjpais I have now added implementations for all platforms in this PR. The Linux one is a bit hacky, but for now that's the only solution I could come up with.

evrenesat mentioned this pull request Feb 12, 2026

feat(windows): add hidden OCR context capture from frontmost window for post-processing #808

Closed

cjpais mentioned this pull request Feb 17, 2026

added more 3 vars for LLM post-processing transcript #704

Open

4 tasks

evrenesat added 2 commits February 21, 2026 13:53

feat(windows): add OCR context capture for frontmost window

c5eb176

evrenesat force-pushed the codex/macos-ocr-hidden-context branch from 7c11a5a to c5eb176 Compare February 21, 2026 12:56

evrenesat changed the title ~~feat(macos): add hidden OCR template context for post-processing~~ feat: add hidden OCR template context for post-processing Feb 21, 2026

evrenesat wants to merge 3 commits into cjpais:main from evrenesat:codex/macos-ocr-hidden-context
