feat: add hidden OCR template context for post-processing #770
evrenesat wants to merge 3 commits into cjpais:main
Hey, thanks for this. It's gonna take me a while to get to and merge this. Right now I'm mainly trying to fix stability issues. I can't say 100% this will be merged, but I do think that this is a good thing. However, it would be best if we can do this on multiple platforms.
Sure, but I don't have access to either Linux or Windows. The Mac version uses only libraries built into the operating system, so it doesn't bring in any extra dependencies in that sense. And since it's a hidden feature, it wouldn't cause any confusion for users of other platforms, so in its current form, waiting for feature parity seems a bit unnecessary? Still, with this addition Handy has become perfect for me, both feature- and reliability-wise, so I'm fine living with my own build :) Thanks for sharing and maintaining Handy! Let me know if you want me to improve this PR.
Understood, and it will probably be merged without other platform support. Ideally I'd just like to avoid limiting features to one platform where possible.
Just dropping a note: this is going to take me a while to merge, I think. I need to keep the feature set stable right now. I feel I'm drowning under PRs and issues and starting to get burned out.
Introduce a macOS-only experimental OCR template variable path for post-processing prompts.
This is intentionally a hidden feature for now: no user interface changes, no new settings surface,
and no changes to public frontend command APIs.
Architecture and runtime flow:
- Handy remains a Tauri desktop app with Rust backend (`src-tauri`) and React/TypeScript frontend (`src`).
- Post-processing prompt template expansion in `src-tauri/src/actions.rs` now supports:
  - `${output}` for transcription text
  - `${OCR}` and `${ocr}` for frontmost-window OCR context
- OCR context is only attempted when:
  - OCR template variable is present in the selected prompt
  - existing `experimental_enabled` is true
- OCR context is truncated to 8000 characters to constrain prompt growth.
- On OCR failure, missing permission, or unsupported path, OCR variable resolves to empty string;
  transcription and post-processing continue without hard failure.
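The expansion rules above can be sketched roughly as follows. This is a minimal illustration and not the code in `src-tauri/src/actions.rs`: the names `MAX_OCR_CHARS`, `truncate_chars`, and `expand_prompt` are hypothetical, and it assumes the 8000 limit counts Unicode characters rather than bytes.

```rust
/// Upper bound on OCR text injected into the prompt (8000, per the notes above).
/// Hypothetical constant name.
const MAX_OCR_CHARS: usize = 8000;

/// Truncate on a character boundary so multi-byte UTF-8 text is never split.
fn truncate_chars(text: &str, max_chars: usize) -> &str {
    match text.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &text[..byte_idx],
        None => text,
    }
}

/// Hypothetical helper: expand `${output}`, `${OCR}` and `${ocr}` in a prompt.
/// `ocr` is `None` when capture failed, permission is missing, or the platform
/// is unsupported; the placeholder then resolves to an empty string.
fn expand_prompt(template: &str, output: &str, ocr: Option<&str>) -> String {
    let ocr_text = truncate_chars(ocr.unwrap_or(""), MAX_OCR_CHARS);
    // Sequential replacement keeps the sketch short. A single-pass scanner
    // would avoid re-expanding a placeholder that appears inside the OCR text.
    template
        .replace("${OCR}", ocr_text)
        .replace("${ocr}", ocr_text)
        .replace("${output}", output)
}
```

For example, `expand_prompt("Context: ${ocr}\nFix: ${output}", "helo wrld", None)` leaves the context empty and still produces a usable prompt, which mirrors the "no hard failure" behavior described above.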
macOS OCR subsystem:
- Add Swift bridge:
  - `src-tauri/swift/macos_ocr.swift`
  - `src-tauri/swift/macos_ocr_bridge.h`
- Add Rust FFI wrapper:
  - `src-tauri/src/macos_ocr.rs`
- Wire build integration in `src-tauri/build.rs` with macOS-only compilation/linking.
- Link required Apple frameworks: Foundation, AppKit, CoreGraphics, Vision.
- Request screen recording permission on demand once per app run when OCR is first needed.
- Capture frontmost window, OCR via Vision, inject text into prompt template.
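The "once per app run" permission request could be modeled along these lines. This is a sketch under assumptions: `ensure_screen_capture_permission` is a hypothetical name, and the `request` closure stands in for the call into the Swift bridge, which is not shown here.

```rust
use std::sync::OnceLock;

/// Cached answer of the first permission request made by this process.
static SCREEN_CAPTURE_GRANTED: OnceLock<bool> = OnceLock::new();

/// Hypothetical wrapper: run `request` at most once per app run and reuse
/// its answer afterwards. In a real build `request` would call the macOS
/// bridge; a closure keeps this sketch portable.
fn ensure_screen_capture_permission(request: impl FnOnce() -> bool) -> bool {
    *SCREEN_CAPTURE_GRANTED.get_or_init(request)
}
```

Note that in this sketch a denial is cached as well, so the user is not prompted again until the app restarts.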
Cross-platform safety:
- Entire OCR bridge and runtime usage is gated for macOS.
- Non-macOS behavior is unchanged; OCR placeholders resolve safely to empty context.
- Existing speech-to-text and provider flows are preserved.
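The platform gating can be pictured as a pair of `cfg`-selected functions with the same signature. This is only an illustration: `fetch_ocr_context` and `ocr_template_value` are hypothetical names, and the macOS body is a placeholder because the real FFI call is not reproduced here.

```rust
/// macOS build: would delegate to the Swift/Vision bridge.
/// Placeholder body; the real bridge call is not shown in this sketch.
#[cfg(target_os = "macos")]
fn fetch_ocr_context() -> Option<String> {
    None
}

/// Every other platform: no capture is attempted at all.
#[cfg(not(target_os = "macos"))]
fn fetch_ocr_context() -> Option<String> {
    None
}

/// Value substituted for the OCR placeholders; empty when nothing is available.
fn ocr_template_value() -> String {
    fetch_ocr_context().unwrap_or_default()
}
```

Because both variants share one signature, the caller never needs its own `cfg` checks, and non-macOS builds do not compile or link any bridge code.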
Intent:
- Improve post-processing quality by adding optional visual context from active window text,
especially for correcting terminology/typos from dictation misunderstandings.
- Keep this as an experimental hidden capability first, to validate impact before any UI exposure.
7c11a5a to c5eb176
Implement Linux support for OCR prompt variables using Tesseract.
Architecture:
- Add `linux_ocr` provider and wire it into `fetch_ocr_template_value`.
- Keep X11 as the primary path (active window to root fallback via Xlib).
- Add Wayland fallback strategy based on startup-time environment detection.
  - GNOME path: `gnome-screenshot -f <temp.png>`
  - KDE path: `spectacle -b -n -o <temp.png>`
- Initialize and cache backend selection at app startup to avoid repeated checks during hot paths.
Why this implementation:
- Keeps existing fast X11 behavior unchanged.
- Provides practical Wayland compatibility without portal or D-Bus integration complexity in this PR.
- Avoids blocking Linux support on compositor-specific APIs and keeps runtime dependency surface simple.
Limitations:
- Wayland capture currently depends on external screenshot tools being installed (`gnome-screenshot` or `spectacle`).
- Active-window fidelity on Wayland is tool/compositor-dependent and not equivalent to direct X11 window capture.
- This is an interim compatibility path, not a full portal-based capture backend.
Future improvements noted:
- Move Wayland capture to xdg-desktop-portal for compositor-agnostic behavior.
- Consider GNOME D-Bus screenshot integration (`org.gnome.Shell.Screenshot`) for better window-targeted capture where appropriate.
Also updates:
- Install `tesseract-ocr` in Ubuntu CI build jobs.
- Add Debian package dependency for `tesseract-ocr`.
- Update architecture and devlog documentation.
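A rough sketch of startup-time backend selection with caching is shown below. It is not the code in `src-tauri/src/linux_ocr.rs`: the enum and function names are hypothetical, and reading `XDG_SESSION_TYPE` and `XDG_CURRENT_DESKTOP` is an assumption about how the environment might be detected. Only the two screenshot command lines come from the commit message.

```rust
use std::env;
use std::sync::OnceLock;

/// Hypothetical set of capture backends.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CaptureBackend {
    X11,
    GnomeScreenshot,
    Spectacle,
    Unavailable,
}

/// Pure selection logic, kept apart from env access so it is easy to test.
/// Assumption: anything that is not a Wayland session takes the X11 path.
fn select_backend(session_type: &str, desktop: &str) -> CaptureBackend {
    if !session_type.eq_ignore_ascii_case("wayland") {
        return CaptureBackend::X11;
    }
    let desktop = desktop.to_ascii_lowercase();
    if desktop.contains("gnome") {
        CaptureBackend::GnomeScreenshot
    } else if desktop.contains("kde") {
        CaptureBackend::Spectacle
    } else {
        CaptureBackend::Unavailable
    }
}

static BACKEND: OnceLock<CaptureBackend> = OnceLock::new();

/// Detect once and reuse the result, so hot paths never re-read the environment.
fn cached_backend() -> CaptureBackend {
    *BACKEND.get_or_init(|| {
        let session = env::var("XDG_SESSION_TYPE").unwrap_or_default();
        let desktop = env::var("XDG_CURRENT_DESKTOP").unwrap_or_default();
        select_backend(&session, &desktop)
    })
}

/// Program and arguments for the external screenshot tool, if one applies.
fn screenshot_command(
    backend: CaptureBackend,
    png_path: &str,
) -> Option<(&'static str, Vec<String>)> {
    match backend {
        CaptureBackend::GnomeScreenshot => Some((
            "gnome-screenshot",
            vec!["-f".to_string(), png_path.to_string()],
        )),
        CaptureBackend::Spectacle => Some((
            "spectacle",
            vec![
                "-b".to_string(),
                "-n".to_string(),
                "-o".to_string(),
                png_path.to_string(),
            ],
        )),
        CaptureBackend::X11 | CaptureBackend::Unavailable => None,
    }
}
```

Calling `cached_backend()` once during app startup would match the "initialize and cache at startup" note; the returned program and arguments can then be passed to `std::process::Command` when a capture is needed.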
@cjpais I have added implementations for all platforms in this PR now. The Linux one is a bit hacky, but for now that's the only solution I could come up with.
Summary
This PR introduces an experimental OCR context path for post-processing prompts across supported desktop platforms.
When the selected post-processing prompt contains `${OCR}` or `${ocr}`, Handy captures text from the frontmost app context (OCR) and injects it into the prompt template before sending it to the post-processing model.
This is intentionally a hidden/experimental feature for now.
Note: If guided, I can make follow-up changes needed for merge readiness (including UI exposure if desired).
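To make the trigger condition concrete, here is a small sketch of how the capture could be kept lazy. The function names are hypothetical, and `capture` stands in for whichever platform OCR call applies; it is only invoked when the prompt references an OCR variable and the `experimental_enabled` setting is on.

```rust
/// True when the prompt references either spelling of the OCR variable.
fn prompt_wants_ocr(prompt: &str) -> bool {
    prompt.contains("${OCR}") || prompt.contains("${ocr}")
}

/// Hypothetical gate: `capture` is only called when both conditions hold,
/// so prompts without an OCR variable never trigger a screen capture.
fn ocr_value_for_prompt(
    prompt: &str,
    experimental_enabled: bool,
    capture: impl FnOnce() -> Option<String>,
) -> String {
    if experimental_enabled && prompt_wants_ocr(prompt) {
        // A failed capture degrades to an empty string rather than an error.
        capture().unwrap_or_default()
    } else {
        String::new()
    }
}
```

Passing the capture as a closure means users who never add `${OCR}` to a prompt never hit the permission request or the OCR cost.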
Intent
The goal is to improve post-processing quality by adding optional visual context from the active window/screen, especially for fixing terminology and typo-like dictation misunderstandings.
I’m using the following addition to the default prompt and getting better results than non-OCR post-processing:
I intentionally kept scope minimal and hidden to validate usefulness first, without increasing UI complexity or feature surface.
Architecture / Flow
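- Handy remains a Tauri desktop app (Rust backend in `src-tauri`, React/TS frontend in `src`).
- Post-processing prompt template expansion in `src-tauri/src/actions.rs` now supports:
  - `${output}` (existing transcript variable)
  - `${OCR}` / `${ocr}` (new OCR context variable)
- OCR context is only attempted when the selected prompt contains an OCR variable and the existing `experimental_enabled` setting is true.
Platform Notes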
macOS
- `src-tauri/swift/macos_ocr.swift`
- `src-tauri/swift/macos_ocr_bridge.h`
- `src-tauri/src/macos_ocr.rs`
- `src-tauri/build.rs` (bridge build/link wiring)
- `src-tauri/src/lib.rs` (module registration, macOS-gated)
- `src-tauri/src/actions.rs` (template expansion + OCR integration + tests)
Windows
- `${OCR}` / `${ocr}` in post-processing flow.
- `src-tauri/src/windows_ocr.rs` for frontmost-window OCR capture on Windows.
- `src-tauri/src/actions.rs`.
Linux
- `src-tauri/src/linux_ocr.rs`
- GNOME path: `gnome-screenshot -f <temp.png>`
- KDE path: `spectacle -b -n -o <temp.png>`
- Install `tesseract-ocr` in Ubuntu CI jobs (`.github/workflows/build.yml`)
- Debian package dependency on `tesseract-ocr` (`src-tauri/tauri.conf.json`)
- `x11` crate dependency in `src-tauri/Cargo.toml`
- `org.gnome.Shell.Screenshot` D-Bus path is noted as a potential follow-up for better window-targeted behavior.
Cross-platform Safety
Prior Art Check
I searched upstream issues/PRs for OCR-related work and did not find a matching implementation.
Testing
- `src-tauri/src/actions.rs` for OCR template resolution and truncation behavior.
- `src-tauri/src/linux_ocr.rs` for backend selection and parsing logic.
- `@Generable` macro tooling may fail independently; this PR does not change Apple Intelligence behavior.
AI Assistance Disclosure