
feat: add hidden OCR template context for post-processing #770

Open
evrenesat wants to merge 3 commits into cjpais:main from evrenesat:codex/macos-ocr-hidden-context


@evrenesat

@evrenesat evrenesat commented Feb 10, 2026

Summary

This PR introduces an experimental OCR context path for post-processing prompts across supported desktop platforms.

When the selected post-processing prompt contains ${OCR} or ${ocr}, Handy captures text from the frontmost app context (OCR) and injects it into the prompt template before sending to the post-processing model.

This is intentionally a hidden/experimental feature for now:

  • No UI changes
  • No new settings UI
  • No public frontend command/API changes

Note: If guided, I can make follow-up changes needed for merge readiness (including UI exposure if desired).

Intent

The goal is to improve post-processing quality by adding optional visual context from the active window/screen, especially for fixing terminology and typo-like dictation misunderstandings.

I’m using the following addition to the default prompt and getting better results than non-OCR post-processing:

Relevant context (OCR’ed from active window, use for fixing typos and terminology):
${OCR}

I intentionally kept scope minimal and hidden to validate usefulness first, without increasing UI complexity or feature surface.

Architecture / Flow

  • Existing backend/frontend architecture remains unchanged (Rust backend in src-tauri, React/TS frontend in src).
  • Prompt template expansion in src-tauri/src/actions.rs now supports:
    • ${output} (existing transcript variable)
    • ${OCR} / ${ocr} (new OCR context variable)
  • OCR is attempted only when:
    1. OCR variable exists in the selected prompt
    2. Existing experimental_enabled setting is true
  • OCR text is truncated to 8000 chars to limit prompt growth.
  • On OCR failure, missing permission, or unsupported conditions, OCR resolves to an empty string and processing continues safely.
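
The gating and fallback rules above can be sketched roughly as follows. This is a minimal illustration, not the code in src-tauri/src/actions.rs: `expand_prompt`, `uses_ocr`, `MAX_OCR_CHARS`, and the `capture` closure are hypothetical names, and the sketch assumes an unused or failed OCR variable is replaced by an empty string.

```rust
/// Maximum number of characters of OCR text injected into a prompt.
const MAX_OCR_CHARS: usize = 8000;

/// Returns true when the template references the OCR variable in either case.
fn uses_ocr(template: &str) -> bool {
    template.contains("${OCR}") || template.contains("${ocr}")
}

/// Expands `${output}` and `${OCR}` / `${ocr}` in a single pass.
/// `capture` is a hypothetical stand-in for the platform OCR provider; it is
/// only called when the template uses the OCR variable and the flag is on.
fn expand_prompt<F>(
    template: &str,
    transcript: &str,
    experimental_enabled: bool,
    capture: F,
) -> String
where
    F: FnOnce() -> Result<String, String>,
{
    let raw = if experimental_enabled && uses_ocr(template) {
        // Any provider failure degrades to an empty context.
        capture().unwrap_or_default()
    } else {
        String::new()
    };
    // Truncate by characters, not bytes, so multi-byte text is never split.
    let ocr: String = raw.chars().take(MAX_OCR_CHARS).collect();

    // Single pass: substituted text is never scanned again for placeholders.
    let mut out = String::with_capacity(template.len());
    let mut rest = template;
    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        let after = &rest[start..];
        if let Some(tail) = after.strip_prefix("${output}") {
            out.push_str(transcript);
            rest = tail;
        } else if let Some(tail) = after
            .strip_prefix("${OCR}")
            .or_else(|| after.strip_prefix("${ocr}"))
        {
            out.push_str(&ocr);
            rest = tail;
        } else {
            // Unknown placeholder: keep it verbatim.
            out.push_str("${");
            rest = &after[2..];
        }
    }
    out.push_str(rest);
    out
}
```

Expanding in one pass means a transcript or OCR text that happens to contain `${...}` is inserted literally instead of being expanded again.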

Platform Notes

macOS
  • Added:
    • src-tauri/swift/macos_ocr.swift
    • src-tauri/swift/macos_ocr_bridge.h
    • src-tauri/src/macos_ocr.rs
  • Updated:
    • src-tauri/build.rs (bridge build/link wiring)
    • src-tauri/src/lib.rs (module registration, macOS-gated)
    • src-tauri/src/actions.rs (template expansion + OCR integration + tests)
  • Frameworks linked:
    • Foundation
    • AppKit
    • CoreGraphics
    • Vision
  • Permission behavior:
    • Screen recording permission requested on demand
    • Request-at-most-once per app run
    • Graceful fallback to empty OCR context if denied/unavailable
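
The request-at-most-once behavior can be approximated with a process-wide flag. The sketch below is an assumption about the shape of the logic, not the actual macOS code: `ocr_allowed`, `has_permission`, and `request` are hypothetical stand-ins for the platform calls.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Tracks whether the permission prompt was already triggered in this process.
static PERMISSION_REQUESTED: AtomicBool = AtomicBool::new(false);

/// Returns true only for the first caller in the process lifetime.
fn should_request_permission() -> bool {
    !PERMISSION_REQUESTED.swap(true, Ordering::SeqCst)
}

/// Decides whether OCR may run. `has_permission` and `request` are
/// hypothetical closures wrapping the platform permission check and prompt.
fn ocr_allowed(has_permission: impl Fn() -> bool, request: impl FnOnce()) -> bool {
    if has_permission() {
        return true;
    }
    if should_request_permission() {
        // First denial in this run: trigger the system prompt once.
        request();
    }
    // Re-check; if still denied the caller falls back to empty OCR context.
    has_permission()
}
```

Using `swap` makes the check and the update a single atomic step, so two concurrent callers cannot both trigger the prompt.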

Windows
  • Added Windows OCR provider integration for ${OCR} / ${ocr} in post-processing flow.
  • Uses src-tauri/src/windows_ocr.rs for frontmost-window OCR capture on Windows.
  • Wired into shared prompt-template expansion in src-tauri/src/actions.rs.
  • Behavior is fail-safe: OCR errors resolve to empty context and do not break transcription/post-processing flow.

Linux
  • Added:
    • src-tauri/src/linux_ocr.rs
  • Linux OCR strategy:
    • X11 session: active-window X11 capture path (with root-window fallback), then Tesseract.
    • Wayland session: screenshot-tool fallback, then Tesseract.
  • Wayland fallback tools:
    • GNOME preferred: gnome-screenshot -f <temp.png>
    • KDE preferred: spectacle -b -n -o <temp.png>
    • NOTE: Unlike the other platform implementations, these CLI tools capture the full screen rather than only the active window.
  • Backend selection is detected once at app startup and cached.
  • Packaging/build updates:
    • Install tesseract-ocr in Ubuntu CI jobs (.github/workflows/build.yml)
    • Add Debian runtime dependency tesseract-ocr (src-tauri/tauri.conf.json)
    • Add Linux x11 crate dependency in src-tauri/Cargo.toml
  • Current limitations:
    • Wayland path depends on screenshot tools being installed.
    • This is an interim pragmatic implementation; portal/D-Bus integration can improve this later.
    • GNOME’s org.gnome.Shell.Screenshot D-Bus path is noted as a potential follow-up for better window-targeted behavior.
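
Detecting the backend once and caching it could look something like this. It is a sketch under assumptions: the `CaptureBackend` enum and both function names are hypothetical, and it assumes detection is driven by the standard `XDG_SESSION_TYPE` and `XDG_CURRENT_DESKTOP` environment variables, which the actual src-tauri/src/linux_ocr.rs may or may not use.

```rust
use std::sync::OnceLock;

/// Hypothetical set of capture strategies described in the Linux notes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CaptureBackend {
    X11,          // direct active-window capture
    WaylandGnome, // gnome-screenshot fallback
    WaylandKde,   // spectacle fallback
    Unsupported,  // OCR resolves to empty context
}

/// Pure selection logic, kept separate from the environment for testability.
fn select_backend(session_type: Option<&str>, desktop: Option<&str>) -> CaptureBackend {
    let session = session_type.unwrap_or("").to_ascii_lowercase();
    let desktop = desktop.unwrap_or("").to_ascii_lowercase();
    match session.as_str() {
        "x11" => CaptureBackend::X11,
        "wayland" if desktop.contains("gnome") => CaptureBackend::WaylandGnome,
        "wayland" if desktop.contains("kde") => CaptureBackend::WaylandKde,
        _ => CaptureBackend::Unsupported,
    }
}

static BACKEND: OnceLock<CaptureBackend> = OnceLock::new();

/// Reads the environment on first use and returns the cached value afterwards.
fn cached_backend() -> CaptureBackend {
    *BACKEND.get_or_init(|| {
        let session = std::env::var("XDG_SESSION_TYPE").ok();
        let desktop = std::env::var("XDG_CURRENT_DESKTOP").ok();
        select_backend(session.as_deref(), desktop.as_deref())
    })
}
```

Calling `cached_backend()` during app setup would pin the decision before the first dictation, so the hot path never touches the environment.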

Cross-platform Safety

  • OCR providers are OS-gated.
  • Existing transcription and post-processing flows are preserved.
  • Unsupported/failing OCR paths degrade gracefully to empty OCR context.
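
For illustration, OS gating of this kind is usually expressed with `cfg` attributes. In the sketch below every provider is a stub that returns an error, and the names `capture_frontmost_text` and `ocr_context_or_empty` are hypothetical; the real providers live in the platform modules listed above.

```rust
/// Hypothetical per-platform entry points. Only the one matching the target
/// OS is compiled; each is a stub here so the sketch builds anywhere.
#[cfg(target_os = "macos")]
fn capture_frontmost_text() -> Result<String, String> {
    Err("stub: macOS provider not wired into this sketch".to_string())
}

#[cfg(target_os = "windows")]
fn capture_frontmost_text() -> Result<String, String> {
    Err("stub: Windows provider not wired into this sketch".to_string())
}

#[cfg(target_os = "linux")]
fn capture_frontmost_text() -> Result<String, String> {
    Err("stub: Linux provider not wired into this sketch".to_string())
}

#[cfg(not(any(target_os = "macos", target_os = "windows", target_os = "linux")))]
fn capture_frontmost_text() -> Result<String, String> {
    Err("OCR is not supported on this platform".to_string())
}

/// Shared caller: any provider error becomes an empty OCR context.
fn ocr_context_or_empty() -> String {
    capture_frontmost_text().unwrap_or_default()
}
```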

Prior Art Check

I searched upstream issues/PRs for OCR-related work and did not find a matching implementation.

Testing

  • Added/updated unit tests in src-tauri/src/actions.rs for OCR template resolution and truncation behavior.
  • Added Linux OCR unit tests in src-tauri/src/linux_ocr.rs for backend selection and parsing logic.
  • Frontend build and tests pass locally.
  • Linux build/test and AppImage build were verified in Ubuntu VM.
  • Note: in CommandLineTools-only macOS environments, Apple Intelligence @Generable macro tooling may fail independently; this PR does not change Apple Intelligence behavior.

AI Assistance Disclosure

  • AI used: Yes
  • Tools used: GPT-5 Codex
  • Usage: Extensive assistance across implementation, debugging, build validation, and documentation

@cjpais
Owner

cjpais commented Feb 11, 2026

Hey, thanks for this. It's gonna take me a while to get to this and merge it. Right now I'm mainly trying to fix stability issues.

I can't say 100% this will be merged, but I do think that this is a good thing. However, it would be best if we can do this on multiple platforms.

@evrenesat
Author

> it would be best if we can do this on multiple platforms.

Sure, but I don't have access to either Linux or Windows. The Mac version uses only built-in libraries of the operating system, so it doesn't bring in any additional dependencies. And since it's a hidden feature, it wouldn't cause any confusion for users of other platforms, so in its current form, waiting for feature parity seems a bit unnecessary?

Still, with this addition Handy has become perfect for me, both feature- and reliability-wise, so I'm fine living with my own build :)

Thanks for sharing and maintaining Handy! Let me know if you want me to improve this PR.

@cjpais
Owner

cjpais commented Feb 12, 2026

Understood, and it will probably be merged without support for other platforms. Ideally I'd just like to avoid gating features to one platform where possible.

@cjpais
Owner

cjpais commented Feb 16, 2026

Just dropping a note, this is going to take me a while to merge I think. I need to keep the feature set stable right now. I feel I'm drowning under PRs and issues and starting to get burned out.

Introduce a macOS-only experimental OCR template variable path for post-processing prompts.
This is intentionally a hidden feature for now: no user interface changes, no new settings surface,
and no changes to public frontend command APIs.

Architecture and runtime flow:
- Handy remains a Tauri desktop app with Rust backend (`src-tauri`) and React/TypeScript frontend (`src`).
- Post-processing prompt template expansion in `src-tauri/src/actions.rs` now supports:
  - `${output}` for transcription text
  - `${OCR}` and `${ocr}` for frontmost-window OCR context
- OCR context is only attempted when:
  - OCR template variable is present in the selected prompt
  - existing `experimental_enabled` is true
- OCR context is truncated to 8000 characters to constrain prompt growth.
- On OCR failure, missing permission, or unsupported path, OCR variable resolves to empty string;
  transcription and post-processing continue without hard failure.

macOS OCR subsystem:
- Add Swift bridge:
  - `src-tauri/swift/macos_ocr.swift`
  - `src-tauri/swift/macos_ocr_bridge.h`
- Add Rust FFI wrapper:
  - `src-tauri/src/macos_ocr.rs`
- Wire build integration in `src-tauri/build.rs` with macOS-only compilation/linking.
- Link required Apple frameworks: Foundation, AppKit, CoreGraphics, Vision.
- Request screen recording permission on demand once per app run when OCR is first needed.
- Capture frontmost window, OCR via Vision, inject text into prompt template.
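
On the Rust side, a wrapper like this typically has to turn the C string returned by the bridge into an owned `String`. The helper below is a hedged sketch: `ocr_text_from_ptr` is a hypothetical name, and it assumes the bridge returns either null or a NUL-terminated UTF-8 buffer, which the caller would later release through whatever free function the bridge header declares.

```rust
use std::ffi::{c_char, CStr};

/// Copies a NUL-terminated C string into an owned `String`.
/// A null pointer (assumed here to mean "no OCR text") becomes an empty string.
///
/// # Safety
/// `ptr` must be null or point to a valid NUL-terminated string that stays
/// alive for the duration of the call.
unsafe fn ocr_text_from_ptr(ptr: *const c_char) -> String {
    if ptr.is_null() {
        return String::new();
    }
    // Lossy conversion keeps the call infallible even if the bridge ever
    // hands back bytes that are not valid UTF-8.
    unsafe { CStr::from_ptr(ptr) }.to_string_lossy().into_owned()
}
```

Copying into a `String` right at the boundary keeps the unsafe surface small: the rest of the backend only ever sees owned Rust data.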

Cross-platform safety:
- Entire OCR bridge and runtime usage is gated for macOS.
- Non-macOS behavior is unchanged; OCR placeholders resolve safely to empty context.
- Existing speech-to-text and provider flows are preserved.

Intent:
- Improve post-processing quality by adding optional visual context from active window text,
  especially for correcting terminology/typos from dictation misunderstandings.
- Keep this as an experimental hidden capability first, to validate impact before any UI exposure.

@evrenesat force-pushed the codex/macos-ocr-hidden-context branch from 7c11a5a to c5eb176 (February 21, 2026 12:56)

Implement Linux support for OCR prompt variables using Tesseract.

Architecture:
- Add linux_ocr provider and wire it into fetch_ocr_template_value.
- Keep X11 as the primary path (active window to root fallback via Xlib).
- Add Wayland fallback strategy based on startup-time environment detection.
  - GNOME path: gnome-screenshot -f <temp.png>
  - KDE path: spectacle -b -n -o <temp.png>
- Initialize and cache backend selection at app startup to avoid repeated checks during hot paths.
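
As a rough sketch of the Wayland fallback, the two screenshot invocations named above can be built with `std::process::Command` and followed by a Tesseract run. The function names are hypothetical, and the use of `tesseract <image> stdout` to print recognized text to standard output is an assumption about how the provider calls Tesseract.

```rust
use std::path::Path;
use std::process::Command;

/// GNOME path: gnome-screenshot -f <temp.png>
fn gnome_screenshot_cmd(png: &Path) -> Command {
    let mut cmd = Command::new("gnome-screenshot");
    cmd.arg("-f").arg(png);
    cmd
}

/// KDE path: spectacle -b -n -o <temp.png>
fn spectacle_cmd(png: &Path) -> Command {
    let mut cmd = Command::new("spectacle");
    cmd.args(["-b", "-n", "-o"]).arg(png);
    cmd
}

/// Assumed Tesseract invocation: the output base `stdout` prints the text.
fn tesseract_cmd(png: &Path) -> Command {
    let mut cmd = Command::new("tesseract");
    cmd.arg(png).arg("stdout");
    cmd
}

/// Runs the capture command, then Tesseract on the resulting image.
/// A missing tool or non-zero exit yields an empty OCR context.
fn run_ocr(mut capture: Command, png: &Path) -> String {
    let captured = capture.status().map(|s| s.success()).unwrap_or(false);
    if !captured {
        return String::new();
    }
    match tesseract_cmd(png).output() {
        Ok(out) if out.status.success() => String::from_utf8_lossy(&out.stdout).into_owned(),
        _ => String::new(),
    }
}
```

A caller would pick `gnome_screenshot_cmd` or `spectacle_cmd` based on the cached backend and pass it to `run_ocr` together with the temporary PNG path.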

Why this implementation:
- Keeps existing fast X11 behavior unchanged.
- Provides practical Wayland compatibility without portal or D-Bus integration complexity in this PR.
- Avoids blocking Linux support on compositor-specific APIs and keeps runtime dependency surface simple.

Limitations:
- Wayland capture currently depends on external screenshot tools being installed (gnome-screenshot or spectacle).
- Active-window fidelity on Wayland is tool/compositor-dependent and not equivalent to direct X11 window capture.
- This is an interim compatibility path, not a full portal-based capture backend.

Future improvements noted:
- Move Wayland capture to xdg-desktop-portal for compositor-agnostic behavior.
- Consider GNOME D-Bus screenshot integration (org.gnome.Shell.Screenshot) for better window-targeted capture where appropriate.

Also updates:
- Install tesseract-ocr in Ubuntu CI build jobs.
- Add Debian package dependency for tesseract-ocr.
- Update architecture and devlog documentation.

@evrenesat changed the title from "feat(macos): add hidden OCR template context for post-processing" to "feat: add hidden OCR template context for post-processing" (Feb 21, 2026)

@evrenesat
Author

@cjpais I have now added implementations for all platforms in this PR. The Linux one is a bit hacky, but for now that's the only solution I could come up with.
