feat: add streaming transcription preview to recording overlay #864
phiresky wants to merge 1 commit into cjpais:main from
Conversation
Show a live transcript preview in the recording overlay while audio is being captured. A background thread periodically snapshots accumulated audio samples and runs them through the transcription engine, emitting updates to the overlay UI in real time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The UI needs improvements (centering? animations? sizing?). If we're going to pull in a change, the UI needs to be better than it was before; I will not accept regressions in terms of UI.

I suspect this will perform quite badly on a lot of people's computers (especially those doing CPU inference), and likely for that reason it's not going to be pulled in. That is: for systems that cannot transcribe within the given interval (500ms), this will lead to inevitable back pressure and unresponsiveness, because you have to retranscribe the buffer every time, and as the buffer gets longer, the time to transcribe it grows too. While that's good for transcription quality, it is bad for long transcriptions and for the responsiveness of the application.

To be honest, all in all, I'm not necessarily opposed to doing this. It should come after the PR that I'm proposing, because this is effectively another streaming methodology, and we can wrap all of the streaming methodologies in one place. The main issue I have is the UI change, and that bit I don't think is going to be pulled in.
I have some weird bug that I think is caused by Wayland, which makes the transparency look broken; I believe it's unrelated to this change. The reason for left-aligned text is that when new text is added to centered text, the old text keeps moving around. Not sure how that can be improved. But yeah, I didn't spend much effort on the styling. The reason it switches from centered to left-aligned, I guess, is that the grid/flex element at the beginning is not full width; that could easily be changed.
Well, pocket-tts advertises 6x real-time speed on "a CPU of MacBook Air M4". I'm running on a fairly weak laptop and it feels pretty responsive. My use case is mostly short texts, though. But for a long text, feedback is even more important? This does not block the UI, and it should only increase the latency after releasing push-to-talk by 1.5x (since the engine is locked, the final pass on average has to wait for 50% of one in-flight pass), though this could be reduced to 0, I think; I'm not sure how the engine handles threading.
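The 1.5x figure follows from a simple expected-wait argument: if one preview pass takes time T and passes run back to back, a key release lands (on average) halfway through an in-flight pass, so the final transcription waits T/2 for the engine lock before running its own full pass of length T. A toy calculation of that claim (names and the uniform-arrival assumption are illustrative, not from the PR):

```rust
// Expected end-of-utterance latency, in units of one pass time T.
// Assumes the key release lands uniformly within the in-flight pass,
// so the expected remaining time on that pass is T/2.
fn expected_multiplier() -> f64 {
    let wait_for_lock = 0.5; // E[remaining in-flight pass time] / T
    let final_pass = 1.0;    // the last full-buffer transcription / T
    wait_for_lock + final_pass
}

fn main() {
    assert_eq!(expected_multiplier(), 1.5);
    println!("latency multiplier vs. no preview: {}x", expected_multiplier());
}
```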
Back pressure is good! If processing 500ms of audio takes 2 seconds, that simply means the UI will only refresh every 2 seconds, taking the latest available audio. I did not notice any negative change in responsiveness myself. Yes, for long transcripts this might be a bit of an issue, but what's the worst case? The worst case is pinning the CPU at 100% while you're talking, because at most one transcription runs at a time. Of course, the update frequency could also be reduced as the text grows if 100% CPU isn't acceptable.
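The back-pressure behavior described above can be made concrete: because each iteration grabs the latest buffer and blocks on the engine, slow inference only lowers the refresh rate; nothing queues up. A minimal sketch with made-up timings (the 50ms/10ms values and `slow_transcribe` are illustrative stand-ins, not the PR's actual code):

```rust
use std::time::{Duration, Instant};

// Stand-in for the real engine; pretend inference takes 50ms per call,
// far slower than the nominal tick, mimicking a weak CPU.
fn slow_transcribe(n_samples: usize) -> String {
    std::thread::sleep(Duration::from_millis(50));
    format!("[{n_samples} samples]")
}

fn main() {
    let mut buffer_len = 0usize;
    let mut updates = 0;
    let start = Instant::now();

    // Each iteration snapshots the LATEST buffer and blocks on the engine.
    // When inference is slower than the tick, iterations simply run
    // back-to-back: the refresh rate drops, but no queue of stale
    // snapshots can build up, because nothing is ever enqueued.
    while start.elapsed() < Duration::from_millis(200) {
        buffer_len += 8000; // audio that accumulated since the last pass
        let _preview = slow_transcribe(buffer_len);
        updates += 1;
        // (if the engine finished before the next tick, we would
        // sleep off the remainder here)
    }

    // At ~50ms per pass over a 200ms window we get roughly 4 refreshes,
    // not 20 queued jobs for the 20 ticks that nominally elapsed.
    assert!((2..=5).contains(&updates));
    println!("refreshes in 200ms: {updates}");
}
```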
I don't think you've considered the user experience overall
I didn't mean the text, I meant the icons and the general look and feel of it overall. It doesn't look or feel very good. Already the app is lacking in good design and we don't need to throw more bad design into the mix.
I don't know why you're even bringing this up. This is a speech-to-text application, not a text-to-speech application; we don't even support pocket-tts. So overall, based on what you've said, I'm led to believe this is a low-quality contribution (you're talking nonsense about TTS, you can't see low-quality UI, and you think it's acceptable to pin someone's CPU at 100% without any option to turn it off).

And again, I made my very specific point: this may be considered as another option after my PR is pulled in. I don't have a lot of time right now, and extra features and contributions that don't have specific community support are very unlikely to be pulled in by me. Right now the app is largely in a feature freeze, and I'm going to be the only person pulling in new features for the time being. Anyone can submit new features without a lot of time and effort thanks to AI tooling (1 hour is a minuscule amount of time; I've spent probably near 1000 or more hours supporting and building this app). I also use the AI tooling extensively, and I open-sourced the app so you can fork it and add features for yourself, because I am committed to making this a stable app for everyone.

If you really want to help the application, the biggest help would be solving the outstanding issues on Linux, because I can't easily test every configuration; the more support I have from more users, the better. Right now we have to focus on stability rather than features because the app is not fundamentally stable across all platforms, and because of this I cannot keep pulling in new features all the time; it just makes the app more unstable. If you don't like my answer, please fork. I'm committed to the stability of the application rather than supporting every single feature that could possibly exist. Someone has to draw the line in the sand for the app, and that's me. If you don't like my line, fork.
"Your search for the right speech-to-text tool can end here—not because Handy is perfect, but because you can make it perfect for you." I think my line generally supports this quote from the readme, since I support forking the application so you can implement the features you want for yourself. If you really, really want this feature, go find someone to make a great UI for it, and I'll consider it.
Honestly, I mainly wrote this because I was confused why, in all of the long discussions linked above about having some, any, real-time feedback, no one did anything about it. For me, something is better than nothing, because it helps a lot with confidence in the program and with other bugs (sometimes the transcripts for me are just empty; sometimes USR2 does not get registered). It definitely at least requires a config flag, and real live transcription would obviously be better. I sent this as-is because it makes no sense to spend hours and hours on something without getting any feedback on whether it would even be considered, and I prefer submitting a prototype to filing a random feature demand as an issue.

But yes, I just started using this project today. Thank you for a great project! I understand the frustration of maintaining open source, and it's difficult to stay constructive when you care much more than the other person. I will consider spending more time on contributing if I'm still using this regularly some time in the future.


Show a live transcript preview in the recording overlay while audio is being captured. A background thread periodically snapshots accumulated audio samples and runs them through the transcription engine, emitting updates to the overlay UI in real time.
It also requires no VAD compared to your @cjpais PR from three days ago.
I find it very helpful to see which text is going to be entered.
This is much simpler than live inputting text into a real application because this has zero quality loss (no chunking) - it retroactively changes when the audio changes by simply processing the full thing in a 500ms loop ("fake" incremental processing).
This has basically no disadvantages apart from CPU use during talking.
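The loop described above can be sketched roughly as follows. This is a simplified, hypothetical rendering of the design (the `Recorder`, `push`, and `transcribe` names here are stand-ins; only `snapshot()` is named in the PR), with the capture thread joined before the preview pass so the example is deterministic, whereas the real loop runs concurrently every 500ms:

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// Stand-in for the app's recorder: capture appends samples, and
// snapshot() clones the accumulated audio mid-recording without
// draining it, so capture is never interrupted.
#[derive(Clone)]
struct Recorder {
    samples: Arc<Mutex<Vec<f32>>>,
}

impl Recorder {
    fn new() -> Self {
        Recorder { samples: Arc::new(Mutex::new(Vec::new())) }
    }

    fn push(&self, chunk: &[f32]) {
        self.samples.lock().unwrap().extend_from_slice(chunk);
    }

    fn snapshot(&self) -> Vec<f32> {
        self.samples.lock().unwrap().clone()
    }
}

// Placeholder for the real engine call; the actual app runs full
// inference over the entire snapshot each tick.
fn transcribe(samples: &[f32]) -> String {
    format!("[preview of {} samples]", samples.len())
}

fn main() {
    let recorder = Recorder::new();
    let writer = recorder.clone();

    // Simulated capture thread producing audio chunks.
    let producer = thread::spawn(move || {
        for _ in 0..4 {
            writer.push(&[0.0f32; 8000]);
            thread::sleep(Duration::from_millis(10));
        }
    });
    producer.join().unwrap();

    // One preview tick: re-transcribe the FULL buffer, so the preview
    // text can retroactively correct itself ("fake" incremental
    // processing, no chunking). In the app this result would be
    // emitted as an event to the overlay UI.
    let preview = transcribe(&recorder.snapshot());
    assert_eq!(preview, "[preview of 32000 samples]");
    println!("{preview}");
}
```

Because the preview always reprocesses the whole buffer, earlier words can be revised as more context arrives, which is where the zero-quality-loss claim comes from.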
Video:
2026-02-20-15-25-54.mp4
Before Submitting This PR
Please confirm you have done the following:
Human Written Description
Related Issues/Discussions
Fixes #
Discussion:
Chunking option for almost live processing #179
Community Feedback
AI Assistance
In total, this was around 60 minutes of human effort. I made a bunch of manual changes to simplify things, and because the AI could not figure out the CSS.
AI Disclosure Summary:
The core feature was implemented across three sessions. The initial prompt was to show a live streaming transcript in the recording overlay popup. The AI planned the architecture, then implemented it end-to-end: a `snapshot()` method on the recorder to clone accumulated audio mid-recording, a background thread that periodically transcribes the snapshot and emits events, and a React frontend that listens and displays the live text. Subsequent prompts were iterative refinements: fixing layout issues, debugging a race condition, and simplifying the code.

Key pieces: `snapshot()` on the recorder, the streaming loop in `actions.rs`, event emission to the overlay, and the React listener + display.

Files: `actions.rs`, `recorder.rs`, `audio.rs`, `transcription.rs`, `overlay.rs`, `RecordingOverlay.tsx`, `RecordingOverlay.css`, `index.html`