feat: add streaming transcription preview to recording overlay #864
phiresky wants to merge 1 commit into cjpais:main from
Conversation
Show a live transcript preview in the recording overlay while audio is being captured. A background thread periodically snapshots accumulated audio samples and runs them through the transcription engine, emitting updates to the overlay UI in real time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The UI needs improvements (centering? animations? sizing?). If we're going to pull in a change, the UI needs to be better than it was before; I will not accept regressions in terms of UI.

I suspect this will perform quite badly on a lot of people's computers (especially those doing CPU inference), and likely for that reason it's not going to be pulled in. That is: for systems that cannot transcribe within the given interval (500ms), this will lead to inevitable back pressure and unresponsiveness, because you have to retranscribe the buffer every time, and as the buffer gets longer, the time to transcribe it grows too. While that's good for transcription quality, it is bad for long transcriptions and for the responsiveness of the application.

To be honest, all in all, I'm not necessarily opposed to doing this. It should come after the PR that I'm proposing, because this is effectively another streaming methodology, and we can wrap all of the streaming methodologies in one place. The main issue I have is the UI change, and that bit I don't think is going to be pulled in.
I have some weird bug that I think is caused by Wayland, which makes the transparency look broken; I believe it's unrelated to this change. The reason for left-aligned text is that when new text is added to centered text, the old text keeps moving around. Not sure how that can be improved. But yeah, I didn't spend much effort on the styling. The reason it switches from centered to left-aligned, I guess, is that the grid/flex element at the beginning is not full width; that could easily be changed.
Well, pocket-tts advertises 6x real-time speed on "a CPU of MacBook Air M4". I'm running on a fairly weak laptop and it feels pretty responsive. My use case is mostly short texts, though. But for a long text, feedback is even more important? This does not block the UI, and it should only increase the latency after releasing push-to-talk by 1.5x (since the engine is locked, the final pass on average has to wait for 50% of one in-flight pass), though this could be reduced to 0, I think; I'm not sure how the engine handles threading.
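The 1.5x figure follows from a simple expected-wait argument: if one preview pass takes time T and passes run back to back, a key release lands (on average) halfway through an in-flight pass, so the final transcription waits T/2 for the engine lock before running its own full pass of length T. A toy calculation of that claim (names and the uniform-arrival assumption are illustrative, not from the PR):

```rust
// Expected end-of-utterance latency, in units of one pass time T.
// Assumes the key release lands uniformly within the in-flight pass,
// so the expected remaining time on that pass is T/2.
fn expected_multiplier() -> f64 {
    let wait_for_lock = 0.5; // E[remaining in-flight pass time] / T
    let final_pass = 1.0;    // the last full-buffer transcription / T
    wait_for_lock + final_pass
}

fn main() {
    assert_eq!(expected_multiplier(), 1.5);
    println!("latency multiplier vs. no preview: {}x", expected_multiplier());
}
```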
Back pressure is good! If processing 500ms of audio takes 2 seconds, that simply means the UI will only refresh every 2 seconds, taking the latest available audio. I did not notice any negative change in responsiveness myself. Yes, for long transcripts this might be a bit of an issue, but what's the worst case? The worst case is pinning the CPU at 100% while you're talking, because at most one transcription runs at a time. Of course, the update frequency could also be reduced as the text grows if 100% CPU isn't acceptable.
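The back-pressure behavior described above can be made concrete: because each iteration grabs the latest buffer and blocks on the engine, slow inference only lowers the refresh rate; nothing queues up. A minimal sketch with made-up timings (the 50ms/10ms values and `slow_transcribe` are illustrative stand-ins, not the PR's actual code):

```rust
use std::time::{Duration, Instant};

// Stand-in for the real engine; pretend inference takes 50ms per call,
// far slower than the nominal tick, mimicking a weak CPU.
fn slow_transcribe(n_samples: usize) -> String {
    std::thread::sleep(Duration::from_millis(50));
    format!("[{n_samples} samples]")
}

fn main() {
    let mut buffer_len = 0usize;
    let mut updates = 0;
    let start = Instant::now();

    // Each iteration snapshots the LATEST buffer and blocks on the engine.
    // When inference is slower than the tick, iterations simply run
    // back-to-back: the refresh rate drops, but no queue of stale
    // snapshots can build up, because nothing is ever enqueued.
    while start.elapsed() < Duration::from_millis(200) {
        buffer_len += 8000; // audio that accumulated since the last pass
        let _preview = slow_transcribe(buffer_len);
        updates += 1;
        // (if the engine finished before the next tick, we would
        // sleep off the remainder here)
    }

    // At ~50ms per pass over a 200ms window we get roughly 4 refreshes,
    // not 20 queued jobs for the 20 ticks that nominally elapsed.
    assert!((2..=5).contains(&updates));
    println!("refreshes in 200ms: {updates}");
}
```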
I don't think you've considered the user experience overall
I didn't mean the text, I meant the icons and the general look and feel of it overall. It doesn't look or feel very good. Already the app is lacking in good design and we don't need to throw more bad design into the mix.
I don't know why you're even bringing this up. This is a speech-to-text application, not a text-to-speech application; we don't even support pocket-tts. So overall, based on what you've said, I'm led to believe this is a low-quality contribution (you're talking nonsense about TTS, you can't see low-quality UI, and you think it's acceptable to pin someone's CPU at 100% without any option to turn it off).

And again, I made my very specific point: this may be considered as another option after my PR is pulled in. I don't have a lot of time right now, and extra features and contributions that don't have specific community support are very unlikely to be pulled in by me. Right now the app is largely in a feature freeze, and I'm going to be the only person pulling in new features for the time being. Anyone can submit new features without a lot of time and effort thanks to AI tooling (1 hour is a minuscule amount of time; I've spent probably near 1000 or more hours supporting and building this app). I also use the AI tooling extensively, and I open-sourced the app so you can fork it and add features for yourself, because I am committed to making this a stable app for everyone.

If you really want to help the application, the biggest help would be solving the outstanding issues on Linux, because I can't easily test every configuration; the more support I have from more users, the better. Right now we have to focus on stability rather than features because the app is not fundamentally stable across all platforms, and because of this I cannot keep pulling in new features all the time; it just makes the app more unstable. If you don't like my answer, please fork. I'm committed to the stability of the application rather than supporting every single feature that could possibly exist. Someone has to draw the line in the sand for the app, and that's me. If you don't like my line, fork.
"Your search for the right speech-to-text tool can end here—not because Handy is perfect, but because you can make it perfect for you." I think my line generally supports this quote from the readme, since I support forking the application so you can implement the features you want for yourself. If you really, really want this feature, go find someone to make a great UI for it, and I'll consider it.
Honestly, I mainly wrote this because I was confused why, in all of the long discussions linked above about having some, any, real-time feedback, no one did anything about it. For me, something is better than nothing, because it helps a lot with confidence in the program and with other bugs (sometimes the transcripts for me are just empty; sometimes USR2 does not get registered). It definitely at least requires a config flag, and real live transcription would obviously be better. I sent this as-is because it makes no sense to spend hours and hours on something without getting any feedback on whether it would even be considered, and I prefer submitting a prototype to filing a random feature demand as an issue.

But yes, I just started using this project today. Thank you for a great project! I understand the frustration of maintaining open source, and it's difficult to stay constructive when you care much more than the other person. I will consider spending more time on contributing if I'm still using this regularly some time in the future.


Show a live transcript preview in the recording overlay while audio is being captured. A background thread periodically snapshots accumulated audio samples and runs them through the transcription engine, emitting updates to the overlay UI in real time.
It also requires no VAD compared to your @cjpais PR from three days ago.
I find it very helpful to see which text is going to be entered.
This is much simpler than live inputting text into a real application because this has zero quality loss (no chunking) - it retroactively changes when the audio changes by simply processing the full thing in a 500ms loop ("fake" incremental processing).
This has basically no disadvantages apart from CPU use during talking.
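The loop described above can be sketched roughly as follows. This is a simplified, hypothetical rendering of the design (the `Recorder`, `push`, and `transcribe` names here are stand-ins; only `snapshot()` is named in the PR), with the capture thread joined before the preview pass so the example is deterministic, whereas the real loop runs concurrently every 500ms:

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// Stand-in for the app's recorder: capture appends samples, and
// snapshot() clones the accumulated audio mid-recording without
// draining it, so capture is never interrupted.
#[derive(Clone)]
struct Recorder {
    samples: Arc<Mutex<Vec<f32>>>,
}

impl Recorder {
    fn new() -> Self {
        Recorder { samples: Arc::new(Mutex::new(Vec::new())) }
    }

    fn push(&self, chunk: &[f32]) {
        self.samples.lock().unwrap().extend_from_slice(chunk);
    }

    fn snapshot(&self) -> Vec<f32> {
        self.samples.lock().unwrap().clone()
    }
}

// Placeholder for the real engine call; the actual app runs full
// inference over the entire snapshot each tick.
fn transcribe(samples: &[f32]) -> String {
    format!("[preview of {} samples]", samples.len())
}

fn main() {
    let recorder = Recorder::new();
    let writer = recorder.clone();

    // Simulated capture thread producing audio chunks.
    let producer = thread::spawn(move || {
        for _ in 0..4 {
            writer.push(&[0.0f32; 8000]);
            thread::sleep(Duration::from_millis(10));
        }
    });
    producer.join().unwrap();

    // One preview tick: re-transcribe the FULL buffer, so the preview
    // text can retroactively correct itself ("fake" incremental
    // processing, no chunking). In the app this result would be
    // emitted as an event to the overlay UI.
    let preview = transcribe(&recorder.snapshot());
    assert_eq!(preview, "[preview of 32000 samples]");
    println!("{preview}");
}
```

Because the preview always reprocesses the whole buffer, earlier words can be revised as more context arrives, which is where the zero-quality-loss claim comes from.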
Video:
2026-02-20-15-25-54.mp4
Before Submitting This PR
Please confirm you have done the following:
Human Written Description
Related Issues/Discussions
Fixes #
Discussion:
Chunking option for almost live processing #179
Community Feedback
AI Assistance
In total, this was around 60 minutes of human effort. I made a bunch of manual changes to simplify things, and because the AI could not figure out the CSS.
AI Disclosure Summary:
The core feature was implemented across three sessions. The initial prompt was to show a live streaming transcript in the recording overlay popup. The AI planned the architecture, then implemented it end-to-end: a `snapshot()` method on the recorder to clone accumulated audio mid-recording, a background thread that periodically transcribes the snapshot and emits events, and a React frontend that listens and displays the live text. Subsequent prompts were iterative refinements: fixing layout issues, debugging a race condition, and simplifying the code.

Key pieces: `snapshot()` on the recorder, the streaming loop in `actions.rs`, event emission to the overlay, and the React listener + display.

Files: `actions.rs`, `recorder.rs`, `audio.rs`, `transcription.rs`, `overlay.rs`, `RecordingOverlay.tsx`, `RecordingOverlay.css`, `index.html`