feat(stt): local Whisper transcription backend + transcribe_audio worker tool #177
Open
Marenz wants to merge 4 commits into spacedriveapp:main from
Conversation
When `routing.voice = "whisper-local://<spec>"`, audio attachments are transcribed locally instead of via the LLM provider HTTP path. `<spec>` is either:

- A known size name (tiny/base/small/medium/large) — fetched from `ggerganov/whisper.cpp` on HuggingFace via hf-hub, using the existing HF cache if already present
- An absolute path to a GGML model file

The `WhisperContext` is loaded once and cached in a `OnceLock` for the process lifetime. Audio decoding (ogg, opus, mp3, flac, wav, m4a) is handled by symphonia, with linear resampling to 16 kHz mono f32. All three deps (whisper-rs, hf-hub, symphonia) are optional behind the `stt-whisper` feature flag.
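The linear resampling step can be sketched roughly as follows. This is a minimal std-only illustration of linear interpolation down to Whisper's expected 16 kHz mono f32; the PR's actual symphonia-based decode path may implement it differently:

```rust
/// Sample rate Whisper expects for inference input.
const WHISPER_SAMPLE_RATE: u32 = 16_000;

/// Linearly resample mono f32 samples from `src_rate` to 16 kHz.
/// Illustrative sketch only, not the PR's actual implementation.
fn resample_to_16k(input: &[f32], src_rate: u32) -> Vec<f32> {
    if src_rate == WHISPER_SAMPLE_RATE || input.is_empty() {
        return input.to_vec();
    }
    let ratio = src_rate as f64 / WHISPER_SAMPLE_RATE as f64;
    let out_len = (input.len() as f64 / ratio).floor() as usize;
    let mut out = Vec::with_capacity(out_len);
    for i in 0..out_len {
        // Fractional position of this output sample in the source signal.
        let pos = i as f64 * ratio;
        let idx = pos as usize;
        let frac = (pos - idx as f64) as f32;
        let a = input[idx];
        // Clamp at the end of the buffer instead of reading past it.
        let b = *input.get(idx + 1).unwrap_or(&a);
        // Linear interpolation between adjacent source samples.
        out.push(a + (b - a) * frac);
    }
    out
}

fn main() {
    // One second of 48 kHz audio becomes 16 000 samples.
    let input = vec![0.5f32; 48_000];
    let out = resample_to_16k(&input, 48_000);
    assert_eq!(out.len(), 16_000);
    assert!(out.iter().all(|&s| (s - 0.5).abs() < 1e-6));
    println!("resampled {} -> {} samples", input.len(), out.len());
}
```

Linear interpolation is the simplest choice here; it trades some high-frequency fidelity for zero dependencies, which is usually acceptable for speech-to-text input.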
Force-pushed 139bb52 to d9b8764
Workers can now call `transcribe_audio(path)` to transcribe local audio files. The tool uses whatever is configured in `routing.voice` — local Whisper (`whisper-local://<spec>`) or any OpenAI-compatible HTTP provider. The transcription logic is extracted from `channel.rs` into `stt.rs` as `transcribe_bytes()`, shared by both the channel attachment handler and the new tool. The stt module is now always compiled (no longer gated on `stt-whisper`), since it handles all provider paths.
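The routing decision presumably comes down to inspecting the `routing.voice` string. A hedged sketch of how the `whisper-local://<spec>` scheme could be dispatched — the enum and function names here are hypothetical, not the PR's actual types:

```rust
/// Hypothetical backend selection derived from `routing.voice`.
/// Names are illustrative; the PR's internal types may differ.
#[derive(Debug, PartialEq)]
enum SttBackend {
    /// Known size name, fetched from ggerganov/whisper.cpp on HuggingFace.
    LocalNamed(String),
    /// Absolute path to a GGML model file.
    LocalPath(String),
    /// Anything else routes to an OpenAI-compatible HTTP provider.
    HttpProvider(String),
}

const KNOWN_SIZES: &[&str] = &["tiny", "base", "small", "medium", "large", "large-v3"];

fn parse_voice_route(voice: &str) -> SttBackend {
    match voice.strip_prefix("whisper-local://") {
        Some(spec) if KNOWN_SIZES.contains(&spec) => SttBackend::LocalNamed(spec.to_string()),
        Some(spec) => SttBackend::LocalPath(spec.to_string()),
        None => SttBackend::HttpProvider(voice.to_string()),
    }
}

fn main() {
    assert_eq!(
        parse_voice_route("whisper-local://base"),
        SttBackend::LocalNamed("base".into())
    );
    assert_eq!(
        parse_voice_route("whisper-local:///models/ggml-base.bin"),
        SttBackend::LocalPath("/models/ggml-base.bin".into())
    );
    assert!(matches!(
        parse_voice_route("http://example.test/v1"),
        SttBackend::HttpProvider(_)
    ));
    println!("routing dispatch ok");
}
```

Keeping this dispatch inside a single `transcribe_bytes()` entry point is what lets both the channel attachment handler and the worker tool share one code path.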
Re-submission of #105, which was merged into the `voice` branch before that branch landed in `main`; the commits never reached `main`.

**Local Whisper backend**
Set `routing.voice = "whisper-local://<spec>"` in config. When the channel processes an audio attachment, it bypasses the HTTP provider path and runs inference locally via whisper-rs. `<spec>` is a known size name (tiny, base, small, medium, large, large-v3), downloaded from `ggerganov/whisper.cpp` on HuggingFace, or an absolute path to a GGML model file. The `WhisperContext` is cached for the process lifetime; switching models requires a restart.

Audio decoding: Ogg/Opus (Telegram voice messages) is handled via the `ogg` + `opus` crates. All other formats fall through to symphonia. Both paths resample to 16 kHz mono f32 before Whisper.

GPU acceleration is available via the `vulkan` feature (CUDA is excluded due to a GCC 14+/nvcc incompatibility on modern distros).

Everything is behind the `stt-whisper` cargo feature. Builds without it are unaffected.

**STT unification + transcribe_audio worker tool**
The transcription logic is extracted from `channel.rs` into a shared `stt::transcribe_bytes()` function that handles both the local Whisper path and any OpenAI-compatible HTTP provider. Workers now get a `transcribe_audio` tool that reads a local audio file and transcribes it using whatever is configured in `routing.voice`, with no need to shell out to the whisper CLI.
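Assuming a TOML config file (the PR text does not show the config format, so this is a sketch), the routing modes described above might be expressed as:

```toml
[routing]
# Local Whisper by model size; downloaded from HuggingFace on first use:
voice = "whisper-local://base"

# ...or by absolute path to a GGML model file:
# voice = "whisper-local:///models/ggml-large-v3.bin"

# ...or any other value, routed to an OpenAI-compatible HTTP provider
# (exact provider syntax is illustrative):
# voice = "openai/whisper-1"
```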