
feat(stt): local Whisper transcription backend + transcribe_audio worker tool#177

Open
Marenz wants to merge 4 commits into spacedriveapp:main from Marenz:feat/stt-local-v2

Conversation

Marenz (Collaborator) commented Feb 23, 2026

Re-submission of #105, which was merged into the voice branch before that branch landed in main — the commits never reached main.

Local Whisper backend

Set routing.voice = "whisper-local://<spec>" in config. When the channel processes an audio attachment, it bypasses the HTTP provider path and runs inference locally via whisper-rs.
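A minimal config sketch of the two forms the value can take (the surrounding table layout beyond the `routing.voice` key is an assumption):

```toml
[routing]
# A known size name: the model is fetched from ggerganov/whisper.cpp
# on HuggingFace the first time it is needed.
voice = "whisper-local://base"

# Or an absolute path to a GGML model file already on disk:
# voice = "whisper-local:///models/ggml-large-v3.bin"
```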

<spec> is a known size name (tiny, base, small, medium, large, large-v3) — downloaded from ggerganov/whisper.cpp on HuggingFace — or an absolute path to a GGML model file. The WhisperContext is cached for the process lifetime; switching models requires a restart.
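How `<spec>` resolution might look, sketched with hypothetical names (the PR's actual parsing code may differ):

```rust
/// Illustrative model-spec type: either a known size name resolved
/// against ggerganov/whisper.cpp, or an absolute path to a GGML file.
#[derive(Debug, PartialEq)]
enum ModelSpec {
    Named(String),
    Path(String),
}

fn resolve_spec(spec: &str) -> Option<ModelSpec> {
    const KNOWN: &[&str] = &["tiny", "base", "small", "medium", "large", "large-v3"];
    if KNOWN.contains(&spec) {
        Some(ModelSpec::Named(spec.to_string()))
    } else if spec.starts_with('/') {
        Some(ModelSpec::Path(spec.to_string()))
    } else {
        None // neither a known size name nor an absolute path
    }
}

fn main() {
    assert_eq!(resolve_spec("base"), Some(ModelSpec::Named("base".into())));
    assert_eq!(
        resolve_spec("/models/ggml-tiny.bin"),
        Some(ModelSpec::Path("/models/ggml-tiny.bin".into()))
    );
    assert_eq!(resolve_spec("gigantic"), None);
}
```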

Audio decoding: Ogg/Opus (Telegram voice messages) is handled via the ogg + opus crates. All other formats fall through to symphonia. Both paths resample to 16 kHz mono f32 before Whisper.
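Both decode paths converge on the same preprocessing step. A self-contained sketch of what "16 kHz mono f32" entails (function names are illustrative, not the PR's):

```rust
/// Downmix interleaved samples to mono by averaging channels per frame.
fn to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    interleaved
        .chunks(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

/// Linear-interpolation resampler, e.g. 48 kHz -> 16 kHz for Whisper.
fn resample_linear(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    if from_hz == to_hz || input.is_empty() {
        return input.to_vec();
    }
    let ratio = from_hz as f64 / to_hz as f64;
    let out_len = (input.len() as f64 / ratio).floor() as usize;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * ratio;
            let idx = pos as usize;
            let frac = (pos - idx as f64) as f32;
            let a = input[idx];
            let b = input[(idx + 1).min(input.len() - 1)];
            a + (b - a) * frac // interpolate between neighboring samples
        })
        .collect()
}

fn main() {
    // Stereo 48 kHz -> mono 16 kHz: frame count drops by 3x.
    let stereo = vec![0.0f32; 4800 * 2];
    let mono = to_mono(&stereo, 2);
    assert_eq!(mono.len(), 4800);
    assert_eq!(resample_linear(&mono, 48_000, 16_000).len(), 1600);
}
```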

GPU acceleration via the vulkan feature (CUDA excluded due to GCC 14+/nvcc incompatibility on modern distros).

Everything is behind the stt-whisper cargo feature. Builds without it are unaffected.

STT unification + transcribe_audio worker tool

The transcription logic is extracted from channel.rs into a shared stt::transcribe_bytes() function that handles both the local Whisper path and any OpenAI-compatible HTTP provider. Workers now get a transcribe_audio tool that reads a local audio file and transcribes it using whatever is configured in routing.voice — no need to shell out to the whisper CLI.

When routing.voice = "whisper-local://<spec>", audio attachments are
transcribed locally instead of via the LLM provider HTTP path.

<spec> is either:
- A known size name (tiny/base/small/medium/large) — fetched from
  ggerganov/whisper.cpp on HuggingFace via hf-hub, using the existing
  HF cache if already present
- An absolute path to a GGML model file

The WhisperContext is loaded once and cached in a OnceLock for the
process lifetime. Audio decoding (ogg, opus, mp3, flac, wav, m4a) is
handled by symphonia with linear resampling to 16 kHz mono f32.
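The OnceLock caching pattern also explains the restart requirement: once initialized, later calls get the first model regardless of spec. A sketch, with a placeholder struct standing in for whisper-rs's WhisperContext:

```rust
use std::sync::OnceLock;

/// Placeholder for the real WhisperContext (hypothetical here).
struct Model {
    name: String,
}

static MODEL: OnceLock<Model> = OnceLock::new();

/// First caller loads the model; every later caller gets the cached one.
/// A later call with a different spec still returns the first model,
/// which is why switching models requires a process restart.
fn model(spec: &str) -> &'static Model {
    MODEL.get_or_init(|| Model { name: spec.to_string() })
}

fn main() {
    assert_eq!(model("base").name, "base");
    // Cached: the new spec is ignored once the lock is initialized.
    assert_eq!(model("large-v3").name, "base");
}
```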

All three deps (whisper-rs, hf-hub, symphonia) are optional behind the
stt-whisper feature flag.
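The feature wiring might look roughly like this (version numbers and the exact feature graph are assumptions; only the `stt-whisper` and `vulkan` feature names appear in the PR):

```toml
[features]
# Pulls in the optional STT dependencies.
stt-whisper = ["dep:whisper-rs", "dep:hf-hub", "dep:symphonia"]
# GPU acceleration via whisper.cpp's Vulkan backend.
vulkan = ["stt-whisper", "whisper-rs/vulkan"]

[dependencies]
whisper-rs = { version = "0.12", optional = true }
hf-hub = { version = "0.3", optional = true }
symphonia = { version = "0.5", optional = true }
```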

Workers can now call transcribe_audio(path) to transcribe local audio
files. The tool uses whatever is configured in routing.voice — local
Whisper (whisper-local://<spec>) or any OpenAI-compatible HTTP provider.

The transcription logic is extracted from channel.rs into stt.rs as
transcribe_bytes(), shared by both the channel attachment handler and
the new tool. The stt module is now always compiled (not gated on
stt-whisper) since it handles all provider paths.
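The dispatch on routing.voice can be sketched as follows (enum, function names, and signatures are illustrative assumptions, not the PR's API; the real tool reads the file and hands its bytes to stt::transcribe_bytes()):

```rust
/// Which backend a routing.voice value selects (hypothetical type).
enum Backend {
    WhisperLocal(String), // whisper-local://<spec>
    HttpProvider(String), // any OpenAI-compatible HTTP provider
}

fn backend_for(voice: &str) -> Backend {
    match voice.strip_prefix("whisper-local://") {
        Some(spec) => Backend::WhisperLocal(spec.to_string()),
        None => Backend::HttpProvider(voice.to_string()),
    }
}

/// Stub showing only which path the tool would take for a given config.
fn transcribe_audio(path: &str, voice: &str) -> String {
    match backend_for(voice) {
        Backend::WhisperLocal(spec) => format!("local whisper ({spec}) <- {path}"),
        Backend::HttpProvider(url) => format!("http provider ({url}) <- {path}"),
    }
}

fn main() {
    assert!(transcribe_audio("/tmp/a.ogg", "whisper-local://base").starts_with("local"));
    assert!(transcribe_audio("/tmp/a.ogg", "https://api.example.com/v1").starts_with("http"));
}
```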
@Marenz Marenz changed the title feat(stt): local Whisper transcription backend via whisper-rs feat(stt): local Whisper transcription backend + transcribe_audio worker tool Feb 23, 2026