
feat(stt): local Whisper transcription backend + transcribe_audio worker tool#177

Open
Marenz wants to merge 4 commits into spacedriveapp:main from Marenz:feat/stt-local-v2

Conversation

Marenz (Collaborator) commented Feb 23, 2026

Re-submission of #105, which was merged into the voice branch before that branch landed in main — the commits never reached main.

Local Whisper backend

Set routing.voice = "whisper-local://<spec>" in config. When the channel processes an audio attachment, it bypasses the HTTP provider path and runs inference locally via whisper-rs.
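A minimal config sketch of the two forms the value can take (the surrounding table layout beyond the `routing.voice` key is an assumption):

```toml
[routing]
# A known size name: the model is fetched from ggerganov/whisper.cpp
# on HuggingFace the first time it is needed.
voice = "whisper-local://base"

# Or an absolute path to a GGML model file already on disk:
# voice = "whisper-local:///models/ggml-large-v3.bin"
```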

<spec> is a known size name (tiny, base, small, medium, large, large-v3) — downloaded from ggerganov/whisper.cpp on HuggingFace — or an absolute path to a GGML model file. The WhisperContext is cached for the process lifetime; switching models requires a restart.
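How `<spec>` resolution might look, sketched with hypothetical names (the PR's actual parsing code may differ):

```rust
/// Illustrative model-spec type: either a known size name resolved
/// against ggerganov/whisper.cpp, or an absolute path to a GGML file.
#[derive(Debug, PartialEq)]
enum ModelSpec {
    Named(String),
    Path(String),
}

fn resolve_spec(spec: &str) -> Option<ModelSpec> {
    const KNOWN: &[&str] = &["tiny", "base", "small", "medium", "large", "large-v3"];
    if KNOWN.contains(&spec) {
        Some(ModelSpec::Named(spec.to_string()))
    } else if spec.starts_with('/') {
        Some(ModelSpec::Path(spec.to_string()))
    } else {
        None // neither a known size name nor an absolute path
    }
}

fn main() {
    assert_eq!(resolve_spec("base"), Some(ModelSpec::Named("base".into())));
    assert_eq!(
        resolve_spec("/models/ggml-tiny.bin"),
        Some(ModelSpec::Path("/models/ggml-tiny.bin".into()))
    );
    assert_eq!(resolve_spec("gigantic"), None);
}
```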

Audio decoding: Ogg/Opus (Telegram voice messages) is handled via the ogg + opus crates. All other formats fall through to symphonia. Both paths resample to 16 kHz mono f32 before Whisper.
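Both decode paths converge on the same preprocessing step. A self-contained sketch of what "16 kHz mono f32" entails (function names are illustrative, not the PR's):

```rust
/// Downmix interleaved samples to mono by averaging channels per frame.
fn to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    interleaved
        .chunks(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

/// Linear-interpolation resampler, e.g. 48 kHz -> 16 kHz for Whisper.
fn resample_linear(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    if from_hz == to_hz || input.is_empty() {
        return input.to_vec();
    }
    let ratio = from_hz as f64 / to_hz as f64;
    let out_len = (input.len() as f64 / ratio).floor() as usize;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * ratio;
            let idx = pos as usize;
            let frac = (pos - idx as f64) as f32;
            let a = input[idx];
            let b = input[(idx + 1).min(input.len() - 1)];
            a + (b - a) * frac // interpolate between neighboring samples
        })
        .collect()
}

fn main() {
    // Stereo 48 kHz -> mono 16 kHz: frame count drops by 3x.
    let stereo = vec![0.0f32; 4800 * 2];
    let mono = to_mono(&stereo, 2);
    assert_eq!(mono.len(), 4800);
    assert_eq!(resample_linear(&mono, 48_000, 16_000).len(), 1600);
}
```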

GPU acceleration via the vulkan feature (CUDA excluded due to GCC 14+/nvcc incompatibility on modern distros).

Everything is behind the stt-whisper cargo feature. Builds without it are unaffected.

STT unification + transcribe_audio worker tool

The transcription logic is extracted from channel.rs into a shared stt::transcribe_bytes() function that handles both the local Whisper path and any OpenAI-compatible HTTP provider. Workers now get a transcribe_audio tool that reads a local audio file and transcribes it using whatever is configured in routing.voice — no need to shell out to the whisper CLI.

When routing.voice = "whisper-local://<spec>", audio attachments are
transcribed locally instead of via the LLM provider HTTP path.

<spec> is either:
- A known size name (tiny/base/small/medium/large) — fetched from
  ggerganov/whisper.cpp on HuggingFace via hf-hub, using the existing
  HF cache if already present
- An absolute path to a GGML model file

The WhisperContext is loaded once and cached in a OnceLock for the
process lifetime. Audio decoding (ogg, opus, mp3, flac, wav, m4a) is
handled by symphonia with linear resampling to 16 kHz mono f32.
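The OnceLock caching pattern also explains the restart requirement: once initialized, later calls get the first model regardless of spec. A sketch, with a placeholder struct standing in for whisper-rs's WhisperContext:

```rust
use std::sync::OnceLock;

/// Placeholder for the real WhisperContext (hypothetical here).
struct Model {
    name: String,
}

static MODEL: OnceLock<Model> = OnceLock::new();

/// First caller loads the model; every later caller gets the cached one.
/// A later call with a different spec still returns the first model,
/// which is why switching models requires a process restart.
fn model(spec: &str) -> &'static Model {
    MODEL.get_or_init(|| Model { name: spec.to_string() })
}

fn main() {
    assert_eq!(model("base").name, "base");
    // Cached: the new spec is ignored once the lock is initialized.
    assert_eq!(model("large-v3").name, "base");
}
```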

All three deps (whisper-rs, hf-hub, symphonia) are optional behind the
stt-whisper feature flag.
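The feature wiring might look roughly like this (version numbers and the exact feature graph are assumptions; only the `stt-whisper` and `vulkan` feature names appear in the PR):

```toml
[features]
# Pulls in the optional STT dependencies.
stt-whisper = ["dep:whisper-rs", "dep:hf-hub", "dep:symphonia"]
# GPU acceleration via whisper.cpp's Vulkan backend.
vulkan = ["stt-whisper", "whisper-rs/vulkan"]

[dependencies]
whisper-rs = { version = "0.12", optional = true }
hf-hub = { version = "0.3", optional = true }
symphonia = { version = "0.5", optional = true }
```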

Workers can now call transcribe_audio(path) to transcribe local audio
files. The tool uses whatever is configured in routing.voice — local
Whisper (whisper-local://<spec>) or any OpenAI-compatible HTTP provider.

The transcription logic is extracted from channel.rs into stt.rs as
transcribe_bytes(), shared by both the channel attachment handler and
the new tool. The stt module is now always compiled (not gated on
stt-whisper) since it handles all provider paths.
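The dispatch on routing.voice can be sketched as follows (enum, function names, and signatures are illustrative assumptions, not the PR's API; the real tool reads the file and hands its bytes to stt::transcribe_bytes()):

```rust
/// Which backend a routing.voice value selects (hypothetical type).
enum Backend {
    WhisperLocal(String), // whisper-local://<spec>
    HttpProvider(String), // any OpenAI-compatible HTTP provider
}

fn backend_for(voice: &str) -> Backend {
    match voice.strip_prefix("whisper-local://") {
        Some(spec) => Backend::WhisperLocal(spec.to_string()),
        None => Backend::HttpProvider(voice.to_string()),
    }
}

/// Stub showing only which path the tool would take for a given config.
fn transcribe_audio(path: &str, voice: &str) -> String {
    match backend_for(voice) {
        Backend::WhisperLocal(spec) => format!("local whisper ({spec}) <- {path}"),
        Backend::HttpProvider(url) => format!("http provider ({url}) <- {path}"),
    }
}

fn main() {
    assert!(transcribe_audio("/tmp/a.ogg", "whisper-local://base").starts_with("local"));
    assert!(transcribe_audio("/tmp/a.ogg", "https://api.example.com/v1").starts_with("http"));
}
```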
@Marenz Marenz changed the title feat(stt): local Whisper transcription backend via whisper-rs feat(stt): local Whisper transcription backend + transcribe_audio worker tool Feb 23, 2026