Description
Add Speech-to-Text support so that Discord voice messages (audio attachments) are automatically transcribed and forwarded to downstream ACP agents as text input.
This is the next step after image attachment support (PR #158, #210) tracked in #161.
Use Cases
- Mobile users sending voice messages — Discord mobile users often prefer tapping the voice message button over typing, especially for longer questions or bug descriptions. Without STT, these messages are silently dropped.
- Accessibility — Users with motor disabilities, RSI, or situational impairments (driving, cooking) rely on voice input. STT makes the agent accessible to a wider audience.
- Multilingual teams — Voice messages capture tone and intent that may be lost in typed text, especially for non-native English speakers. Whisper supports 50+ languages out of the box.
- Quick context dumps — A 30-second voice note is faster than typing 200 words. With STT, the agent receives it as actionable text.
- Workflow continuity — In a team Discord server, a mix of text and voice messages is natural. Without STT, voice messages create "blind spots" where the agent misses context in threads.
Prior Art Investigation
We investigated how OpenClaw and Hermes Agent implement STT:
OpenClaw (openclaw/openclaw)
- Core STT lives in `src/media-understanding/` — a framework-level module, not plugin-specific.
  - `audio-preflight.ts` — transcribes audio attachments before mention checking, so voice notes work in group chats with `requireMention: true`.
  - `audio-transcription-runner.ts` — unified `runAudioTranscription()` entry point using a capability-based pipeline (`runCapability("audio")`).
  - `openai-compatible-audio.ts` — generic OpenAI-compatible `/audio/transcriptions` endpoint via `FormData` multipart POST.
- Built-in providers (by priority): OpenAI (`gpt-4o-transcribe`), Groq (`whisper-large-v3-turbo`), Deepgram (`nova-3`), Google (`gemini-3-flash-preview`), Mistral (`voxtral-mini-latest`).
- Includes a hallucination guard (`transcript-policy.ts`) to filter known Whisper subtitle-credit artifacts.
Hermes Agent (NousResearch/hermes-agent)
- Core STT in `tools/transcription_tools.py` — standalone Python module.
- Three providers with auto-fallback: local (`faster-whisper`, free, ~150MB model), Groq (free tier), OpenAI (paid).
- Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm, ogg, aac, flac.
- Discord voice channel integration via `VoiceReceiver` class: RTP → NaCl decrypt → DAVE E2EE decrypt → Opus decode → PCM → WAV (ffmpeg, 16 kHz mono) → `transcribe_audio()`.
- Includes `is_whisper_hallucination()` guard for known Whisper artifacts.
Key Differences
| Aspect | OpenClaw | Hermes Agent |
|---|---|---|
| Language | TypeScript | Python |
| Local STT | No (API only) | Yes (`faster-whisper`) |
| Providers | 5 (OpenAI, Groq, Deepgram, Google, Mistral) | 4 (local, Groq, OpenAI, Mistral) |
| Voice channel capture | No | Yes (full RTP + realtime STT) |
| Preflight transcription | Yes (before mention check) | No |
| Hallucination guard | Yes | Yes |
Proposed Design for openab
openab — Proposed STT Design
═══════════════════════════════════════════════════════════════════
Discord User
│
│ sends voice message (.ogg opus)
▼
┌───────────┐
│ Discord │
│ Gateway │
└─────┬─────┘
│ msg.attachments[0].content_type = "audio/ogg"
▼
┌───────────────────────────────────────────────┐
│ discord.rs │
│ │
│ ┌──────────────┐ ┌─────────────────────┐ │
│ │ text content │ │ audio/* attachment? │ │
│ │ (existing) │ │ │ │
│ └──────┬───────┘ └────────┬────────────┘ │
│ │ │ │
│ │ stt.enabled? │
│ │ ┌───┴───┐ │
│ │ yes no │
│ │ │ │ │
│ │ ▼ ▼ │
│ │ download skip + warn │
│ │ .ogg │
│ │ │ │
└─────────┼──────────────┼───────────────────────┘
│ │
│ ▼
│ ┌─────────────────────┐
│ │ stt.rs │ (new module, ~60 lines)
│ │ │
│ │ POST /audio/ │
│ │ transcriptions │──────► Groq API (free tier)
│ │ │ whisper-large-v3-turbo
│ │ FormData: │
│ │ file: audio.ogg │◄────── { "text": "..." }
│ │ model: whisper-* │
│ └────────┬────────────┘
│ │
│ │ transcript string
│ ▼
│ "[Voice transcript]: hello, I have a bug..."
│ │
▼ ▼
┌──────────────────────────────────────┐
│ content_blocks: Vec&lt;ContentBlock&gt;    │
│ │
│ [0] ContentBlock::Text { │
│ "<sender_context>..." │
│ "[Voice transcript]: ..." │ ◄── injected here
│ + original text (if any) │
│ } │
│ [1] ContentBlock::Image { ... } │ ◄── if image attached
│ │
└──────────────────┬───────────────────┘
│
▼
┌──────────────────────────────────────┐
│ ACP Connection │
│ session_prompt(content_blocks) │
│ │
│ JSON-RPC → "prompt": [ │
│ {"type":"text", "text":"..."}, │
│ {"type":"image", ...} │
│ ] │
└──────────────────┬───────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Downstream ACP Agent │
│ (Kiro CLI / Claude Code / etc) │
│ │
│ Receives voice as plain text ✅ │
└──────────────────────────────────────┘
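The detection and injection path in the diagram can be sketched in a few lines of Rust. This is a minimal sketch: `is_audio_attachment`, `build_prompt_text`, and the exact prompt prefix are illustrative names and choices, not openab's actual API.

```rust
/// Returns true if a Discord attachment looks like a voice message.
/// Discord voice messages arrive as `audio/ogg` (Opus); matching the whole
/// `audio/*` family also covers uploaded mp3/wav/m4a files.
fn is_audio_attachment(content_type: Option<&str>) -> bool {
    content_type.map_or(false, |ct| ct.starts_with("audio/"))
}

/// Builds the text block sent to the ACP agent: the transcript (if any) is
/// prepended so downstream agents see the voice note as plain text.
fn build_prompt_text(transcript: Option<&str>, original: &str) -> String {
    match transcript {
        Some(t) => format!("[Voice message transcript]: {t}\n{original}"),
        None => original.to_string(),
    }
}

fn main() {
    assert!(is_audio_attachment(Some("audio/ogg")));
    assert!(!is_audio_attachment(Some("image/png")));
    assert_eq!(
        build_prompt_text(Some("hello, I have a bug"), "see attached"),
        "[Voice message transcript]: hello, I have a bug\nsee attached"
    );
}
```

With this split, `discord.rs` only decides *whether* to transcribe; the transcript itself comes back from `stt.rs` as a plain `String`.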
Configuration
```yaml
stt:                                          # optional — omit to disable
  enabled: true
  api_key: ${GROQ_API_KEY}
  model: whisper-large-v3-turbo
  # base_url: https://api.groq.com/openai/v1  # default for Groq
```
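A plausible Rust shape for that section, as a sketch only: the struct name `SttConfig` matches the Files Changed table below, but the field types and the `Default` impl supplying the Groq endpoint are assumptions.

```rust
/// Sketch of the optional STT config section. Field names mirror the YAML
/// above; Default supplies the Groq endpoint and model so a user only has
/// to set `api_key`. (Exact types and defaults are assumptions.)
#[derive(Debug, Clone, PartialEq)]
pub struct SttConfig {
    pub enabled: bool,
    pub api_key: String,
    pub model: String,
    pub base_url: String,
}

impl Default for SttConfig {
    fn default() -> Self {
        Self {
            enabled: false, // absent `stt:` section == feature disabled
            api_key: String::new(),
            model: "whisper-large-v3-turbo".to_string(),
            base_url: "https://api.groq.com/openai/v1".to_string(),
        }
    }
}

fn main() {
    // A user who only sets enabled + api_key gets the Groq defaults.
    let cfg = SttConfig { enabled: true, api_key: "gsk_...".into(), ..Default::default() };
    assert_eq!(cfg.base_url, "https://api.groq.com/openai/v1");
    assert_eq!(cfg.model, "whisper-large-v3-turbo");
}
```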
Deployment Options
Since openab uses a generic OpenAI-compatible `/audio/transcriptions` endpoint, the same code supports cloud APIs, local whisper servers, and self-hosted solutions — all via the `base_url` config field.
┌─────────────────────────────────────────────────────────────────┐
│ openab (stt.rs) │
│ POST {base_url}/audio/transcriptions │
└────────┬──────────────┬──────────────────┬──────────────────────┘
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌─────────────┐ ┌─────────────────────────┐
│ Cloud API │ │ Local Server│ │ LAN / Sidecar Server │
└──────────────┘ └─────────────┘ └─────────────────────────┘
| Option | `base_url` | Cost | Latency | Privacy | Setup |
|---|---|---|---|---|---|
| Groq Cloud (recommended) | `https://api.groq.com/openai/v1` | Free tier (rate limited) | ~1-2s | Audio sent to Groq | Just set `api_key` |
| OpenAI Cloud | `https://api.openai.com/v1` | ~$0.006/min | ~1-2s | Audio sent to OpenAI | Set `api_key` |
| Local whisper server (Mac Mini, home lab) | `http://localhost:8080/v1` | Free | ~2-5s (CPU) / ~1s (Apple Silicon) | Audio stays local ✅ | Run a local server (see below) |
| LAN / sidecar server | `http://192.168.x.x:8080/v1` | Free | ~1-3s | Audio stays in network ✅ | Run server on another machine |
| Ollama | ❌ Not supported | — | — | — | Ollama does not expose `/audio/transcriptions` |
Local Whisper Server Options (for Mac Mini / home lab users)
Users who already run or want to run a local whisper server can point openab at it directly:
- faster-whisper-server (`pip install faster-whisper-server`)
- whisper.cpp server (`brew install whisper-cpp`)
- LocalAI
Example config for a local whisper server on the same Mac Mini:
```yaml
stt:
  enabled: true
  base_url: http://localhost:8080/v1
  api_key: "not-needed"
  model: large-v3-turbo
```
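Because the endpoint is a plain OpenAI-compatible multipart POST, `stt.rs` can stay small. The sketch below shows how the `multipart/form-data` body for `POST {base_url}/audio/transcriptions` is laid out; in the real module an HTTP client crate (e.g. reqwest's multipart support) would assemble this, so the hand-rolled function here is purely illustrative.

```rust
/// Builds a multipart/form-data body with the two fields the OpenAI-style
/// transcription endpoint expects: the audio file ("file") and the model
/// name ("model"). Shown by hand only to illustrate the wire format.
fn build_transcription_body(boundary: &str, filename: &str, model: &str, audio: &[u8]) -> Vec<u8> {
    let mut body = Vec::new();
    // Part 1: the audio file itself.
    let file_header = format!(
        "--{boundary}\r\nContent-Disposition: form-data; name=\"file\"; filename=\"{filename}\"\r\nContent-Type: audio/ogg\r\n\r\n"
    );
    body.extend_from_slice(file_header.as_bytes());
    body.extend_from_slice(audio);
    // Part 2: the "model" field, followed by the closing boundary.
    let tail = format!(
        "\r\n--{boundary}\r\nContent-Disposition: form-data; name=\"model\"\r\n\r\n{model}\r\n--{boundary}--\r\n"
    );
    body.extend_from_slice(tail.as_bytes());
    body
}

fn main() {
    let body = build_transcription_body("BOUND1", "voice.ogg", "whisper-large-v3-turbo", b"opus-bytes");
    let text = String::from_utf8_lossy(&body);
    assert!(text.contains("name=\"file\"; filename=\"voice.ogg\""));
    assert!(text.ends_with("--BOUND1--\r\n"));
}
```

The response on success is a small JSON object (`{"text": "..."}`), so parsing the transcript back out is a one-liner with any JSON library.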
Files Changed
| File | Change | Description |
|---|---|---|
| `src/stt.rs` | NEW | ~60 lines — HTTP POST to `/audio/transcriptions` |
| `src/discord.rs` | MOD | Detect audio attachment → call STT → inject transcript |
| `src/config.rs` | MOD | Add `SttConfig` struct |
| `src/main.rs` | MOD | Wire `SttConfig` |
Implementation Steps
- Discord handler (`src/discord.rs`): Detect `audio/*` attachments (Discord voice messages are `.ogg` Opus) and download the file.
- STT module (`src/stt.rs`): Call an OpenAI-compatible `/audio/transcriptions` endpoint (Groq free tier with `whisper-large-v3-turbo` is the most practical default).
- ACP prompt injection: Prepend the transcript to the ACP prompt text content, e.g. `[Voice message transcript]: <text>`.
- Configuration: Add an optional `stt` section to the config. No config = feature disabled, zero impact.
- Hallucination guard (optional): Filter known Whisper artifacts such as subtitle credits.
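The optional guard from the steps above can be a simple blocklist check, in the spirit of OpenClaw's `transcript-policy.ts` and Hermes's `is_whisper_hallucination()`. The phrases below are examples of commonly reported Whisper artifacts on near-silent audio, not the actual lists those projects ship:

```rust
/// Returns true if a transcript looks like a known Whisper hallucination.
/// Whisper tends to emit subtitle credits or stock sign-off phrases when fed
/// silence or noise; the phrase list here is illustrative, not exhaustive.
fn is_whisper_hallucination(transcript: &str) -> bool {
    const ARTIFACTS: [&str; 3] = [
        "thanks for watching",
        "subtitles by",
        "subscribe to my channel",
    ];
    let t = transcript.trim().to_lowercase();
    t.is_empty() || ARTIFACTS.iter().any(|a| t.contains(a))
}

fn main() {
    assert!(is_whisper_hallucination("Thanks for watching!"));
    assert!(is_whisper_hallucination("   "));
    assert!(!is_whisper_hallucination("hello, I have a bug in the login flow"));
}
```

When the guard fires, the attachment would be skipped (or flagged) instead of injecting a bogus transcript into the ACP prompt.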
References
- OpenClaw: `src/media-understanding/` (`audio-transcription-runner.ts`, `audio-preflight.ts`, `openai-compatible-audio.ts`)
- Hermes Agent: `tools/transcription_tools.py`, `gateway/platforms/discord.py`