feat: support voice message STT (Speech-to-Text) for Discord #224

@chaodu-agent

Description

Add Speech-to-Text support so that Discord voice messages (audio attachments) are automatically transcribed and forwarded to downstream ACP agents as text input.

This is the next step after image attachment support (PR #158, #210) tracked in #161.

Use Cases

  1. Mobile users sending voice messages — Discord mobile users often prefer tapping the voice message button over typing, especially for longer questions or bug descriptions. Without STT, these messages are silently dropped.
  2. Accessibility — Users with motor disabilities, RSI, or situational impairments (driving, cooking) rely on voice input. STT makes the agent accessible to a wider audience.
  3. Multilingual teams — Voice messages capture tone and intent that may be lost in typed text, especially for non-native English speakers. Whisper supports 50+ languages out of the box.
  4. Quick context dumps — A 30-second voice note is faster than typing 200 words. With STT, the agent receives it as actionable text.
  5. Workflow continuity — In a team Discord server, a mix of text and voice messages is natural. Without STT, voice messages create "blind spots" where the agent misses context in threads.

Prior Art Investigation

We investigated how OpenClaw and Hermes Agent implement STT:

OpenClaw (openclaw/openclaw)

  • Core STT lives in src/media-understanding/ — a framework-level module, not plugin-specific.
  • audio-preflight.ts — transcribes audio attachments before mention checking, so voice notes work in group chats with requireMention: true.
  • audio-transcription-runner.ts — unified runAudioTranscription() entry point using a capability-based pipeline (runCapability("audio")).
  • openai-compatible-audio.ts — generic OpenAI-compatible /audio/transcriptions endpoint via FormData multipart POST.
  • Built-in providers (by priority): OpenAI (gpt-4o-transcribe), Groq (whisper-large-v3-turbo), Deepgram (nova-3), Google (gemini-3-flash-preview), Mistral (voxtral-mini-latest).
  • Includes a hallucination guard (transcript-policy.ts) to filter known Whisper subtitle-credit artifacts.

Hermes Agent (NousResearch/hermes-agent)

  • Core STT in tools/transcription_tools.py — standalone Python module.
  • Three providers with auto-fallback: local (faster-whisper, free, ~150MB model), Groq (free tier), OpenAI (paid).
  • Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm, ogg, aac, flac.
  • Discord voice channel integration via VoiceReceiver class: RTP → NaCl decrypt → DAVE E2EE decrypt → Opus decode → PCM → WAV (ffmpeg 16kHz mono) → transcribe_audio().
  • Includes is_whisper_hallucination() guard for known Whisper artifacts.

Key Differences

| Aspect | OpenClaw | Hermes Agent |
| --- | --- | --- |
| Language | TypeScript | Python |
| Local STT | No (API only) | Yes (faster-whisper) |
| Providers | 5 (OpenAI, Groq, Deepgram, Google, Mistral) | 4 (local, Groq, OpenAI, Mistral) |
| Voice channel capture | No | Yes (full RTP + realtime STT) |
| Preflight transcription | Yes (before mention check) | No |
| Hallucination guard | Yes | Yes |

Proposed Design for openab

                         openab — Proposed STT Design
 ═══════════════════════════════════════════════════════════════════

  Discord User
       │
       │  sends voice message (.ogg opus)
       ▼
 ┌───────────┐
 │  Discord   │
 │  Gateway   │
 └─────┬─────┘
       │  msg.attachments[0].content_type = "audio/ogg"
       ▼
 ┌───────────────────────────────────────────────┐
 │              discord.rs                        │
 │                                                │
 │  ┌──────────────┐     ┌─────────────────────┐ │
 │  │ text content  │     │ audio/* attachment?  │ │
 │  │ (existing)    │     │                     │ │
 │  └──────┬───────┘     └────────┬────────────┘ │
 │         │                      │               │
 │         │               stt.enabled?           │
 │         │              ┌───┴───┐               │
 │         │             yes      no              │
 │         │              │       │               │
 │         │              ▼       ▼               │
 │         │         download   skip + warn       │
 │         │          .ogg                        │
 │         │              │                       │
 └─────────┼──────────────┼───────────────────────┘
           │              │
           │              ▼
           │   ┌─────────────────────┐
           │   │      stt.rs         │  (new module, ~60 lines)
           │   │                     │
           │   │  POST /audio/       │
           │   │   transcriptions    │──────► Groq API (free tier)
           │   │                     │        whisper-large-v3-turbo
           │   │  FormData:          │
           │   │   file: audio.ogg   │◄────── { "text": "..." }
           │   │   model: whisper-*  │
           │   └────────┬────────────┘
           │            │
           │            │ transcript string
           │            ▼
           │   "[Voice transcript]: hello, I have a bug..."
           │            │
           ▼            ▼
 ┌──────────────────────────────────────┐
 │         content_blocks: Vec          │
 │                                      │
 │  [0] ContentBlock::Text {            │
 │        "<sender_context>..."         │
 │        "[Voice transcript]: ..."     │  ◄── injected here
 │        + original text (if any)      │
 │      }                               │
 │  [1] ContentBlock::Image { ... }     │  ◄── if image attached
 │                                      │
 └──────────────────┬───────────────────┘
                    │
                    ▼
 ┌──────────────────────────────────────┐
 │          ACP Connection              │
 │     session_prompt(content_blocks)   │
 │                                      │
 │  JSON-RPC → "prompt": [             │
 │    {"type":"text", "text":"..."},    │
 │    {"type":"image", ...}             │
 │  ]                                   │
 └──────────────────┬───────────────────┘
                    │
                    ▼
 ┌──────────────────────────────────────┐
 │        Downstream ACP Agent          │
 │     (Kiro CLI / Claude Code / etc)   │
 │                                      │
 │  Receives voice as plain text ✅     │
 └──────────────────────────────────────┘
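The injection step in the diagram above can be sketched as follows. Note that `ContentBlock` and `build_text_block` are illustrative stand-ins, not openab's actual types — the real content-block shape comes from the ACP protocol types already used for image support:

```rust
// Illustrative sketch of the transcript-injection step; `ContentBlock`
// and the field layout are assumptions, not the real openab/ACP types.

#[derive(Debug)]
enum ContentBlock {
    Text(String),
    // Image { .. } omitted for brevity
}

/// Build the text content block: sender context first, then the voice
/// transcript (if any), then the original typed text (if any).
fn build_text_block(
    sender_context: &str,
    transcript: Option<&str>,
    original_text: &str,
) -> ContentBlock {
    let mut parts = vec![sender_context.to_string()];
    if let Some(t) = transcript {
        parts.push(format!("[Voice transcript]: {t}"));
    }
    if !original_text.is_empty() {
        parts.push(original_text.to_string());
    }
    ContentBlock::Text(parts.join("\n"))
}

fn main() {
    let ContentBlock::Text(text) = build_text_block(
        "<sender_context>alice</sender_context>",
        Some("hello, I have a bug with the deploy script"),
        "",
    );
    assert!(text.contains("[Voice transcript]: hello"));
    println!("{text}");
}
```

Because the transcript rides inside the existing text block, downstream ACP agents need no changes — a voice message is indistinguishable from typed text.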

Configuration

```yaml
stt:                                # optional — omit to disable
  enabled: true
  api_key: ${GROQ_API_KEY}
  model: whisper-large-v3-turbo
  # base_url: https://api.groq.com/openai/v1   # default for groq
```
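The corresponding `SttConfig` struct (step for `src/config.rs`) could look roughly like this — field names mirror the YAML keys, and the defaults shown here are assumptions, not settled decisions:

```rust
// Hypothetical shape of `SttConfig`; the real struct would likely derive
// serde's Deserialize alongside the rest of openab's config types.

#[derive(Debug, Clone)]
struct SttConfig {
    enabled: bool,
    api_key: String,
    model: String,
    base_url: String,
}

impl Default for SttConfig {
    fn default() -> Self {
        Self {
            // Omitting the `stt:` section leaves the feature disabled.
            enabled: false,
            api_key: String::new(),
            model: "whisper-large-v3-turbo".into(),
            base_url: "https://api.groq.com/openai/v1".into(),
        }
    }
}

fn main() {
    let cfg = SttConfig::default();
    assert!(!cfg.enabled);
    assert_eq!(cfg.base_url, "https://api.groq.com/openai/v1");
}
```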

Deployment Options

Since openab uses a generic OpenAI-compatible /audio/transcriptions endpoint, the same code supports cloud APIs, local whisper servers, and self-hosted solutions — all via the base_url config field.
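For reference, the multipart/form-data body such an endpoint expects can be built as below. This is a std-only sketch to show the wire format; the actual `stt.rs` would more likely use an HTTP crate (e.g. reqwest with its `multipart` feature) rather than hand-rolling the body:

```rust
// Builds a multipart/form-data body for POST {base_url}/audio/transcriptions.
// Std-only sketch of the wire format; names and boundary are illustrative.

fn multipart_body(boundary: &str, model: &str, filename: &str, audio: &[u8]) -> Vec<u8> {
    let mut body = Vec::new();
    // `model` field, e.g. "whisper-large-v3-turbo"
    body.extend_from_slice(
        format!(
            "--{boundary}\r\nContent-Disposition: form-data; name=\"model\"\r\n\r\n{model}\r\n"
        )
        .as_bytes(),
    );
    // `file` field carrying the raw downloaded .ogg bytes
    body.extend_from_slice(
        format!(
            "--{boundary}\r\nContent-Disposition: form-data; name=\"file\"; \
             filename=\"{filename}\"\r\nContent-Type: audio/ogg\r\n\r\n"
        )
        .as_bytes(),
    );
    body.extend_from_slice(audio);
    // Closing boundary terminates the multipart body
    body.extend_from_slice(format!("\r\n--{boundary}--\r\n").as_bytes());
    body
}

fn main() {
    let body = multipart_body("XBOUNDARY", "whisper-large-v3-turbo", "voice.ogg", b"OggS...");
    let text = String::from_utf8_lossy(&body);
    assert!(text.contains("name=\"model\""));
    assert!(text.contains("filename=\"voice.ogg\""));
}
```

The request is sent with a `Content-Type: multipart/form-data; boundary=...` header, and the server replies with JSON of the form `{ "text": "..." }`.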

 ┌─────────────────────────────────────────────────────────────────┐
 │                    openab  (stt.rs)                              │
 │              POST {base_url}/audio/transcriptions                │
 └────────┬──────────────┬──────────────────┬──────────────────────┘
          │              │                  │
          ▼              ▼                  ▼
 ┌──────────────┐ ┌─────────────┐ ┌─────────────────────────┐
 │  Cloud API   │ │ Local Server│ │  LAN / Sidecar Server   │
 └──────────────┘ └─────────────┘ └─────────────────────────┘
| Option | base_url | Cost | Latency | Privacy | Setup |
| --- | --- | --- | --- | --- | --- |
| Groq Cloud (recommended) | https://api.groq.com/openai/v1 | Free tier (rate limited) | ~1-2s | Audio sent to Groq | Just set api_key |
| OpenAI Cloud | https://api.openai.com/v1 | ~$0.006/min | ~1-2s | Audio sent to OpenAI | Set api_key |
| Local whisper server (Mac Mini, home lab) | http://localhost:8080/v1 | Free | ~2-5s (CPU) / ~1s (Apple Silicon) | Audio stays local ✅ | Run a local server (see below) |
| LAN / sidecar server | http://192.168.x.x:8080/v1 | Free | ~1-3s | Audio stays in network ✅ | Run server on another machine |
| Ollama | ❌ Not supported | — | — | — | Ollama does not expose /audio/transcriptions |

Local Whisper Server Options (for Mac Mini / home lab users)

Users who already run or want to run a local whisper server can point openab at it directly:

| Server | Install | Apple Silicon | GPU | OpenAI-compatible |
| --- | --- | --- | --- | --- |
| faster-whisper-server | `pip install faster-whisper-server` | ✅ CoreML | ✅ CUDA | ✅ |
| whisper.cpp server | `brew install whisper-cpp` | ✅ Metal | ✅ CUDA | ✅ |
| LocalAI | Docker or binary | — | ✅ CUDA | ✅ |

Example config for a local whisper server on the same Mac Mini:

```yaml
stt:
  enabled: true
  base_url: http://localhost:8080/v1
  api_key: "not-needed"
  model: large-v3-turbo
```

Files Changed

| File | Change | Description |
| --- | --- | --- |
| src/stt.rs | NEW | ~60 lines — HTTP POST to /audio/transcriptions |
| src/discord.rs | MOD | Detect audio attachment → call STT → inject transcript |
| src/config.rs | MOD | Add SttConfig struct |
| src/main.rs | MOD | Wire SttConfig |

Implementation Steps

  1. Discord handler (src/discord.rs): Detect audio/* attachments (Discord voice messages are .ogg Opus), download the file.
  2. STT module (src/stt.rs): Call an OpenAI-compatible /audio/transcriptions endpoint (Groq free tier with whisper-large-v3-turbo is the most practical default).
  3. ACP prompt injection: Prepend the transcript to the ACP prompt text content, e.g. [Voice message transcript]: <text>.
  4. Configuration: Add optional stt section to config. No config = feature disabled, zero impact.
  5. Hallucination guard (optional): Filter known Whisper artifacts like subtitle credits.
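A guard along the lines of step 5 could be sketched as below. The artifact list here is illustrative only — both OpenClaw (transcript-policy.ts) and Hermes Agent (is_whisper_hallucination) maintain curated phrase lists, which would be the place to crib from:

```rust
// Sketch of an optional hallucination guard. On silent or near-silent
// audio, Whisper tends to emit subtitle-credit phrases learned from its
// training data; short transcripts matching such phrases are dropped.

/// Known Whisper artifacts (illustrative subset, lowercase).
const KNOWN_ARTIFACTS: &[&str] = &[
    "subtitles by the amara.org community",
    "thanks for watching",
    "thank you for watching",
];

fn is_likely_hallucination(transcript: &str) -> bool {
    let t = transcript.trim().to_lowercase();
    // Empty transcripts, or exact matches of a known artifact phrase
    // (ignoring trailing punctuation), are treated as noise.
    t.is_empty()
        || KNOWN_ARTIFACTS
            .iter()
            .any(|a| t.trim_end_matches(&['.', '!'][..]) == *a)
}

fn main() {
    assert!(is_likely_hallucination("Thanks for watching!"));
    assert!(!is_likely_hallucination("hello, I found a bug in the deploy script"));
}
```

When the guard fires, discord.rs would simply skip the transcript injection instead of forwarding bogus text to the agent.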
