feat: support voice message STT (Speech-to-Text) for Discord #224

@chaodu-agent

Description

Add Speech-to-Text support so that Discord voice messages (audio attachments) are automatically transcribed and forwarded to downstream ACP agents as text input.

This is the next step after image attachment support (PR #158, #210) tracked in #161.

Use Cases

  1. Mobile users sending voice messages — Discord mobile users often prefer tapping the voice message button over typing, especially for longer questions or bug descriptions. Without STT, these messages are silently dropped.
  2. Accessibility — Users with motor disabilities, RSI, or situational impairments (driving, cooking) rely on voice input. STT makes the agent accessible to a wider audience.
  3. Multilingual teams — Voice messages capture tone and intent that may be lost in typed text, especially for non-native English speakers. Whisper supports 50+ languages out of the box.
  4. Quick context dumps — A 30-second voice note is faster than typing 200 words. With STT, the agent receives it as actionable text.
  5. Workflow continuity — In a team Discord server, a mix of text and voice messages is natural. Without STT, voice messages create "blind spots" where the agent misses context in threads.

Prior Art Investigation

We investigated how OpenClaw and Hermes Agent implement STT:

OpenClaw (openclaw/openclaw)

  • Core STT lives in src/media-understanding/ — a framework-level module, not plugin-specific.
  • audio-preflight.ts — transcribes audio attachments before mention checking, so voice notes work in group chats with requireMention: true.
  • audio-transcription-runner.ts — unified runAudioTranscription() entry point using a capability-based pipeline (runCapability("audio")).
  • openai-compatible-audio.ts — generic OpenAI-compatible /audio/transcriptions endpoint via FormData multipart POST.
  • Built-in providers (by priority): OpenAI (gpt-4o-transcribe), Groq (whisper-large-v3-turbo), Deepgram (nova-3), Google (gemini-3-flash-preview), Mistral (voxtral-mini-latest).
  • Includes a hallucination guard (transcript-policy.ts) to filter known Whisper subtitle-credit artifacts.

Hermes Agent (NousResearch/hermes-agent)

  • Core STT in tools/transcription_tools.py — standalone Python module.
  • Three providers with auto-fallback: local (faster-whisper, free, ~150MB model), Groq (free tier), OpenAI (paid).
  • Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm, ogg, aac, flac.
  • Discord voice channel integration via VoiceReceiver class: RTP → NaCl decrypt → DAVE E2EE decrypt → Opus decode → PCM → WAV (ffmpeg 16kHz mono) → transcribe_audio().
  • Includes is_whisper_hallucination() guard for known Whisper artifacts.

Key Differences

| Aspect | OpenClaw | Hermes Agent |
| --- | --- | --- |
| Language | TypeScript | Python |
| Local STT | No (API only) | Yes (faster-whisper) |
| Providers | 5 (OpenAI, Groq, Deepgram, Google, Mistral) | 4 (local, Groq, OpenAI, Mistral) |
| Voice channel capture | No | Yes (full RTP + realtime STT) |
| Preflight transcription | Yes (before mention check) | No |
| Hallucination guard | Yes | Yes |

Proposed Design for openab

                         openab — Proposed STT Design
 ═══════════════════════════════════════════════════════════════════

  Discord User
       │
       │  sends voice message (.ogg opus)
       ▼
 ┌───────────┐
 │  Discord   │
 │  Gateway   │
 └─────┬─────┘
       │  msg.attachments[0].content_type = "audio/ogg"
       ▼
 ┌───────────────────────────────────────────────┐
 │              discord.rs                        │
 │                                                │
 │  ┌──────────────┐     ┌─────────────────────┐ │
 │  │ text content  │     │ audio/* attachment?  │ │
 │  │ (existing)    │     │                     │ │
 │  └──────┬───────┘     └────────┬────────────┘ │
 │         │                      │               │
 │         │               stt.enabled?           │
 │         │              ┌───┴───┐               │
 │         │             yes      no              │
 │         │              │       │               │
 │         │              ▼       ▼               │
 │         │         download   skip + warn       │
 │         │          .ogg                        │
 │         │              │                       │
 └─────────┼──────────────┼───────────────────────┘
           │              │
           │              ▼
           │   ┌─────────────────────┐
           │   │      stt.rs         │  (new module, ~60 lines)
           │   │                     │
           │   │  POST /audio/       │
           │   │   transcriptions    │──────► Groq API (free tier)
           │   │                     │        whisper-large-v3-turbo
           │   │  FormData:          │
           │   │   file: audio.ogg   │◄────── { "text": "..." }
           │   │   model: whisper-*  │
           │   └────────┬────────────┘
           │            │
           │            │ transcript string
           │            ▼
           │   "[Voice transcript]: hello, I have a bug..."
           │            │
           ▼            ▼
 ┌──────────────────────────────────────┐
 │         content_blocks: Vec          │
 │                                      │
 │  [0] ContentBlock::Text {            │
 │        "<sender_context>..."         │
 │        "[Voice transcript]: ..."     │  ◄── injected here
 │        + original text (if any)      │
 │      }                               │
 │  [1] ContentBlock::Image { ... }     │  ◄── if image attached
 │                                      │
 └──────────────────┬───────────────────┘
                    │
                    ▼
 ┌──────────────────────────────────────┐
 │          ACP Connection              │
 │     session_prompt(content_blocks)   │
 │                                      │
 │  JSON-RPC → "prompt": [             │
 │    {"type":"text", "text":"..."},    │
 │    {"type":"image", ...}             │
 │  ]                                   │
 └──────────────────┬───────────────────┘
                    │
                    ▼
 ┌──────────────────────────────────────┐
 │        Downstream ACP Agent          │
 │     (Kiro CLI / Claude Code / etc)   │
 │                                      │
 │  Receives voice as plain text ✅     │
 └──────────────────────────────────────┘
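The injection step in the diagram above can be sketched as follows. Note that `ContentBlock` and `build_text_block` are illustrative stand-ins, not openab's actual types — the real content-block shape comes from the ACP protocol types already used for image support:

```rust
// Illustrative sketch of the transcript-injection step; `ContentBlock`
// and the field layout are assumptions, not the real openab/ACP types.

#[derive(Debug)]
enum ContentBlock {
    Text(String),
    // Image { .. } omitted for brevity
}

/// Build the text content block: sender context first, then the voice
/// transcript (if any), then the original typed text (if any).
fn build_text_block(
    sender_context: &str,
    transcript: Option<&str>,
    original_text: &str,
) -> ContentBlock {
    let mut parts = vec![sender_context.to_string()];
    if let Some(t) = transcript {
        parts.push(format!("[Voice transcript]: {t}"));
    }
    if !original_text.is_empty() {
        parts.push(original_text.to_string());
    }
    ContentBlock::Text(parts.join("\n"))
}

fn main() {
    let ContentBlock::Text(text) = build_text_block(
        "<sender_context>alice</sender_context>",
        Some("hello, I have a bug with the deploy script"),
        "",
    );
    assert!(text.contains("[Voice transcript]: hello"));
    println!("{text}");
}
```

Because the transcript rides inside the existing text block, downstream ACP agents need no changes — a voice message is indistinguishable from typed text.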

Configuration

```yaml
stt:                                # optional — omit to disable
  enabled: true
  api_key: ${GROQ_API_KEY}
  model: whisper-large-v3-turbo
  # base_url: https://api.groq.com/openai/v1   # default for groq
```
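The corresponding `SttConfig` struct (step for `src/config.rs`) could look roughly like this — field names mirror the YAML keys, and the defaults shown here are assumptions, not settled decisions:

```rust
// Hypothetical shape of `SttConfig`; the real struct would likely derive
// serde's Deserialize alongside the rest of openab's config types.

#[derive(Debug, Clone)]
struct SttConfig {
    enabled: bool,
    api_key: String,
    model: String,
    base_url: String,
}

impl Default for SttConfig {
    fn default() -> Self {
        Self {
            // Omitting the `stt:` section leaves the feature disabled.
            enabled: false,
            api_key: String::new(),
            model: "whisper-large-v3-turbo".into(),
            base_url: "https://api.groq.com/openai/v1".into(),
        }
    }
}

fn main() {
    let cfg = SttConfig::default();
    assert!(!cfg.enabled);
    assert_eq!(cfg.base_url, "https://api.groq.com/openai/v1");
}
```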

Deployment Options

Since openab uses a generic OpenAI-compatible /audio/transcriptions endpoint, the same code supports cloud APIs, local whisper servers, and self-hosted solutions — all via the base_url config field.
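For reference, the multipart/form-data body such an endpoint expects can be built as below. This is a std-only sketch to show the wire format; the actual `stt.rs` would more likely use an HTTP crate (e.g. reqwest with its `multipart` feature) rather than hand-rolling the body:

```rust
// Builds a multipart/form-data body for POST {base_url}/audio/transcriptions.
// Std-only sketch of the wire format; names and boundary are illustrative.

fn multipart_body(boundary: &str, model: &str, filename: &str, audio: &[u8]) -> Vec<u8> {
    let mut body = Vec::new();
    // `model` field, e.g. "whisper-large-v3-turbo"
    body.extend_from_slice(
        format!(
            "--{boundary}\r\nContent-Disposition: form-data; name=\"model\"\r\n\r\n{model}\r\n"
        )
        .as_bytes(),
    );
    // `file` field carrying the raw downloaded .ogg bytes
    body.extend_from_slice(
        format!(
            "--{boundary}\r\nContent-Disposition: form-data; name=\"file\"; \
             filename=\"{filename}\"\r\nContent-Type: audio/ogg\r\n\r\n"
        )
        .as_bytes(),
    );
    body.extend_from_slice(audio);
    // Closing boundary terminates the multipart body
    body.extend_from_slice(format!("\r\n--{boundary}--\r\n").as_bytes());
    body
}

fn main() {
    let body = multipart_body("XBOUNDARY", "whisper-large-v3-turbo", "voice.ogg", b"OggS...");
    let text = String::from_utf8_lossy(&body);
    assert!(text.contains("name=\"model\""));
    assert!(text.contains("filename=\"voice.ogg\""));
}
```

The request is sent with a `Content-Type: multipart/form-data; boundary=...` header, and the server replies with JSON of the form `{ "text": "..." }`.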

 ┌─────────────────────────────────────────────────────────────────┐
 │                    openab  (stt.rs)                              │
 │              POST {base_url}/audio/transcriptions                │
 └────────┬──────────────┬──────────────────┬──────────────────────┘
          │              │                  │
          ▼              ▼                  ▼
 ┌──────────────┐ ┌─────────────┐ ┌─────────────────────────┐
 │  Cloud API   │ │ Local Server│ │  LAN / Sidecar Server   │
 └──────────────┘ └─────────────┘ └─────────────────────────┘
| Option | base_url | Cost | Latency | Privacy | Setup |
| --- | --- | --- | --- | --- | --- |
| Groq Cloud (recommended) | https://api.groq.com/openai/v1 | Free tier (rate limited) | ~1-2s | Audio sent to Groq | Just set api_key |
| OpenAI Cloud | https://api.openai.com/v1 | ~$0.006/min | ~1-2s | Audio sent to OpenAI | Set api_key |
| Local whisper server (Mac Mini, home lab) | http://localhost:8080/v1 | Free | ~2-5s (CPU) / ~1s (Apple Silicon) | Audio stays local ✅ | Run a local server (see below) |
| LAN / sidecar server | http://192.168.x.x:8080/v1 | Free | ~1-3s | Audio stays in network ✅ | Run server on another machine |
| Ollama | ❌ Not supported | — | — | — | Ollama does not expose /audio/transcriptions |

Local Whisper Server Options (for Mac Mini / home lab users)

Users who already run or want to run a local whisper server can point openab at it directly:

| Server | Install | Apple Silicon | GPU | OpenAI-compatible |
| --- | --- | --- | --- | --- |
| faster-whisper-server | `pip install faster-whisper-server` | ✅ CoreML | ✅ CUDA | ✅ |
| whisper.cpp server | `brew install whisper-cpp` | ✅ Metal | ✅ CUDA | ✅ |
| LocalAI | Docker or binary | — | ✅ CUDA | ✅ |

Example config for a local whisper server on the same Mac Mini:

```yaml
stt:
  enabled: true
  base_url: http://localhost:8080/v1
  api_key: "not-needed"
  model: large-v3-turbo
```

Files Changed

| File | Change | Description |
| --- | --- | --- |
| src/stt.rs | NEW | ~60 lines — HTTP POST to /audio/transcriptions |
| src/discord.rs | MOD | Detect audio attachment → call STT → inject transcript |
| src/config.rs | MOD | Add SttConfig struct |
| src/main.rs | MOD | Wire SttConfig |

Implementation Steps

  1. Discord handler (src/discord.rs): Detect audio/* attachments (Discord voice messages are .ogg Opus), download the file.
  2. STT module (src/stt.rs): Call an OpenAI-compatible /audio/transcriptions endpoint (Groq free tier with whisper-large-v3-turbo is the most practical default).
  3. ACP prompt injection: Prepend the transcript to the ACP prompt text content, e.g. [Voice message transcript]: <text>.
  4. Configuration: Add optional stt section to config. No config = feature disabled, zero impact.
  5. Hallucination guard (optional): Filter known Whisper artifacts like subtitle credits.
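A guard along the lines of step 5 could be sketched as below. The artifact list here is illustrative only — both OpenClaw (transcript-policy.ts) and Hermes Agent (is_whisper_hallucination) maintain curated phrase lists, which would be the place to crib from:

```rust
// Sketch of an optional hallucination guard. On silent or near-silent
// audio, Whisper tends to emit subtitle-credit phrases learned from its
// training data; short transcripts matching such phrases are dropped.

/// Known Whisper artifacts (illustrative subset, lowercase).
const KNOWN_ARTIFACTS: &[&str] = &[
    "subtitles by the amara.org community",
    "thanks for watching",
    "thank you for watching",
];

fn is_likely_hallucination(transcript: &str) -> bool {
    let t = transcript.trim().to_lowercase();
    // Empty transcripts, or exact matches of a known artifact phrase
    // (ignoring trailing punctuation), are treated as noise.
    t.is_empty()
        || KNOWN_ARTIFACTS
            .iter()
            .any(|a| t.trim_end_matches(&['.', '!'][..]) == *a)
}

fn main() {
    assert!(is_likely_hallucination("Thanks for watching!"));
    assert!(!is_likely_hallucination("hello, I found a bug in the deploy script"));
}
```

When the guard fires, discord.rs would simply skip the transcript injection instead of forwarding bogus text to the agent.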
