feat: support voice message STT (Speech-to-Text) for Discord #225

Merged

thepagent merged 8 commits into openabdev:main from chaodu-agent:feat/stt-voice-message on Apr 11, 2026

@chaodu-agent (Collaborator) commented on Apr 11, 2026

Summary

Add optional Speech-to-Text support that transcribes Discord voice message attachments via any OpenAI-compatible /audio/transcriptions endpoint and injects the transcript into the ACP prompt as text.

Closes #224

Changes

| File | Change | Description |
|---|---|---|
| src/stt.rs | NEW | ~50 lines: HTTP POST to /audio/transcriptions via reqwest multipart |
| src/config.rs | MOD | Add SttConfig struct (enabled, api_key, model, base_url) |
| src/discord.rs | MOD | Detect audio/* attachments → download → call STT → inject transcript |
| src/main.rs | MOD | Wire in mod stt and pass the STT config to Handler |
| Cargo.toml | MOD | Add the multipart and json features to reqwest |
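As a rough illustration of what src/stt.rs sends, here is a std-only sketch of two request pieces described in this PR: the endpoint URL derived from base_url and the text fields of the multipart form. The helper names are hypothetical. The real module builds a reqwest multipart form and puts the audio in a file part (named file in the OpenAI API) with the attachment's MIME type; this sketch leaves the HTTP call out entirely.

```rust
/// Hypothetical helper: appends the OpenAI-compatible transcription path
/// to the configured base_url, tolerating a trailing slash.
fn transcription_url(base_url: &str) -> String {
    format!("{}/audio/transcriptions", base_url.trim_end_matches('/'))
}

/// Hypothetical helper: the text fields sent next to the audio file part.
/// response_format=json was added during review to fix local servers; an
/// OpenAI-style JSON reply carries the transcript in a "text" field.
fn form_text_fields(model: &str) -> Vec<(&'static str, String)> {
    vec![
        ("model", model.to_string()),
        ("response_format", "json".to_string()),
    ]
}
```

Trimming the trailing slash lets both https://api.groq.com/openai/v1 and http://localhost:8080/v1/ resolve to the same path layout.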

How it works

```text
Discord voice msg (.ogg)
       │
       ▼
  download audio
       │
       ▼
  POST /audio/transcriptions ──► Groq / OpenAI / local whisper server
       │
       ▼
  inject "[Voice message transcript]: ..." as ContentBlock::Text
       │
       ▼
  ACP agent receives plain text ✅
```
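The discord.rs flow in the diagram can be sketched as two small steps: classify an attachment by its MIME type, then wrap the transcript as a text block. ContentBlock here is a simplified local enum, and classify and transcript_block are hypothetical names; the actual handler also downloads the audio and calls the STT module between these two steps.

```rust
/// Simplified stand-in for the ACP prompt content type (hypothetical;
/// openab's real ContentBlock has more variants than this).
#[derive(Debug, PartialEq)]
enum ContentBlock {
    Text(String),
}

/// How an incoming attachment is routed.
#[derive(Debug, PartialEq)]
enum AttachmentKind {
    Audio,
    Image,
    Other,
}

/// Classify from the MIME type Discord reports (e.g. Some("audio/ogg"))
/// before downloading anything, so an audio attachment is fetched only
/// for STT and never on the image path. None means no type was reported.
fn classify(content_type: Option<&str>) -> AttachmentKind {
    match content_type {
        Some(ct) if ct.starts_with("audio/") => AttachmentKind::Audio,
        Some(ct) if ct.starts_with("image/") => AttachmentKind::Image,
        _ => AttachmentKind::Other,
    }
}

/// Wrap a transcript with the prefix shown in the diagram, so the agent
/// can tell the text came from a voice message.
fn transcript_block(transcript: &str) -> ContentBlock {
    ContentBlock::Text(format!("[Voice message transcript]: {transcript}"))
}
```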

Configuration

The feature is opt-in: it is disabled by default and has no effect when the [stt] section is omitted.
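To show what "disabled by default" can look like in code, here is a hypothetical mirror of the [stt] section. The field names come from this PR and the defaults from its Groq example (whisper-large-v3-turbo at https://api.groq.com/openai/v1); the field types, the Option around api_key, and the resolve helper are assumptions. A real config loader would typically deserialize this with serde rather than build it by hand.

```rust
/// Hypothetical mirror of the [stt] config section.
#[derive(Debug, Clone, PartialEq)]
struct SttConfig {
    enabled: bool,
    api_key: Option<String>, // assumed optional so it can be filled in later
    model: String,
    base_url: String,
}

impl Default for SttConfig {
    fn default() -> Self {
        SttConfig {
            enabled: false, // opt-in: nothing happens unless set to true
            api_key: None,
            model: "whisper-large-v3-turbo".to_string(),
            base_url: "https://api.groq.com/openai/v1".to_string(),
        }
    }
}

/// A missing [stt] section resolves to the disabled default, so the
/// Discord handler never reaches the transcription path.
fn resolve(section: Option<SttConfig>) -> SttConfig {
    section.unwrap_or_default()
}
```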

Cloud (Groq free tier — default)

```toml
[stt]
enabled = true
api_key = "${GROQ_API_KEY}"
model = "whisper-large-v3-turbo"
# base_url = "https://api.groq.com/openai/v1"  # default
```
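The Groq example passes the key as "${GROQ_API_KEY}", which suggests that openab substitutes environment variables into config values when it loads the file. A minimal std-only sketch of that kind of substitution follows. The function name, the lookup closure, and the rule that unknown variables expand to an empty string are all assumptions, not openab's documented behavior.

```rust
/// Hypothetical: replace each ${NAME} in `value` using `lookup`.
/// Unknown names expand to "" here (an assumption; a loader could error).
fn expand_env(value: &str, lookup: impl Fn(&str) -> Option<String>) -> String {
    let mut out = String::new();
    let mut rest = value;
    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        match after.find('}') {
            Some(end) => {
                let name = &after[..end];
                out.push_str(&lookup(name).unwrap_or_default());
                rest = &after[end + 1..];
            }
            None => {
                // Unterminated placeholder: keep the remainder verbatim.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

In a real loader the lookup would be something like |k| std::env::var(k).ok(); taking it as a closure keeps the sketch testable without touching the process environment.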

Local whisper server (Mac Mini / home lab)

For users already running a local whisper server (faster-whisper-server, whisper.cpp, LocalAI, etc.), just point base_url to it:

```toml
[stt]
enabled = true
api_key = "not-needed"
model = "large-v3-turbo"
base_url = "http://localhost:8080/v1"
```

No code changes are needed: it is the same OpenAI-compatible /audio/transcriptions endpoint at a different URL.

All supported deployment options

| Option | base_url | Cost |
|---|---|---|
| Groq Cloud (default) | https://api.groq.com/openai/v1 | Free tier |
| OpenAI | https://api.openai.com/v1 | ~$0.006/min |
| Local whisper server | http://localhost:8080/v1 | Free |
| LAN / sidecar | http://192.168.x.x:8080/v1 | Free |
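The per-minute rates in the table translate into a simple estimate. This sketch assumes a flat rate applied to total audio duration, using the approximate OpenAI figure above; it ignores any minimum billing increments or rounding a provider may apply, and the function name is hypothetical.

```rust
/// Approximate transcription cost in USD for a total amount of audio,
/// assuming a flat per-minute rate (e.g. ~0.006 for OpenAI).
fn estimated_cost_usd(total_audio_seconds: f64, usd_per_minute: f64) -> f64 {
    total_audio_seconds / 60.0 * usd_per_minute
}
```

For example, 500 voice messages averaging 30 seconds is 250 minutes of audio, or about $1.50 at ~$0.006/min.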

@chaodu-agent requested a review from thepagent as a code owner on April 11, 2026 21:21
openab-bot added 7 commits April 11, 2026 21:23
@thepagent (Collaborator) left a comment

LGTM — all review feedback addressed. Clean opt-in STT feature, reuses shared HTTP client, proper MIME passthrough, startup validation, and correct initialization order.

@thepagent merged commit 45f8b9d into openabdev:main on Apr 11, 2026
1 check passed
Reese-max pushed a commit to Reese-max/openab that referenced this pull request on Apr 12, 2026
…ev#225)

* feat: support voice message STT (Speech-to-Text) for Discord

Add optional STT support that transcribes Discord voice message
attachments (audio/ogg) via any OpenAI-compatible /audio/transcriptions
endpoint and injects the transcript into the ACP prompt as text.

- New src/stt.rs: ~50-line module calling POST /audio/transcriptions
- New SttConfig in config.rs: enabled, api_key, model, base_url
- discord.rs: detect audio/* attachments, download, transcribe, inject
- Defaults to Groq free tier (whisper-large-v3-turbo)
- Supports any OpenAI-compatible endpoint via base_url (Groq, OpenAI,
  local whisper server, etc.)
- Feature is opt-in: disabled by default, zero impact when unconfigured

Closes openabdev#224

* fix: add json feature to reqwest for resp.json() in stt module

* docs: add STT configuration and deployment guide

* fix: address PR review feedback

- Reuse shared HTTP_CLIENT in stt.rs instead of creating per-call client
- Pass actual MIME type from attachment (not hardcoded audio/ogg)
- Fix attachment routing: check audio first, avoid wasted image download
- Add api_key validation at startup (fail fast on empty key)
- Add response_format=json to multipart form (fixes local servers)
- Update docs: clarify api_key requirement, add Technical Notes section

* feat: auto-detect GROQ_API_KEY from env when stt.enabled=true

If stt.enabled = true and api_key is not set in config, openab
automatically checks for GROQ_API_KEY in the environment. This
allows minimal config:

  [stt]
  enabled = true

No api_key line needed if the env var exists.

* fix: only auto-detect GROQ_API_KEY when base_url points to Groq

Prevents leaking Groq API key to unrelated endpoints when user
sets a custom base_url without explicitly setting api_key.

* docs: clarify GROQ_API_KEY auto-detect scope in stt.md

* fix: move STT auto-detect before handler construction

The handler clones stt_config at construction time. Auto-detect
was running after the clone, so the handler never received the
detected api_key. Now auto-detect runs first.

---------

Co-authored-by: openab-bot <openab-bot@users.noreply.github.com>
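The last three fixes in this PR interact at startup: the GROQ_API_KEY fallback applies only when base_url points to Groq, an empty key fails fast, and both must run before the handler copies the config. A minimal sketch of that ordering follows, assuming a simplified SttConfig and Handler. The function names and the "api.groq.com" substring check used to recognize Groq are hypothetical, not openab's actual code.

```rust
/// Hypothetical subset of the [stt] config used at startup.
#[derive(Debug, Clone)]
struct SttConfig {
    enabled: bool,
    api_key: Option<String>,
    base_url: String,
}

/// Stand-in for the Discord handler: it keeps its own copy of the config,
/// so anything resolved after construction would never reach it.
struct Handler {
    stt: SttConfig,
}

/// Resolve and validate the STT config before the handler exists.
/// `env` looks up environment variables; matching "api.groq.com" in
/// base_url is an assumed way to detect the Groq endpoint.
fn prepare_stt(mut cfg: SttConfig, env: impl Fn(&str) -> Option<String>) -> Result<SttConfig, String> {
    if !cfg.enabled {
        return Ok(cfg); // feature off: nothing to resolve or validate
    }
    let key_missing = cfg.api_key.as_deref().map_or(true, str::is_empty);
    // Only fall back to GROQ_API_KEY for Groq, so the key is never sent
    // to a custom base_url the user configured.
    if key_missing && cfg.base_url.contains("api.groq.com") {
        cfg.api_key = env("GROQ_API_KEY");
    }
    // Fail fast at startup instead of on the first voice message.
    match cfg.api_key.as_deref() {
        Some(key) if !key.is_empty() => Ok(cfg),
        _ => Err("stt.enabled = true but no api_key is configured".to_string()),
    }
}

/// Auto-detect and validation run first; only then is the handler built.
fn build_handler(cfg: SttConfig, env: impl Fn(&str) -> Option<String>) -> Result<Handler, String> {
    let stt = prepare_stt(cfg, env)?;
    Ok(Handler { stt })
}
```

In practice env would be |k| std::env::var(k).ok(). Because build_handler only constructs Handler from the already-prepared config, the ordering bug fixed in the final commit cannot recur in this shape.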