feat(voice/tts): add TTS synthesis and voice parameter on message tool #355

Open

emadomedher wants to merge 6 commits into sipeed:main from emadomedher:feat/tts-voice-output

Conversation

@emadomedher

Summary

Adds outbound voice capability: the agent can reply with an audio message by passing voice=true to the message tool. This is useful when the user sends a voice message and expects a voice reply, or explicitly asks for audio.

The implementation is provider-agnostic — any server that speaks OpenAI's /v1/audio/speech API works (Kokoro, Piper, Coqui, OpenAI, etc.).
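
For reference, a minimal sketch of what that request looks like from Go, assuming the standard OpenAI speech-API field names (model, input, voice, response_format) and the local server address used in the config below:

  package main

  import (
      "bytes"
      "encoding/json"
      "io"
      "net/http"
      "os"
  )

  func main() {
      // Assumed local TTS server; any /v1/audio/speech-compatible backend works.
      apiBase := "http://localhost:8100"
      body, _ := json.Marshal(map[string]any{
          "model":           "kokoro",
          "input":           "Sure, here's your answer.",
          "voice":           "en_us-lessac-medium",
          "response_format": "mp3",
      })
      resp, err := http.Post(apiBase+"/v1/audio/speech", "application/json", bytes.NewReader(body))
      if err != nil {
          panic(err)
      }
      defer resp.Body.Close()
      out, _ := os.Create("reply.mp3")
      defer out.Close()
      io.Copy(out, resp.Body) // the response body is the raw audio stream
  }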

Changes

  • pkg/voice/synthesizer.go: Synthesizer interface (Synthesize, IsAvailable); a sketch follows this list
  • pkg/voice/kokoro.go: KokoroSynthesizer — OpenAI-compatible /v1/audio/speech client. Returns a temp .mp3 path; caller cleans up.
  • pkg/bus/types.go: add Media []string to OutboundMessage (backward-compatible, omitempty)
  • pkg/channels/manager.go: add SendFileToChannel() — routes local file paths through a channel's Send()
  • pkg/tools/message.go: add voice=true parameter + SynthesizeCallback + SendMediaCallback. Graceful fallback to text if TTS unavailable. HasSentInRound fires for both text and voice.
  • pkg/agent/loop.go: add SetVoiceCallbacks() to attach TTS to the message tool after channel manager init
  • cmd/picoclaw/main.go: wire Kokoro after channels init
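
The exact method signatures aren't quoted in this summary; mirroring the Transcriber interface shown in the commit messages below, the Synthesizer presumably looks roughly like this (the return value is inferred from "returns a temp .mp3 path; caller cleans up"):

  package voice

  import "context"

  // Sketch only: method set from the PR description, signatures assumed.
  type Synthesizer interface {
      // Synthesize renders text to speech and returns the path of a
      // temporary audio file; the caller is responsible for removing it.
      Synthesize(ctx context.Context, text string) (string, error)
      IsAvailable() bool
  }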

Config

"tools": {
  "tts": {
    "enabled": true,
    "api_base": "http://localhost:8100",
    "voice": "en_us-lessac-medium"
  }
}
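
A plausible Go shape for this config section, inferred from the JSON keys above (field names and tags are assumptions; the real struct lives in pkg/config/config.go):

  package config

  // Hypothetical sketch of the TTS config block.
  type TTSConfig struct {
      Enabled bool   `json:"enabled"`
      APIBase string `json:"api_base"`
      Voice   string `json:"voice"`
  }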

Agent usage

The model simply sets voice=true in the message tool call:

{ "tool": "message", "content": "Sure, here's your answer.", "voice": true }

Depends on

➡️ #353 and #354 (Transcriber interface + Whisper): this PR builds on the same voice-pipeline foundation.

Tests

All existing tests pass (go test ./...).

Myka added 3 commits February 17, 2026 11:09

refactor(voice): introduce Transcriber interface

Currently Discord, Slack, and Telegram all hardcode *voice.GroqTranscriber
as their transcription dependency. This makes it impossible to swap in a
different STT backend without changing each channel file.

Add a Transcriber interface to pkg/voice/transcriber.go:

  type Transcriber interface {
      Transcribe(ctx context.Context, audioFilePath string) (*TranscriptionResponse, error)
      IsAvailable() bool
  }

GroqTranscriber already implements this interface (no change to its
implementation). Update Discord, Slack, and Telegram to depend on the
interface instead of the concrete type.

No behaviour change — this is a pure refactor that enables future STT
providers (e.g. local Whisper) to be dropped in without modifying channel
code.
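
As a hedged illustration of what the refactor enables (type, field, and import path here are hypothetical):

  package channels

  import "picoclaw/pkg/voice" // module path assumed

  // Before: transcriber *voice.GroqTranscriber (concrete, Groq-only).
  // After: any Transcriber implementation can be injected.
  type TelegramChannel struct {
      transcriber voice.Transcriber
  }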

feat(voice/stt): add local Whisper STT provider

Add a Whisper transcription backend that talks to any OpenAI-compatible
/v1/audio/transcriptions endpoint (Faster-Whisper, Whisper.cpp, etc.),
allowing self-hosted, offline speech-to-text without a Groq API key.

Changes:
- pkg/voice/whisper.go: WhisperTranscriber implementing the Transcriber
  interface. Sends audio as multipart/form-data to the configured API base
  (sketched after this list). Health-checks via GET /v1/models so
  IsAvailable() is network-aware.
- pkg/config/config.go: WhisperConfig{Enabled, APIBase} added to
  ToolsConfig. Default API base: http://localhost:8200.
- cmd/picoclaw/main.go: Whisper is tried first when enabled; falls back to
  Groq if Whisper is not reachable. Both attach to Telegram, Discord, and
  Slack via the Transcriber interface introduced in the previous commit.
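
A minimal sketch of that multipart request, assuming the standard OpenAI transcription field names (file, model); error handling is trimmed and the model value is a placeholder:

  package voice

  import (
      "bytes"
      "context"
      "io"
      "mime/multipart"
      "net/http"
      "os"
      "path/filepath"
  )

  func transcribeRequest(ctx context.Context, apiBase, audioPath string) (*http.Request, error) {
      f, err := os.Open(audioPath)
      if err != nil {
          return nil, err
      }
      defer f.Close()

      var buf bytes.Buffer
      w := multipart.NewWriter(&buf)
      part, _ := w.CreateFormFile("file", filepath.Base(audioPath))
      io.Copy(part, f)
      w.WriteField("model", "whisper-1") // placeholder; many servers ignore it
      w.Close() // finalize the multipart boundary before sending

      req, err := http.NewRequestWithContext(ctx, http.MethodPost,
          apiBase+"/v1/audio/transcriptions", &buf)
      if err != nil {
          return nil, err
      }
      req.Header.Set("Content-Type", w.FormDataContentType())
      return req, nil
  }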

Config example:
  "tools": {
    "whisper": {
      "enabled": true,
      "api_base": "http://localhost:8200"
    }
  }

Depends-on: refactor(voice): introduce Transcriber interface

feat(voice/tts): add TTS synthesis and voice parameter on message tool

Adds outbound voice capability: the agent can now reply with audio by
setting voice=true on the message tool, useful when the user sends a voice
message or explicitly requests audio.

Changes:
- pkg/voice/synthesizer.go: Synthesizer interface (Synthesize, IsAvailable)
- pkg/voice/kokoro.go: KokoroSynthesizer — talks to any OpenAI-compatible
  /v1/audio/speech endpoint (Kokoro, Piper, etc.). Health-check via GET
  /v1/models. Returns a temp .mp3 path; caller cleans up.
- pkg/bus/types.go: add Media []string to OutboundMessage (backward-
  compatible, omitempty). Enables any channel to receive file paths.
- pkg/channels/manager.go: add SendFileToChannel() — synchronous media
  send that routes local file paths through the channel's Send().
- pkg/tools/message.go: add voice=true parameter + SynthesizeCallback +
  SendMediaCallback. Voice path: synthesize → send file → cleanup (sketched
  after this list). Falls back to text if TTS unavailable. HasSentInRound
  fires for both text and voice.
- pkg/agent/loop.go: add SetVoiceCallbacks() to attach TTS to message tool
  after channel manager is available.
- cmd/picoclaw/main.go: wire Kokoro TTS after channels init; attaches to
  message tool via SetVoiceCallbacks().
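
A sketch of that voice path (callback names follow the description; the signatures and the surrounding sendText plumbing are assumptions):

  package tools

  import (
      "context"
      "os"
  )

  // Signatures assumed from the description above.
  var (
      SynthesizeCallback func(ctx context.Context, text string) (string, error)
      SendMediaCallback  func(ctx context.Context, path string) error
      sendText           func(ctx context.Context, text string) error // hypothetical
  )

  func send(ctx context.Context, content string, voice bool) error {
      if voice && SynthesizeCallback != nil {
          audioPath, err := SynthesizeCallback(ctx, content)
          if err == nil {
              defer os.Remove(audioPath) // synthesize → send file → cleanup
              return SendMediaCallback(ctx, audioPath)
          }
          // graceful fallback: TTS failed or unavailable, send text instead
      }
      return sendText(ctx, content)
  }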

Config example:
  "tools": {
    "tts": {
      "enabled": true,
      "api_base": "http://localhost:8100",
      "voice": "en_us-lessac-medium"
    }
  }

Depends-on: feat(voice/stt): add local Whisper STT provider
Myka added 3 commits February 17, 2026 11:21

Expand TTSConfig with model, format, and speed so the agent's voice is
fully configurable from config.json without touching code.

- TTSConfig gains: Model, Format, Speed fields (all env-overridable)
- KokoroSynthesizer: add TTSProfile struct (sketched at the end of this
  message) + NewKokoroSynthesizerFromProfile(); NewKokoroSynthesizer() is
  kept as a convenience wrapper (backward-compat)
- kokoroRequest: pass format and speed through to the API
- Temp file extension follows configured format (mp3/wav/ogg/etc.)
- main.go: wire all profile fields from config
- config.example.json: updated with model/format/speed examples

Full profile example:
  "tts": {
    "enabled": true,
    "api_base": "http://localhost:8100",
    "voice":   "af_nova",
    "model":   "kokoro",
    "format":  "mp3",
    "speed":   1.0
  }
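
The profile presumably mirrors those config keys; a sketch with assumed field types (the next commit extends it with Exaggeration and CFGWeight):

  package voice

  // Sketch only: fields inferred from the config example above.
  type TTSProfile struct {
      APIBase string
      Voice   string
      Model   string
      Format  string  // mp3/wav/ogg/...; the temp file extension follows this
      Speed   float64
  }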

Chatterbox exposes a /synthesize endpoint alongside the standard
/v1/audio/speech one. The native endpoint adds two parameters unavailable
in the OpenAI-compatible API:
  - exaggeration (0.0–1.0): emotional expressiveness of the voice
  - cfg_weight  (0.0–1.0): how closely the voice follows the prompt

Routing: when model starts with 'chatterbox' (case-insensitive), Synthesize()
posts to /synthesize with the Chatterbox body; otherwise it uses the standard
/v1/audio/speech path. All other backends are unaffected.
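
The routing check described above can be as simple as a prefix match (a sketch; the actual helper lives in kokoro.go):

  package voice

  import "strings"

  // isChatterbox reports whether the configured model routes to the native
  // /synthesize endpoint (case-insensitive prefix match on the model name).
  func isChatterbox(model string) bool {
      return strings.HasPrefix(strings.ToLower(model), "chatterbox")
  }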

Changes:
- kokoro.go: chatterboxRequest struct, isChatterbox() helper, Synthesize()
  branching logic, exaggeration/cfgWeight fields on KokoroSynthesizer
- TTSProfile: Exaggeration + CFGWeight fields (defaults: 0.5 / 0.5)
- config.go: TTSConfig gains Exaggeration + CFGWeight (env-overridable)
- main.go: wire new fields through TTSProfile
- config.example.json: document exaggeration + cfg_weight

Chatterbox config example:
  "tts": {
    "enabled": true,
    "api_base": "http://localhost:8100",
    "model":    "chatterbox-1",
    "voice":    "default",
    "format":   "mp3",
    "exaggeration": 0.5,
    "cfg_weight":   0.5
  }

/v1/models returns 404 on Chatterbox; use /health instead.
All other backends keep using /v1/models.
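
A sketch of the resulting health check (the endpoint split comes from the message above; struct fields and everything else are assumed, and isChatterbox is the helper sketched earlier):

  package voice

  import "net/http"

  // Minimal stand-in; the real KokoroSynthesizer has more fields.
  type KokoroSynthesizer struct {
      apiBase string
      model   string
  }

  func (s *KokoroSynthesizer) healthPath() string {
      if isChatterbox(s.model) {
          return "/health" // Chatterbox: /v1/models returns 404
      }
      return "/v1/models"
  }

  func (s *KokoroSynthesizer) IsAvailable() bool {
      resp, err := http.Get(s.apiBase + s.healthPath())
      if err != nil {
          return false
      }
      defer resp.Body.Close()
      return resp.StatusCode == http.StatusOK
  }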
