feat(voice/tts): add TTS synthesis and voice parameter on message tool #355

Open

emadomedher wants to merge 6 commits into sipeed:main from emadomedher:feat/tts-voice-output

Conversation

@emadomedher

Summary

Adds outbound voice capability: the agent can reply with an audio message by passing voice=true to the message tool. This is useful when the user sends a voice message and expects a voice reply, or explicitly asks for audio.

The implementation is provider-agnostic — any server that speaks OpenAI's /v1/audio/speech API works (Kokoro, Piper, Coqui, OpenAI, etc.).
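
For reference, a minimal sketch of what that request looks like from Go, assuming the standard OpenAI speech-API field names (model, input, voice, response_format) and the local server address used in the config below:

  package main

  import (
      "bytes"
      "encoding/json"
      "io"
      "net/http"
      "os"
  )

  func main() {
      // Assumed local TTS server; any /v1/audio/speech-compatible backend works.
      apiBase := "http://localhost:8100"
      body, _ := json.Marshal(map[string]any{
          "model":           "kokoro",
          "input":           "Sure, here's your answer.",
          "voice":           "en_us-lessac-medium",
          "response_format": "mp3",
      })
      resp, err := http.Post(apiBase+"/v1/audio/speech", "application/json", bytes.NewReader(body))
      if err != nil {
          panic(err)
      }
      defer resp.Body.Close()
      out, _ := os.Create("reply.mp3")
      defer out.Close()
      io.Copy(out, resp.Body) // the response body is the raw audio stream
  }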

Changes

  • pkg/voice/synthesizer.go: Synthesizer interface (Synthesize, IsAvailable); a sketch follows this list
  • pkg/voice/kokoro.go: KokoroSynthesizer — OpenAI-compatible /v1/audio/speech client. Returns a temp .mp3 path; caller cleans up.
  • pkg/bus/types.go: add Media []string to OutboundMessage (backward-compatible, omitempty)
  • pkg/channels/manager.go: add SendFileToChannel() — routes local file paths through a channel's Send()
  • pkg/tools/message.go: add voice=true parameter + SynthesizeCallback + SendMediaCallback. Graceful fallback to text if TTS unavailable. HasSentInRound fires for both text and voice.
  • pkg/agent/loop.go: add SetVoiceCallbacks() to attach TTS to the message tool after channel manager init
  • cmd/picoclaw/main.go: wire Kokoro after channels init
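
The exact method signatures aren't quoted in this summary; mirroring the Transcriber interface shown in the commit messages below, the Synthesizer presumably looks roughly like this (the return value is inferred from "returns a temp .mp3 path; caller cleans up"):

  package voice

  import "context"

  // Sketch only: method set from the PR description, signatures assumed.
  type Synthesizer interface {
      // Synthesize renders text to speech and returns the path of a
      // temporary audio file; the caller is responsible for removing it.
      Synthesize(ctx context.Context, text string) (string, error)
      IsAvailable() bool
  }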

Config

"tools": {
  "tts": {
    "enabled": true,
    "api_base": "http://localhost:8100",
    "voice": "en_us-lessac-medium"
  }
}
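
A plausible Go shape for this config section, inferred from the JSON keys above (field names and tags are assumptions; the real struct lives in pkg/config/config.go):

  package config

  // Hypothetical sketch of the TTS config block.
  type TTSConfig struct {
      Enabled bool   `json:"enabled"`
      APIBase string `json:"api_base"`
      Voice   string `json:"voice"`
  }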

Agent usage

The model simply sets voice=true in the message tool call:

{ "tool": "message", "content": "Sure, here's your answer.", "voice": true }

Depends on

➡️ #353 and #354 (Transcriber interface + Whisper): this PR builds on the same voice-pipeline foundation.

Tests

All existing tests pass (go test ./...).

Myka added 3 commits February 17, 2026 11:09

refactor(voice): introduce Transcriber interface

Currently Discord, Slack, and Telegram all hardcode *voice.GroqTranscriber
as their transcription dependency. This makes it impossible to swap in a
different STT backend without changing each channel file.

Add a Transcriber interface to pkg/voice/transcriber.go:

  type Transcriber interface {
      Transcribe(ctx context.Context, audioFilePath string) (*TranscriptionResponse, error)
      IsAvailable() bool
  }

GroqTranscriber already implements this interface (no change to its
implementation). Update Discord, Slack, and Telegram to depend on the
interface instead of the concrete type.

No behaviour change — this is a pure refactor that enables future STT
providers (e.g. local Whisper) to be dropped in without modifying channel
code.
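
As a hedged illustration of what the refactor enables (type, field, and import path here are hypothetical):

  package channels

  import "picoclaw/pkg/voice" // module path assumed

  // Before: transcriber *voice.GroqTranscriber (concrete, Groq-only).
  // After: any Transcriber implementation can be injected.
  type TelegramChannel struct {
      transcriber voice.Transcriber
  }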

feat(voice/stt): add local Whisper STT provider

Add a Whisper transcription backend that talks to any OpenAI-compatible
/v1/audio/transcriptions endpoint (Faster-Whisper, Whisper.cpp, etc.),
allowing self-hosted, offline speech-to-text without a Groq API key.

Changes:
- pkg/voice/whisper.go: WhisperTranscriber implementing the Transcriber
  interface. Sends audio as multipart/form-data to the configured API base
  (sketched after this list). Health-checks via GET /v1/models so
  IsAvailable() is network-aware.
- pkg/config/config.go: WhisperConfig{Enabled, APIBase} added to
  ToolsConfig. Default API base: http://localhost:8200.
- cmd/picoclaw/main.go: Whisper is tried first when enabled; falls back to
  Groq if Whisper is not reachable. Both attach to Telegram, Discord, and
  Slack via the Transcriber interface introduced in the previous commit.
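
A minimal sketch of that multipart request, assuming the standard OpenAI transcription field names (file, model); error handling is trimmed and the model value is a placeholder:

  package voice

  import (
      "bytes"
      "context"
      "io"
      "mime/multipart"
      "net/http"
      "os"
      "path/filepath"
  )

  func transcribeRequest(ctx context.Context, apiBase, audioPath string) (*http.Request, error) {
      f, err := os.Open(audioPath)
      if err != nil {
          return nil, err
      }
      defer f.Close()

      var buf bytes.Buffer
      w := multipart.NewWriter(&buf)
      part, _ := w.CreateFormFile("file", filepath.Base(audioPath))
      io.Copy(part, f)
      w.WriteField("model", "whisper-1") // placeholder; many servers ignore it
      w.Close() // finalize the multipart boundary before sending

      req, err := http.NewRequestWithContext(ctx, http.MethodPost,
          apiBase+"/v1/audio/transcriptions", &buf)
      if err != nil {
          return nil, err
      }
      req.Header.Set("Content-Type", w.FormDataContentType())
      return req, nil
  }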

Config example:
  "tools": {
    "whisper": {
      "enabled": true,
      "api_base": "http://localhost:8200"
    }
  }

Depends-on: refactor(voice): introduce Transcriber interface

feat(voice/tts): add TTS synthesis and voice parameter on message tool

Adds outbound voice capability: the agent can now reply with audio by
setting voice=true on the message tool, useful when the user sends a voice
message or explicitly requests audio.

Changes:
- pkg/voice/synthesizer.go: Synthesizer interface (Synthesize, IsAvailable)
- pkg/voice/kokoro.go: KokoroSynthesizer — talks to any OpenAI-compatible
  /v1/audio/speech endpoint (Kokoro, Piper, etc.). Health-check via GET
  /v1/models. Returns a temp .mp3 path; caller cleans up.
- pkg/bus/types.go: add Media []string to OutboundMessage (backward-
  compatible, omitempty). Enables any channel to receive file paths.
- pkg/channels/manager.go: add SendFileToChannel() — synchronous media
  send that routes local file paths through the channel's Send().
- pkg/tools/message.go: add voice=true parameter + SynthesizeCallback +
  SendMediaCallback. Voice path: synthesize → send file → cleanup (sketched
  after this list). Falls back to text if TTS unavailable. HasSentInRound
  fires for both text and voice.
- pkg/agent/loop.go: add SetVoiceCallbacks() to attach TTS to message tool
  after channel manager is available.
- cmd/picoclaw/main.go: wire Kokoro TTS after channels init; attaches to
  message tool via SetVoiceCallbacks().
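
A sketch of that voice path (callback names follow the description; the signatures and the surrounding sendText plumbing are assumptions):

  package tools

  import (
      "context"
      "os"
  )

  // Signatures assumed from the description above.
  var (
      SynthesizeCallback func(ctx context.Context, text string) (string, error)
      SendMediaCallback  func(ctx context.Context, path string) error
      sendText           func(ctx context.Context, text string) error // hypothetical
  )

  func send(ctx context.Context, content string, voice bool) error {
      if voice && SynthesizeCallback != nil {
          audioPath, err := SynthesizeCallback(ctx, content)
          if err == nil {
              defer os.Remove(audioPath) // synthesize → send file → cleanup
              return SendMediaCallback(ctx, audioPath)
          }
          // graceful fallback: TTS failed or unavailable, send text instead
      }
      return sendText(ctx, content)
  }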

Config example:
  "tools": {
    "tts": {
      "enabled": true,
      "api_base": "http://localhost:8100",
      "voice": "en_us-lessac-medium"
    }
  }

Depends-on: feat(voice/stt): add local Whisper STT provider
Myka added 3 commits February 17, 2026 11:21

Expand TTSConfig with model, format, and speed so the agent's voice is
fully configurable from config.json without touching code.

- TTSConfig gains: Model, Format, Speed fields (all env-overridable)
- KokoroSynthesizer: add TTSProfile struct (sketched at the end of this
  message) + NewKokoroSynthesizerFromProfile(); NewKokoroSynthesizer() is
  kept as a convenience wrapper (backward-compat)
- kokoroRequest: pass format and speed through to the API
- Temp file extension follows configured format (mp3/wav/ogg/etc.)
- main.go: wire all profile fields from config
- config.example.json: updated with model/format/speed examples

Full profile example:
  "tts": {
    "enabled": true,
    "api_base": "http://localhost:8100",
    "voice":   "af_nova",
    "model":   "kokoro",
    "format":  "mp3",
    "speed":   1.0
  }
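
The profile presumably mirrors those config keys; a sketch with assumed field types (the next commit extends it with Exaggeration and CFGWeight):

  package voice

  // Sketch only: fields inferred from the config example above.
  type TTSProfile struct {
      APIBase string
      Voice   string
      Model   string
      Format  string  // mp3/wav/ogg/...; the temp file extension follows this
      Speed   float64
  }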

Chatterbox exposes a /synthesize endpoint alongside the standard
/v1/audio/speech one. The native endpoint adds two parameters unavailable
in the OpenAI-compatible API:
  - exaggeration (0.0–1.0): emotional expressiveness of the voice
  - cfg_weight  (0.0–1.0): how closely the voice follows the prompt

Routing: when model starts with 'chatterbox' (case-insensitive), Synthesize()
posts to /synthesize with the Chatterbox body; otherwise it uses the standard
/v1/audio/speech path. All other backends are unaffected.
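
The routing check described above can be as simple as a prefix match (a sketch; the actual helper lives in kokoro.go):

  package voice

  import "strings"

  // isChatterbox reports whether the configured model routes to the native
  // /synthesize endpoint (case-insensitive prefix match on the model name).
  func isChatterbox(model string) bool {
      return strings.HasPrefix(strings.ToLower(model), "chatterbox")
  }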

Changes:
- kokoro.go: chatterboxRequest struct, isChatterbox() helper, Synthesize()
  branching logic, exaggeration/cfgWeight fields on KokoroSynthesizer
- TTSProfile: Exaggeration + CFGWeight fields (defaults: 0.5 / 0.5)
- config.go: TTSConfig gains Exaggeration + CFGWeight (env-overridable)
- main.go: wire new fields through TTSProfile
- config.example.json: document exaggeration + cfg_weight

Chatterbox config example:
  "tts": {
    "enabled": true,
    "api_base": "http://localhost:8100",
    "model":    "chatterbox-1",
    "voice":    "default",
    "format":   "mp3",
    "exaggeration": 0.5,
    "cfg_weight":   0.5
  }

/v1/models returns 404 on Chatterbox; use /health instead.
All other backends keep using /v1/models.
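
A sketch of the resulting health check (the endpoint split comes from the message above; struct fields and everything else are assumed, and isChatterbox is the helper sketched earlier):

  package voice

  import "net/http"

  // Minimal stand-in; the real KokoroSynthesizer has more fields.
  type KokoroSynthesizer struct {
      apiBase string
      model   string
  }

  func (s *KokoroSynthesizer) healthPath() string {
      if isChatterbox(s.model) {
          return "/health" // Chatterbox: /v1/models returns 404
      }
      return "/v1/models"
  }

  func (s *KokoroSynthesizer) IsAvailable() bool {
      resp, err := http.Get(s.apiBase + s.healthPath())
      if err != nil {
          return false
      }
      defer resp.Body.Close()
      return resp.StatusCode == http.StatusOK
  }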
