feat(voice/tts): add TTS synthesis and voice parameter on message tool #355
Open
emadomedher wants to merge 6 commits into sipeed:main
added 3 commits on February 17, 2026 11:09
Currently Discord, Slack, and Telegram all hardcode *voice.GroqTranscriber
as their transcription dependency. This makes it impossible to swap in a
different STT backend without changing each channel file.
Add a Transcriber interface to pkg/voice/transcriber.go:
type Transcriber interface {
    Transcribe(ctx context.Context, audioFilePath string) (*TranscriptionResponse, error)
    IsAvailable() bool
}
GroqTranscriber already implements this interface (no change to its
implementation). Update Discord, Slack, and Telegram to depend on the
interface instead of the concrete type.
No behaviour change — this is a pure refactor that enables future STT
providers (e.g. local Whisper) to be dropped in without modifying channel
code.
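For context, a minimal sketch of what a channel looks like after this refactor; the field name, the handler, and the Text field on TranscriptionResponse are illustrative, not lifted from the diff:

// Channels now hold the interface, not *voice.GroqTranscriber.
type TelegramChannel struct {
    transcriber voice.Transcriber
}

func (c *TelegramChannel) handleVoiceNote(ctx context.Context, path string) (string, error) {
    if c.transcriber == nil || !c.transcriber.IsAvailable() {
        return "", fmt.Errorf("no transcriber available")
    }
    resp, err := c.transcriber.Transcribe(ctx, path)
    if err != nil {
        return "", err
    }
    return resp.Text, nil // Text field is assumed here
}

Any future backend only has to satisfy the two-method interface; the channel code above never changes.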
Add a Whisper transcription backend that talks to any OpenAI-compatible
/v1/audio/transcriptions endpoint (Faster-Whisper, Whisper.cpp, etc.),
allowing self-hosted, offline speech-to-text without a Groq API key.
Changes:
- pkg/voice/whisper.go: WhisperTranscriber implementing the Transcriber
interface. Sends audio as multipart/form-data to the configured API base.
Health-checks via GET /v1/models so IsAvailable() is network-aware.
- pkg/config/config.go: WhisperConfig{Enabled, APIBase} added to
ToolsConfig. Default API base: http://localhost:8200.
- cmd/picoclaw/main.go: Whisper is tried first when enabled; falls back to
Groq if Whisper is not reachable. Both attach to Telegram, Discord, and
Slack via the Transcriber interface introduced in the previous commit.
Config example:
"tools": {
"whisper": {
"enabled": true,
"api_base": "http://localhost:8200"
}
}
Depends-on: refactor(voice): introduce Transcriber interface
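A sketch of the multipart upload described above, assuming an apiBase string field on WhisperTranscriber and the stdlib HTTP client; the actual whisper.go may structure this differently:

import (
    "bytes"
    "context"
    "encoding/json"
    "io"
    "mime/multipart"
    "net/http"
    "os"
    "path/filepath"
)

func (w *WhisperTranscriber) Transcribe(ctx context.Context, audioFilePath string) (*TranscriptionResponse, error) {
    f, err := os.Open(audioFilePath)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    // Build the multipart body expected by /v1/audio/transcriptions.
    var buf bytes.Buffer
    mw := multipart.NewWriter(&buf)
    part, err := mw.CreateFormFile("file", filepath.Base(audioFilePath))
    if err != nil {
        return nil, err
    }
    if _, err := io.Copy(part, f); err != nil {
        return nil, err
    }
    mw.WriteField("model", "whisper-1") // model name is illustrative
    mw.Close()

    req, err := http.NewRequestWithContext(ctx, http.MethodPost,
        w.apiBase+"/v1/audio/transcriptions", &buf)
    if err != nil {
        return nil, err
    }
    req.Header.Set("Content-Type", mw.FormDataContentType())

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    var out TranscriptionResponse
    if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
        return nil, err
    }
    return &out, nil
}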
Adds outbound voice capability: the agent can now reply with audio by
setting voice=true on the message tool, useful when the user sends a voice
message or explicitly requests audio.
Changes:
- pkg/voice/synthesizer.go: Synthesizer interface (Synthesize, IsAvailable)
- pkg/voice/kokoro.go: KokoroSynthesizer — talks to any OpenAI-compatible
/v1/audio/speech endpoint (Kokoro, Piper, etc.). Health-check via GET
/v1/models. Returns a temp .mp3 path; caller cleans up.
- pkg/bus/types.go: add Media []string to OutboundMessage (backward-
compatible, omitempty). Enables any channel to receive file paths.
- pkg/channels/manager.go: add SendFileToChannel() — synchronous media
send that routes local file paths through the channel's Send().
- pkg/tools/message.go: add voice=true parameter + SynthesizeCallback +
SendMediaCallback. Voice path: synthesize → send file → cleanup.
Falls back to text if TTS unavailable. HasSentInRound fires for both.
- pkg/agent/loop.go: add SetVoiceCallbacks() to attach TTS to message tool
after channel manager is available.
- cmd/picoclaw/main.go: wire Kokoro TTS after channels init; attaches to
message tool via SetVoiceCallbacks().
Config example:
"tools": {
"tts": {
"enabled": true,
"api_base": "http://localhost:8100",
"voice": "en_us-lessac-medium"
}
}
Depends-on: feat(voice/stt): add local Whisper STT provider
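From the description, the new interface presumably mirrors Transcriber, with the return value being the temp audio file path; the exact signature in synthesizer.go may differ:

type Synthesizer interface {
    // Synthesize converts text to speech and returns the path of a
    // temporary audio file; the caller is responsible for removing it.
    Synthesize(ctx context.Context, text string) (string, error)
    IsAvailable() bool
}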
added 3 commits on February 17, 2026 11:21
Expand TTSConfig with model, format, and speed so the agent's voice is
fully configurable from config.json without touching code.
- TTSConfig gains: Model, Format, Speed fields (all env-overridable)
- KokoroSynthesizer: add TTSProfile struct + NewKokoroSynthesizerFromProfile()
NewKokoroSynthesizer() is kept as a convenience wrapper (backward-compat)
- kokoroRequest: pass format and speed through to the API
- Temp file extension follows configured format (mp3/wav/ogg/etc.)
- main.go: wire all profile fields from config
- config.example.json: updated with model/format/speed examples
Full profile example:
"tts": {
"enabled": true,
"api_base": "http://localhost:8100",
"voice": "af_nova",
"model": "kokoro",
"format": "mp3",
"speed": 1.0
}
Chatterbox exposes a /synthesize endpoint alongside the standard
/v1/audio/speech one. The native endpoint adds two parameters unavailable
in the OpenAI-compatible API:
- exaggeration (0.0–1.0): emotional expressiveness of the voice
- cfg_weight (0.0–1.0): how closely the voice follows the prompt
Routing: when model starts with 'chatterbox' (case-insensitive), Synthesize()
posts to /synthesize with the Chatterbox body; otherwise it uses the standard
/v1/audio/speech path. All other backends are unaffected.
Changes:
- kokoro.go: chatterboxRequest struct, isChatterbox() helper, Synthesize()
branching logic, exaggeration/cfgWeight fields on KokoroSynthesizer
- TTSProfile: Exaggeration + CFGWeight fields (defaults: 0.5 / 0.5)
- config.go: TTSConfig gains Exaggeration + CFGWeight (env-overridable)
- main.go: wire new fields through TTSProfile
- config.example.json: document exaggeration + cfg_weight
Chatterbox config example:
"tts": {
"enabled": true,
"api_base": "http://localhost:8100",
"model": "chatterbox-1",
"voice": "default",
"format": "mp3",
"exaggeration": 0.5,
"cfg_weight": 0.5
}
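A sketch of that routing; the struct and helper names come from the commit, while the bodies and the Chatterbox JSON field names (beyond exaggeration and cfg_weight, which the commit documents) are inferred:

func (k *KokoroSynthesizer) isChatterbox() bool {
    return strings.HasPrefix(strings.ToLower(k.model), "chatterbox")
}

type chatterboxRequest struct {
    Input        string  `json:"input"`
    Voice        string  `json:"voice"`
    Exaggeration float64 `json:"exaggeration"` // 0.0–1.0: emotional expressiveness
    CFGWeight    float64 `json:"cfg_weight"`   // 0.0–1.0: prompt adherence
}

func (k *KokoroSynthesizer) endpoint() string {
    if k.isChatterbox() {
        return k.apiBase + "/synthesize" // Chatterbox-native endpoint
    }
    return k.apiBase + "/v1/audio/speech" // OpenAI-compatible default
}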
/v1/models returns 404 on Chatterbox — use /health instead. All other backends keep using /v1/models.
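One way to express that split; only the endpoint choice is from the commit, the surrounding code is a guess:

func (k *KokoroSynthesizer) IsAvailable() bool {
    path := "/v1/models"
    if k.isChatterbox() {
        path = "/health" // Chatterbox returns 404 on /v1/models
    }
    resp, err := http.Get(k.apiBase + path)
    if err != nil {
        return false
    }
    defer resp.Body.Close()
    return resp.StatusCode == http.StatusOK
}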
Summary

Adds outbound voice capability: the agent can reply with an audio message by passing voice=true to the message tool. This is useful when the user sends a voice message and expects a voice reply, or explicitly asks for audio.

The implementation is provider-agnostic — any server that speaks OpenAI's /v1/audio/speech API works (Kokoro, Piper, Coqui, OpenAI, etc.).

Changes

- pkg/voice/synthesizer.go: Synthesizer interface (Synthesize, IsAvailable)
- pkg/voice/kokoro.go: KokoroSynthesizer — OpenAI-compatible /v1/audio/speech client. Returns a temp .mp3 path; caller cleans up.
- pkg/bus/types.go: add Media []string to OutboundMessage (backward-compatible, omitempty)
- pkg/channels/manager.go: add SendFileToChannel() — routes local file paths through a channel's Send()
- pkg/tools/message.go: add voice=true parameter + SynthesizeCallback + SendMediaCallback. Graceful fallback to text if TTS unavailable. HasSentInRound fires for both text and voice. (See the sketch after the config example below.)
- pkg/agent/loop.go: add SetVoiceCallbacks() to attach TTS to the message tool after channel manager init
- cmd/picoclaw/main.go: wire Kokoro after channels init

Config

"tools": {
  "tts": {
    "enabled": true,
    "api_base": "http://localhost:8100",
    "voice": "en_us-lessac-medium"
  }
}
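For orientation, a minimal sketch of the voice path in pkg/tools/message.go; the callback names come from this PR, while the control flow and the sendText helper are hypothetical:

func (t *MessageTool) deliver(ctx context.Context, channel, chatID, content string, voice bool) error {
    if voice && t.synthesizeCallback != nil && t.sendMediaCallback != nil {
        if audioPath, err := t.synthesizeCallback(ctx, content); err == nil {
            defer os.Remove(audioPath) // temp file: clean up after sending
            return t.sendMediaCallback(ctx, channel, chatID, audioPath)
        }
        // TTS unavailable or failed: fall through to plain text.
    }
    return t.sendText(ctx, channel, chatID, content) // hypothetical helper
}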
Agent usage

The model simply sets voice=true in the message tool call:

{ "tool": "message", "content": "Sure, here's your answer.", "voice": true }

Depends on
➡️ #353 + #354 (Transcriber interface + Whisper) — shares the same voice pipeline foundation.
Tests

All existing tests pass (go test ./...).