
Conversation

@Robinbinu
Contributor

Summary

Implements a WebSocket endpoint for real-time text-to-speech streaming, enabling bidirectional communication and support for multiple concurrent users. Includes a complete demo client and an enhanced web UI with mode switching.

Key Features

WebSocket Endpoint (/ws)

  • Real-time TTS streaming with bidirectional communication
  • Multi-user support - each connection has a dedicated request queue
  • Base64-encoded audio chunks with WAV headers
  • Graceful error handling for unsupported engines
  • Auto-queuing - processes text requests sequentially per connection (see the sketch after this list)
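
A minimal sketch of the idea behind the endpoint, not the actual implementation in async_server.py - the message fields and the synthesize_to_wav_chunks() helper are illustrative placeholders:

```python
import asyncio
import base64

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

async def synthesize_to_wav_chunks(text: str):
    # Placeholder: in the real server this is where the selected TTS engine
    # would stream WAV-framed audio chunks for the given text.
    yield b"\x00" * 3200

@app.websocket("/ws")
async def tts_websocket(websocket: WebSocket):
    await websocket.accept()
    queue: asyncio.Queue[str] = asyncio.Queue()  # dedicated request queue per connection

    async def worker():
        # Process queued texts sequentially for this connection.
        while True:
            text = await queue.get()
            async for chunk in synthesize_to_wav_chunks(text):
                await websocket.send_json({
                    "type": "audio",
                    "data": base64.b64encode(chunk).decode("ascii"),
                })
            await websocket.send_json({"type": "done"})

    task = asyncio.create_task(worker())
    try:
        while True:
            message = await websocket.receive_json()
            if message.get("engine") == "system":
                # pyttsx3 is not thread-safe; reject it with a clear error instead.
                await websocket.send_json({
                    "type": "error",
                    "message": "System engine (pyttsx3) is not supported over WebSocket",
                })
                continue
            await queue.put(message["text"])  # auto-queue incoming requests
    except WebSocketDisconnect:
        pass
    finally:
        task.cancel()
```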

Enhanced Web Interface

  • Mode toggle - switch between HTTP and WebSocket modes
  • Auto-send on typing pause - sends text after 1 second of inactivity
  • Dynamic UI - hides "Speak" button in WebSocket mode
  • Text field auto-clear after sending

WebSocket Client Demo (websocket_client.py)

  • Real-time audio playback using PyAudio
  • Audio file export - save received audio to WAV files
  • CLI support - custom test messages via command line arguments
  • Progress indicators with visual feedback
  • Graceful connection handling and cleanup (a stripped-down sketch follows below)
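
For reference, a stripped-down sketch of what such a client can look like. This is not the shipped websocket_client.py - the server URL, message format, and audio parameters are assumptions, and WAV-header handling and file export are omitted:

```python
import asyncio
import base64
import json

import pyaudio
import websockets

async def main():
    audio = pyaudio.PyAudio()
    # Assumed output format: 16-bit mono PCM at 24 kHz.
    stream = audio.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)

    async with websockets.connect("ws://localhost:8000/ws") as ws:
        await ws.send(json.dumps({"text": "Hello from the WebSocket demo client."}))
        async for raw in ws:
            message = json.loads(raw)
            if message.get("type") == "audio":
                # Chunks arrive base64-encoded; decode and play them as they come in.
                stream.write(base64.b64decode(message["data"]))
            elif message.get("type") == "done":
                break

    stream.stop_stream()
    stream.close()
    audio.terminate()

if __name__ == "__main__":
    asyncio.run(main())
```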

Engine Improvements

  • Dynamic engine initialization - auto-selects the first available engine (sketched after this list)
  • Graceful credential handling - skips engines with missing API keys
  • Engine name tracking - fixes voice retrieval errors
  • System engine protection - prevents WebSocket use with pyttsx3 (not thread-safe)
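
A hedged sketch of the selection logic: the engine class names come from RealtimeTTS, but the constructor arguments and environment variable names below are assumptions, and the real code in async_server.py may differ:

```python
import os

from RealtimeTTS import OpenAIEngine, KokoroEngine, AzureEngine, ElevenlabsEngine, SystemEngine

def init_available_engines():
    """Try each engine in order, keeping only the ones that initialize cleanly."""
    candidates = [
        ("openai", OpenAIEngine, "OPENAI_API_KEY"),
        ("kokoro", KokoroEngine, None),
        ("azure", AzureEngine, "AZURE_SPEECH_KEY"),
        ("elevenlabs", ElevenlabsEngine, "ELEVENLABS_API_KEY"),
        ("system", SystemEngine, None),
    ]
    engines = {}
    for name, engine_cls, env_var in candidates:
        if env_var and not os.environ.get(env_var):
            continue  # graceful credential handling: skip engines with missing API keys
        try:
            engines[name] = engine_cls()
        except Exception:
            continue  # engine not usable in this environment
    return engines

engines = init_available_engines()
# Track engine names so voices can be retrieved for the right engine later,
# and auto-select the first available one as the default.
default_engine_name = next(iter(engines), None)
```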

Technical Details

Engine Compatibility

WebSocket-compatible: OpenAI, Kokoro, Azure, ElevenLabs
Not compatible: System engine (pyttsx3) - displays a clear error message

Dependencies

  • websockets - WebSocket client/server
  • pyaudio - Audio playback in demo client

Files Changed

  • async_server.py - WebSocket endpoint, engine tracking, UI enhancements
  • static/tts.js - WebSocket client logic, mode switching, auto-send
  • websocket_client.py - Python demo client (new)
  • README.md - Updated documentation

Breaking Changes

None - all changes are additive and backward compatible.

Testing

Tested with OpenAI and Kokoro engines. WebSocket mode successfully handles:

  • Multiple concurrent connections
  • Rapid text input with auto-send
  • Engine switching mid-session
  • Connection interruption and recovery

Robinbinu marked this pull request as ready for review on January 11, 2026, 15:37
@Robinbinu
Contributor Author

Robinbinu commented Jan 11, 2026

Hi @KoljaB
Just added WebSocket support to the FastAPI server - I saw in the README that this was a pending feature for handling text chunk by chunk for LLM integration. It's working now!

What's new:

  • /ws endpoint for real-time streaming text input
  • Multiple concurrent users supported
  • Web UI toggle between HTTP/WebSocket modes
  • Auto-send after 1 sec typing pause
  • Demo client included (websocket_client.py)

Perfect for LLM integrations - each user gets their own queue so multiple conversations can run simultaneously. Works with OpenAI, Kokoro, Azure, and ElevenLabs engines.
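
To illustrate the LLM use case, here's a hedged sketch of feeding a streamed LLM response into the endpoint chunk by chunk. The /ws message format and the sentence-boundary heuristic are assumptions, the LLM stream is a stand-in, and reading the audio back is omitted (see the demo client sketch above):

```python
import asyncio
import json

import websockets

async def speak_llm_stream(text_chunks):
    """Forward text chunks (e.g. from a streaming LLM response) to the TTS WebSocket."""
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        buffer = ""
        async for chunk in text_chunks:
            buffer += chunk
            # Send whenever a sentence looks complete; the server queues requests
            # per connection, so playback order is preserved.
            if buffer.rstrip().endswith((".", "!", "?")):
                await ws.send(json.dumps({"text": buffer.strip()}))
                buffer = ""
        if buffer.strip():
            await ws.send(json.dumps({"text": buffer.strip()}))

async def fake_llm_stream():
    # Stand-in for a real streaming LLM response.
    for token in ["Hello! ", "This is ", "a streamed ", "answer."]:
        yield token
        await asyncio.sleep(0.05)

asyncio.run(speak_llm_stream(fake_llm_stream()))
```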

Everything's backward compatible.

@Robinbinu
Contributor Author

Hi @KoljaB ,
Would love your feedback when you get a chance, and feel free to merge whenever you have time!

@KoljaB
Owner

KoljaB commented Jan 12, 2026

Thank you so much. This looks like a great PR. I'm in Spain for the next two months, so can't really look into it for a while.

@Robinbinu
Contributor Author

Hi @KoljaB, please take your time. While testing, I found a few bugs related to handling the different audio formats from different providers. They are fixed in the latest commit #57bbfcb.

