Summary
Development on VoiceOps has stalled due to end-to-end latency that prevents genuine conversational interaction. The current pipeline takes several seconds from when the user finishes speaking to when a response begins playing back — far outside the threshold required for natural, phone-quality conversation.
Problem
The round-trip latency between human speech and agent response is not conversational. A typical human conversation tolerates roughly 200-300ms of delay. The current pipeline exceeds this by a significant margin, making real-time voice interaction feel robotic and unnatural.
Observed behavior: Multi-second delay between end of user speech and start of agent response.
Expected behavior: Near-instantaneous response onset, consistent with phone-call quality dialogue.
Why This Matters
Solving this unlocks something significant. VoiceOps is, to our knowledge, the first full-duplex voice pipeline built for an open agent platform like OpenClaw. If latency can be brought into a conversational range, this becomes a genuinely novel capability — the ability to speak with autonomous agents the same way you speak with a person on the phone.
Commercial solutions (e.g. ChatGPT Advanced Voice) have solved this, but with proprietary infrastructure, custom hardware, and large engineering teams. The open-source equivalent does not yet exist at this quality level.
Known Bottlenecks (for investigation)
- ASR (Whisper): Transcription latency after voice activity detection ends
- LLM inference: Time-to-first-token from the agent backend
- TTS (kokoro-js): Synthesis time before audio playback begins
- Streaming: Whether each stage is streamed or batched end-to-end
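Before optimizing, it helps to know how the multi-second delay is distributed across the four stages above. A minimal sketch of per-stage instrumentation, assuming the pipeline is a chain of async stage functions — the `timed` wrapper and the commented-out stage names are illustrative, not actual VoiceOps APIs:

```typescript
// Illustrative per-stage timing wrapper (not a VoiceOps API).
// Wraps an async pipeline stage and records its wall-clock duration.

type Stage<I, O> = (input: I) => Promise<O>;

function timed<I, O>(
  name: string,
  stage: Stage<I, O>,
  log: Record<string, number>, // collects per-stage durations in ms
): Stage<I, O> {
  return async (input: I) => {
    const start = performance.now();
    const output = await stage(input);
    log[name] = performance.now() - start;
    return output;
  };
}

// Hypothetical usage: wrap each stage, run one conversational turn,
// then inspect `log` to see where the seconds actually go.
// const log: Record<string, number> = {};
// const transcribe = timed("asr", runWhisper, log);
// const respond   = timed("llm", runAgent, log);
// const speak     = timed("tts", runKokoro, log);
```

Note that once streaming is in place, time-to-first-token (LLM) and time-to-first-audio-chunk (TTS) matter more to perceived latency than total stage duration, so those are worth logging separately.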
Goal
Identify and reduce latency at each stage of the pipeline so that the perceived response delay feels conversational. Streaming optimizations, model quantization, VAD tuning, and parallel pipeline stages are all worth exploring.
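One of the streaming optimizations mentioned above can be sketched concretely: rather than waiting for the complete LLM response before starting synthesis, flush text to TTS at sentence boundaries so playback can begin after the first sentence. This is a sketch under the assumption that the agent backend exposes its output as an async iterable of tokens; the function name and the punctuation-based chunking heuristic are illustrative, not part of VoiceOps:

```typescript
// Illustrative sentence-boundary chunker for pipelining LLM output into TTS.
// Buffers streamed tokens and yields a chunk whenever a sentence-ending
// punctuation mark is followed by whitespace, then flushes any remainder.

async function* sentenceChunks(
  tokens: AsyncIterable<string>,
): AsyncGenerator<string> {
  let buffer = "";
  for await (const token of tokens) {
    buffer += token;
    // Split at the first ". ", "! ", or "? " boundary seen so far.
    const match = buffer.match(/^(.*?[.!?])\s+(.*)$/s);
    if (match) {
      yield match[1];
      buffer = match[2];
    }
  }
  if (buffer.trim()) yield buffer.trim();
}
```

Feeding each yielded chunk to kokoro-js as it arrives would overlap synthesis of sentence n with generation of sentence n+1, trading a small risk of awkward prosody at chunk boundaries for a large reduction in time-to-first-audio.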
If anyone has experience with low-latency voice pipelines or has solved similar bottlenecks in open-source projects, contributions and suggestions are welcome.