A real-time voice AI pipeline engine (ASR → LLM → TTS) with interruption & streaming
Voce is a personal exploration project built with Go, dedicated to researching how to build real-time AI processing systems that are:
- Low-latency
- High-concurrency
- Streaming
Currently, it focuses on the voice dialogue pipeline (ASR → LLM → TTS), but the overall architecture is designed to gradually evolve into:
A general-purpose real-time multimodal orchestration engine
- Real-time voice AI pipeline (ASR → LLM → TTS)
- Interrupt ongoing LLM / TTS instantly (true streaming interruption)
- Independent audio / payload processing (no blocking)
- Built-in backpressure & drop strategy for real-time systems
- Sub-50ms P99 latency under 5000 concurrent sessions
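The built-in drop strategy mentioned above can be sketched as a bounded channel with drop-oldest semantics: when a downstream node lags, the producer evicts the stalest media frame instead of blocking. This is an illustrative sketch, not Voce's actual API; `Frame` and `sendOrDropOldest` are hypothetical names.

```go
package main

import "fmt"

// Frame stands in for an audio packet flowing between DAG nodes.
type Frame struct{ Seq int }

// sendOrDropOldest illustrates a common real-time drop strategy: when the
// downstream buffer is full, evict the oldest frame rather than blocking
// the producer. Returns true if an eviction happened.
func sendOrDropOldest(ch chan Frame, f Frame) (dropped bool) {
	select {
	case ch <- f:
		return false
	default:
		// Buffer full: discard the stalest frame, then enqueue the new one.
		select {
		case <-ch:
		default:
		}
		ch <- f
		return true
	}
}

func main() {
	ch := make(chan Frame, 2) // tiny buffer to force drops
	drops := 0
	for i := 1; i <= 5; i++ {
		if sendOrDropOldest(ch, Frame{Seq: i}) {
			drops++
		}
	}
	fmt.Println("drops:", drops)       // drops: 3
	fmt.Println("oldest:", (<-ch).Seq) // oldest: 4
	fmt.Println("newest:", (<-ch).Seq) // newest: 5
}
```

For real-time audio this trade-off is usually right: a stale frame is worse than a missing one, and the producer never stalls.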
This project was originally created to solve several practical problems:
- Large latency fluctuations in real-time voice dialogue pipelines
- Congestion or even OOM when downstream nodes slow down
- Difficulty handling interruption cleanly in streaming systems
- Asynchronous calls (e.g., LLM, TTS) may produce stale results
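The stale-result problem can be handled with a per-session generation counter: every interruption bumps the generation, and any async result that started under an older generation is discarded. A minimal sketch (the `Session` type and method names are hypothetical, not Voce's API):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Session tracks a monotonically increasing generation. Every interruption
// bumps it, so async work started before the interrupt becomes stale.
type Session struct{ gen atomic.Int64 }

// Interrupt invalidates all in-flight async calls.
func (s *Session) Interrupt() { s.gen.Add(1) }

// Begin snapshots the generation an async call (LLM, TTS, ...) belongs to.
func (s *Session) Begin() int64 { return s.gen.Load() }

// Accept reports whether a result from generation g is still current.
func (s *Session) Accept(g int64) bool { return s.gen.Load() == g }

func main() {
	var s Session
	g := s.Begin() // an LLM call starts
	s.Interrupt()  // the user barges in before it finishes
	fmt.Println(s.Accept(g))         // false: stale result, drop it
	fmt.Println(s.Accept(s.Begin())) // true: the new call is current
}
```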
Therefore, Voce is best understood as a system-level prototype for exploring:
- How real-time streaming systems are scheduled
- How to reduce allocation and GC jitter in Go
- Whether DAG orchestration is suitable for real-time AI pipelines
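On the allocation question, the standard Go building block is `sync.Pool`: scratch buffers on the hot path are recycled instead of reallocated, which keeps steady-state GC pressure low. A generic sketch (buffer size and names are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// bufPool recycles audio scratch buffers. Reusing them keeps the
// steady-state allocation rate, and therefore GC jitter, near zero
// on the hot path.
var bufPool = sync.Pool{
	// ~100ms of 16kHz / 16-bit mono audio (illustrative size).
	New: func() any { return make([]byte, 0, 3200) },
}

// process borrows a buffer, fills it, and returns the produced length.
func process(sample byte, n int) int {
	buf := bufPool.Get().([]byte)[:0]
	defer bufPool.Put(buf) // return the backing array for reuse
	for i := 0; i < n; i++ {
		buf = append(buf, sample)
	}
	return len(buf)
}

func main() {
	fmt.Println(process(0x7f, 160)) // 160
}
```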
At present, Voce mainly targets pure voice, Socket-based real-time interaction scenarios.
Through Plugin + declarative DAG orchestration, it can implement use cases such as:
- Full-duplex voice conversation: `Socket -> ASR -> Interrupter -> LLM -> TTS -> Socket`
- Real-time simultaneous interpretation: `Socket -> ASR -> Translate -> TTS -> Socket`
Built-in plugins and Socket are mainly designed for dialogue scenarios.
Potential directions for future exploration (no implementation guaranteed):
- WebRTC transport plugin (real-time audio & video access based on RTC)
- Voice command recognition and emotion detection in real-time dialogue
- A more general real-time orchestration runtime (not limited to dialogue)
👉 The above directions are mainly for exploration and experimentation, and do not constitute a formal roadmap.
Voce currently supports a variety of real-time processing plugins including ASR, LLM, TTS.
For the complete list and configuration, see:
- benchmark: for stress testing
- realtime_voice: full-duplex real-time voice dialogue with LLMs
```
.
├── biz/                 # Session / WebSocket / RESTful
├── internal/
│   ├── engine/          # DAG scheduling & runtime
│   ├── protocol/        # Custom communication protocol
│   ├── schema/          # Data models (Audio / Video / Payload / Signal)
│   ├── plugins/         # Plugin system
│   └── ...
├── pkg/                 # Tools
├── cmd/
│   ├── voce/            # Server entry
│   └── bench/           # Benchmark tool
├── clients/
│   ├── web/             # Web workflow editor
│   └── voce-tui/        # Terminal client
```
Build and run from source:

```bash
git clone https://github.com/wnnce/voce.git && cd voce
make build-all
mkdir -p configs && cp config.yaml.example configs/config.yaml
./bin/voce -c configs/config.yaml
```

Or run with Docker:

```bash
git clone https://github.com/wnnce/voce.git && cd voce
mkdir -p configs && cp config.yaml.example configs/config.yaml
docker-compose up -d
```

Build the terminal client:

```bash
make build-tui
```

For more details, please refer to our Quick Start Guide.
Open localhost:7001 in your browser to orchestrate or modify node configurations.
Run the terminal TUI to experience full-duplex conversation.
```bash
./bin/voce-tui
```

Read-only by default, Copy-on-Write on modification:

```go
mutable := payload.Mutable()
mutable.Set("processed", true)
flow.SendPayload(mutable.ReadOnly())
```

- Object pool
- Reference counting
- Memory reuse
👉 Goal: reduce GC jitter
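Reference counting is what makes copy-on-write cheap under DAG fan-out: each extra downstream edge retains the same immutable payload instead of deep-copying it, and the last release reclaims the buffer. A minimal sketch (names and the "free" step are illustrative, not Voce's implementation):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Payload is an immutable, reference-counted message. Fan-out edges in
// the DAG call Retain instead of deep-copying; the last consumer's
// Release reclaims the buffer (dropped here; pooled in a real engine).
type Payload struct {
	refs atomic.Int32
	Data []byte
}

func NewPayload(data []byte) *Payload {
	p := &Payload{Data: data}
	p.refs.Store(1)
	return p
}

func (p *Payload) Retain() *Payload { p.refs.Add(1); return p }

// Release decrements the count and reports whether this call freed the payload.
func (p *Payload) Release() bool {
	if p.refs.Add(-1) == 0 {
		p.Data = nil // in a real engine: return the buffer to the pool
		return true
	}
	return false
}

func main() {
	p := NewPayload([]byte("pcm frame"))
	p.Retain()               // a second downstream node shares the same bytes
	fmt.Println(p.Release()) // false: one reader left
	fmt.Println(p.Release()) // true: last reader, buffer reclaimed
}
```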
System control signals (e.g., pause) and signaling take precedence over media data.
Slow nodes will trigger packet dropping or cancellation.
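Signal-over-media priority can be expressed in Go with a two-step `select`: drain the control channel first, and fall back to media only when no signal is pending. A sketch of the pattern (channel names are illustrative):

```go
package main

import "fmt"

// nextEvent gives control signals strict priority over media frames
// without starving the media path: each call yields at most one event,
// checking the control channel first.
func nextEvent(control, media chan string) (event string, isControl bool) {
	select {
	case c := <-control:
		return c, true
	default:
	}
	select {
	case c := <-control:
		return c, true
	case m := <-media:
		return m, false
	}
}

func main() {
	control := make(chan string, 1)
	media := make(chan string, 4)
	media <- "frame-1"
	media <- "frame-2"
	control <- "interrupt"

	ev, isCtl := nextEvent(control, media)
	fmt.Println(ev, isCtl) // interrupt true: the signal wins even though media was queued first
	ev, isCtl = nextEvent(control, media)
	fmt.Println(ev, isCtl) // frame-1 false
}
```

The non-blocking first `select` is what creates the priority; a single combined `select` would pick between ready channels pseudo-randomly.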
Environment: MacBook Pro M5 / 24GB RAM
| Users | Duration | Packets | Avg | P95 | P99 | MIN/MAX |
|---|---|---|---|---|---|---|
| 10 | 30s | 5,990 | 1 ms | 2 ms | 2 ms | 0 / 6 ms |
| 500 | 30s | 296,200 | 2 ms | 3 ms | 4 ms | 0 / 12 ms |
| 1000 | 1m | 1,185,200 | 2 ms | 5 ms | 7 ms | 0 / 30 ms |
| 2000 | 1m | 2,342,000 | 4 ms | 7 ms | 17 ms | 0 / 45 ms |
| 5000 | 1m | 5,637,000 | 4 ms | 11 ms | 32 ms | 0 / 61 ms |
👉 Memory usage around 300 MB, with stable GC pauses
This project is a personal engineering exploration of real-time AI systems.
- Core design and architecture are relatively stable
- Not intended for production use at this stage
- No explicit roadmap or ongoing maintenance commitment
- Copy-on-write with reference counting works extremely well in DAG fan-out scenarios
- Control signals (e.g., interruption) take precedence over media data, which is critical for real-time interaction
- Backpressure is a fundamental capability for long-lived streaming systems, not an optional optimization
- Reducing allocation in Go greatly improves tail latency stability
- Key Features
- Plugin Development
- Quick Start
- Integration Protocol
- Built-in Plugins List
- Benchmark Guide
Part of Voce’s design is inspired by the TEN Framework.
In particular, abstracting real-time processing into a graph-based orchestration and decoupling components via structured data streams influenced the early design.
This repository represents a redesign and reimplementation based on those experiences.


