⭐ Star to follow updates & roadmap
SIP-to-AI — stream RTP audio from FreeSWITCH / OpenSIPS / Asterisk directly to end-to-end realtime voice models:
- ✅ OpenAI Realtime API (gpt-realtime GA)
- ✅ Deepgram Voice Agent
- 🔜 Gemini Live (coming soon)
Simple passthrough bridge: SIP (G.711 μ-law @ 8kHz) ↔ AI voice models with native codec support, no resampling needed.
Prerequisites: Python 3.12+, the uv package manager
Pure Python, No External Dependencies: This project uses a pure Python asyncio implementation of SIP+RTP. No C libraries or compilation required!
- Install dependencies:

  ```bash
  git clone <repository-url>
  cd sip-to-ai
  uv venv && source .venv/bin/activate
  uv sync
  ```
- Configure environment:

  ```bash
  cp .env.example .env
  ```

  Edit `.env` with your OpenAI API key:

  ```bash
  # AI Service
  AI_VENDOR=openai
  OPENAI_API_KEY=sk-proj-your-key-here
  OPENAI_MODEL=gpt-realtime

  # Agent prompt
  AGENT_PROMPT_FILE=agent_prompt.yaml

  # SIP Settings (userless account - receive only)
  SIP_DOMAIN=192.168.1.100
  SIP_TRANSPORT_TYPE=udp
  SIP_PORT=6060
  ```

  Optional: Create `agent_prompt.yaml` for a custom agent personality:

  ```yaml
  instructions: |
    You are a helpful AI assistant. Be concise and friendly.
  greeting: "Hello! How can I help you today?"
  ```
- Run the server:

  ```bash
  uv run python -m app.main
  ```

  The server will listen on `SIP_DOMAIN:SIP_PORT` for incoming calls. Each call creates an independent OpenAI Realtime WebSocket connection.
- Make a test call:

  ```bash
  # From FreeSWITCH/Asterisk, dial to the bridge IP:port
  # Or use a SIP softphone to call sip:192.168.1.100:6060
  ```
```mermaid
graph LR
    SIP[Pure Asyncio SIP+RTP<br/>G.711 @ 8kHz] <--> AA[AudioAdapter<br/>Codec Only]
    AA <--> AI[AI WebSocket<br/>G.711 μ-law @ 8kHz]
```
Design Philosophy: Minimal client logic. The bridge is a transparent audio pipe:
- Pure Python asyncio: No GIL issues, no C dependencies
- Codec conversion only: PCM16 ↔ G.711 μ-law (same 8kHz, no resampling); see the sketch after this list
- Precise 20ms timing: Using `asyncio.sleep()` with drift correction
- Structured concurrency: All tasks managed with `asyncio.TaskGroup`
- No client-side VAD/barge-in: AI models handle all voice activity detection
- No jitter buffer: AI services provide pre-buffered audio
- Connection management: WebSocket lifecycle and reconnection
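Because both legs already run at 8kHz, that codec step is a straight per-frame μ-law encode/decode. A minimal sketch, assuming the stdlib `audioop` module (deprecated and removed in Python 3.13; the `audioop-lts` backport restores the same API); the repo's actual implementation may differ:

```python
import audioop  # stdlib through Python 3.12; use the audioop-lts backport on 3.13+

def ulaw_to_pcm16(ulaw_frame: bytes) -> bytes:
    """Decode a 160-byte G.711 μ-law frame to 320 bytes of 16-bit linear PCM."""
    return audioop.ulaw2lin(ulaw_frame, 2)  # 2 = bytes per output sample

def pcm16_to_ulaw(pcm_frame: bytes) -> bytes:
    """Encode 320 bytes of 16-bit linear PCM back into a 160-byte μ-law frame."""
    return audioop.lin2ulaw(pcm_frame, 2)  # 2 = bytes per input sample
```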
```mermaid
sequenceDiagram
    participant RTP as RTP Session
    participant Bridge as Audio Bridge
    participant AI as OpenAI/Deepgram

    Note over RTP,AI: Uplink (SIP → AI)
    RTP->>Bridge: Receive G.711 packet (160 bytes)
    Bridge->>Bridge: G.711 → PCM16 (320 bytes)
    Bridge->>AI: WebSocket send(PCM16)

    Note over RTP,AI: Downlink (AI → SIP)
    AI->>Bridge: WebSocket receive(PCM16 chunks)
    Bridge->>Bridge: Accumulate & split to 320-byte frames
    Bridge->>Bridge: PCM16 → G.711 (160 bytes)
    RTP->>Bridge: Request audio frame
    Bridge->>RTP: Send G.711 packet (160 bytes)
```
Key Points:
- 20ms frames: 320 bytes PCM16 (8kHz) or 160 bytes G.711 μ-law
- Asyncio-based: RTP protocol → `asyncio.Queue` → async AI WebSocket
- Variable AI chunks: Accumulated in a buffer and split into fixed 320-byte frames (sketched below)
- No padding during streaming: Incomplete frames are kept until the next chunk arrives
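The accumulate-and-split step in one place, as a sketch (`FrameSlicer` is an illustrative name, not the repo's class; the real logic lives in the AudioAdapter):

```python
FRAME_BYTES = 320  # 20ms of PCM16 at 8kHz: 160 samples x 2 bytes

class FrameSlicer:
    """Turn variable-size AI audio chunks into fixed 320-byte frames (sketch)."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def push(self, chunk: bytes) -> list[bytes]:
        """Buffer a chunk and return every complete frame now available.

        A trailing partial frame stays buffered, unpadded, until the next
        chunk arrives.
        """
        self._buf.extend(chunk)
        frames = []
        while len(self._buf) >= FRAME_BYTES:
            frames.append(bytes(self._buf[:FRAME_BYTES]))
            del self._buf[:FRAME_BYTES]
        return frames
```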
AsyncSIPServer (`app/sip_async/async_sip_server.py`)
- Pure asyncio SIP server listening for INVITE requests
- UDP datagram protocol for SIP signaling
- Creates AsyncCall instances for each incoming call
- Handles SIP messages: INVITE, ACK, BYE with proper RFC 3261 responses
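The signaling side builds on asyncio's datagram support. A skeletal sketch of that pattern (`SIPProtocol` here is illustrative; the real INVITE/ACK/BYE handling is in the file above):

```python
import asyncio

class SIPProtocol(asyncio.DatagramProtocol):
    """Illustrative skeleton: receive SIP datagrams over UDP, dispatch by method."""

    def connection_made(self, transport: asyncio.BaseTransport) -> None:
        self.transport = transport

    def datagram_received(self, data: bytes, addr: tuple[str, int]) -> None:
        method = data.split(b" ", 1)[0]
        if method == b"INVITE":
            # The real server parses the SDP, allocates an RTP port,
            # answers 200 OK, and spins up an AsyncCall.
            print(f"INVITE from {addr}")

async def main() -> None:
    loop = asyncio.get_running_loop()
    transport, _ = await loop.create_datagram_endpoint(
        SIPProtocol, local_addr=("0.0.0.0", 6060)
    )
    try:
        await asyncio.Event().wait()  # serve until cancelled
    finally:
        transport.close()
```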
RTPSession (`app/sip_async/rtp_session.py`)
- Pure asyncio RTP protocol implementation
- G.711 μ-law codec (PCMU) support
- Precise 20ms frame timing with drift correction
- Bidirectional audio streaming over UDP
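Drift correction matters because a bare `asyncio.sleep(0.02)` per frame lets scheduler overhead accumulate. A sketch of the deadline-based idea (not the repo's exact code; the callbacks are stand-ins):

```python
import asyncio
import time

FRAME_INTERVAL = 0.020  # 20ms

async def paced_sender(get_frame, send_frame) -> None:
    """Emit one frame every 20ms against absolute deadlines (sketch)."""
    next_deadline = time.monotonic()
    while True:
        send_frame(get_frame())
        next_deadline += FRAME_INTERVAL
        # Sleeping until the absolute deadline cancels out per-iteration
        # overhead instead of letting it accumulate into audible drift.
        delay = next_deadline - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)
        else:
            next_deadline = time.monotonic()  # fell behind; resynchronize
```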
RTPAudioBridge (`app/sip_async/audio_bridge.py`)
- Bridges RTP session with AudioAdapter
- Handles G.711 ↔ PCM16 codec conversion
- Uses asyncio.TaskGroup for structured concurrency
AudioAdapter (`app/bridge/audio_adapter.py`)
- Audio format adapter for SIP ↔ AI streaming
- PCM16 passthrough with optional codec conversion
- Accumulation buffer for variable-size AI chunks → fixed 320-byte frames
- Thread-safe buffers: `asyncio.Queue` for uplink and downlink
CallSession (`app/bridge/call_session.py`)
- Manages AI connection lifecycle for a single call
- Three async tasks per call:
- Uplink: Read from AudioAdapter → send to AI
- AI Receive: Receive AI chunks → feed to AudioAdapter
- Health: Ping AI connection, reconnect on failure
- Uses `asyncio.TaskGroup` for structured concurrency
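That per-call task layout, as a sketch (coroutine bodies elided; the method names are illustrative):

```python
import asyncio

class CallSessionSketch:
    """Three tasks per call under one TaskGroup (illustrative skeleton)."""

    async def run(self) -> None:
        # Structured concurrency: if any task raises (e.g. the AI socket
        # dies for good), the TaskGroup cancels the siblings and the
        # exception propagates, tearing the call down cleanly.
        async with asyncio.TaskGroup() as tg:
            tg.create_task(self._uplink())      # AudioAdapter -> AI
            tg.create_task(self._ai_receive())  # AI -> AudioAdapter
            tg.create_task(self._health())      # ping AI, reconnect on failure

    async def _uplink(self) -> None: ...
    async def _ai_receive(self) -> None: ...
    async def _health(self) -> None: ...
```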
OpenAIRealtimeClient (`app/ai/openai_realtime.py`)
- WebSocket: `wss://api.openai.com/v1/realtime`
- Audio format: `audio/pcmu` (G.711 μ-law @ 8kHz)
- Supports session config: instructions, voice, temperature
- Optional greeting message on connect
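A minimal sketch of opening that WebSocket and streaming audio with the `websockets` package. The `input_audio_buffer.append` event carries base64 audio; the exact session payload and library choice here are assumptions, and `app/ai/openai_realtime.py` is authoritative:

```python
import asyncio
import base64
import json
import os

import websockets  # assumption: the repo may use a different WS client

async def stream_to_openai(uplink: asyncio.Queue[bytes]) -> None:
    """Forward audio frames from the bridge to the Realtime API (sketch)."""
    url = "wss://api.openai.com/v1/realtime?model=gpt-realtime"
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    # websockets>=14 takes additional_headers=; older releases call it extra_headers=
    async with websockets.connect(url, additional_headers=headers) as ws:
        while True:
            frame = await uplink.get()  # one 20ms frame from the AudioAdapter
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(frame).decode("ascii"),
            }))
```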
DeepgramAgentClient (`app/ai/deepgram_agent.py`)
- WebSocket: `wss://agent.deepgram.com/agent`
- Audio format: mulaw (same as G.711 μ-law @ 8kHz)
- Settings: listen model, speak model, LLM model, agent prompt
Set `AI_VENDOR=deepgram` in `.env`:
```bash
AI_VENDOR=deepgram
DEEPGRAM_API_KEY=your-key-here
AGENT_PROMPT_FILE=agent_prompt.yaml
DEEPGRAM_LISTEN_MODEL=nova-2
DEEPGRAM_SPEAK_MODEL=aura-asteria-en
DEEPGRAM_LLM_MODEL=gpt-4o-mini
```

Create `agent_prompt.yaml` (required):
```yaml
instructions: |
  You are a helpful AI assistant. Be concise and friendly.
greeting: "Hello! How can I help you today?"
```

Get your API key from the Deepgram Console.
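Loading that file is plain YAML parsing. A sketch assuming PyYAML (the helper name is illustrative):

```python
from pathlib import Path

import yaml  # PyYAML; an assumption about the repo's parser

def load_agent_prompt(path: str = "agent_prompt.yaml") -> tuple[str, str | None]:
    """Return (instructions, greeting); greeting is optional in the file."""
    data = yaml.safe_load(Path(path).read_text(encoding="utf-8"))
    return data["instructions"], data.get("greeting")
```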
Latency:
- SIP → AI: <10ms (codec only)
- AI → SIP: <10ms (codec only)
- Total: ~100-300ms (AI processing dominates)
Why Fast?
- No resampling (8kHz throughout)
- No client-side VAD/barge-in
- No jitter buffer
- Just codec conversion
Choppy Audio: Check the network path to the AI service; the AI service handles jitter buffering.
High Latency: Verify AI service response times; the client side adds <10ms.
SIP Connection Failed:
- Check firewall/NAT for incoming SIP INVITE on the UDP port
- Verify `SIP_DOMAIN` and `SIP_PORT` in `.env`
- Check logs for SIP protocol errors
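For a quick reachability check, a SIP OPTIONS ping over UDP tells you whether anything is answering on the port (hypothetical helper, not part of the repo; tools like sipsak do the same):

```python
import socket

def sip_options_ping(host: str, port: int = 6060, timeout: float = 2.0) -> bool:
    """Send a minimal SIP OPTIONS request and report whether a reply arrived."""
    msg = (
        f"OPTIONS sip:{host}:{port} SIP/2.0\r\n"
        "Via: SIP/2.0/UDP 0.0.0.0:5060;branch=z9hG4bK-ping\r\n"
        "From: <sip:ping@invalid>;tag=1\r\n"
        f"To: <sip:{host}:{port}>\r\n"
        "Call-ID: ping-1@invalid\r\n"
        "CSeq: 1 OPTIONS\r\n"
        "Max-Forwards: 70\r\n"
        "Content-Length: 0\r\n\r\n"
    ).encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(msg, (host, port))
        try:
            sock.recv(4096)
            return True
        except TimeoutError:
            return False
```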
AI Disconnection:
- Validate API keys
- Check service quotas and rate limits
- Monitor logs for reconnection attempts
Apache License 2.0
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Pure Python implementation with no GPL dependencies.