A production-ready voice assistant system that combines FreeSWITCH telephony, speech recognition, and AI-powered conversation management to create natural voice interactions. The system supports real-time audio streaming, intelligent conversation handling, and seamless call transfers.
- **RAG System**: Advanced Retrieval-Augmented Generation using ChromaDB and sentence-transformers
- **FreeSWITCH Integration**: Seamless telephony integration with real-time audio streaming
- **Speech Recognition**: Powered by Faster-Whisper for accurate speech-to-text
- **LLM Integration**: Intelligent conversation handling with customizable AI models
- **Real-time Audio Processing**: Efficient handling of audio streams with WebSockets
- **Conversation Management**: Context-aware dialogue handling with conversation history
- **Call Transfer**: Intelligent call routing based on conversation context
- **Multi-threaded Architecture**: High-performance handling of concurrent calls
- **Comprehensive Logging**: Detailed logging for debugging and monitoring
- **WebSocket-based Audio Streaming**: Low-latency audio processing
- **Efficient Memory Management**: Chunk-based audio processing for optimal performance
- **Modular Design**: Easy to extend and customize
- **Configuration Management**: Environment-based configuration system
- **Error Handling**: Robust error handling and recovery mechanisms
The system implements a sophisticated RAG pipeline that combines the power of large language models with your organization's specific knowledge base for accurate, up-to-date responses.
```
┌─────────────────┐     ┌───────────────────┐     ┌─────────────────┐
│                 │     │                   │     │                 │
│  Document Store │────▶│  Text Processing  │────▶│  Vector Store   │
│  (.docx files)  │     │  & Chunking       │     │  (ChromaDB)     │
└─────────────────┘     └───────────────────┘     └────────┬────────┘
                                                           │
                                                           ▼
┌─────────────────┐     ┌───────────────────┐     ┌─────────────────┐
│                 │     │                   │     │                 │
│  User Query     │────▶│  Query Processing │────▶│  LLM (Llama 3)  │
│  (Speech/Text)  │     │  & Retrieval      │     │  with RAG       │
└─────────────────┘     └───────────────────┘     └─────────────────┘
```
```
Data/
├── Modelfile      # LLM system prompt and RAG instructions
├── inject_kb.py   # Script to process and vectorize documents
└── vector_db/     # Vector store directory (created automatically)
```
1. **Prepare Your Documents**

   - Place your `.docx` files in the `Data/` folder
   - The system will automatically process and chunk the content
   - Supported formats: `.docx` (Word documents)

2. **Install RAG Dependencies**

   ```bash
   pip install python-docx chromadb sentence-transformers nltk
   python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"
   ```

3. **Initialize the RAG System**

   ```bash
   cd Data
   python inject_kb.py
   ```

   This will:

   - Process all documents in the `KB_FILES` list
   - Split content into semantic chunks (200 tokens each)
   - Generate embeddings using the `all-MiniLM-L6-v2` model
   - Store vectors in ChromaDB for efficient retrieval (a sketch of this step follows the list)
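Conceptually, the injection step looks like the following minimal sketch, assuming ChromaDB's `PersistentClient` API and a naive sentence-based chunker; the file and collection names are illustrative, not taken from `inject_kb.py`:

```python
# Hypothetical sketch of the injection step: extract text from a .docx
# file, chunk it, embed the chunks, and store them in ChromaDB.
import chromadb
import nltk
from docx import Document
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="vector_db")
collection = client.get_or_create_collection("knowledge_base")

# Extract paragraph text from the Word document
doc = Document("example_kb.docx")
text = "\n".join(p.text for p in doc.paragraphs if p.text.strip())

# Naive sentence-based chunking (the real script targets ~200 tokens per chunk)
sentences = nltk.sent_tokenize(text)
chunks, current = [], ""
for sent in sentences:
    if len((current + " " + sent).split()) > 200:
        chunks.append(current.strip())
        current = sent
    else:
        current += " " + sent
if current.strip():
    chunks.append(current.strip())

# Embed each chunk and store it alongside its text
embeddings = model.encode(chunks).tolist()
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings,
)
```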
The system implements a two-stage retrieval process with re-ranking for accurate context selection (a code sketch follows the list):
1. **First-Stage Retrieval**

   - Uses `BAAI/bge-large-en-v1.5` for initial document embedding
   - Retrieves the top-N (default: 10) most similar chunks using cosine similarity
   - Optimized for recall to ensure relevant chunks aren't missed

2. **Second-Stage Re-ranking**

   - Applies `cross-encoder/ms-marco-MiniLM-L-6-v2` for precise relevance scoring
   - Re-ranks the initial results based on query-document interaction
   - Selects the top-K (default: 1) most relevant chunks for the final context
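A minimal sketch of this retrieve-then-rerank flow, using the two models named above; the collection name, function shape, and query are illustrative rather than copied from `app/llm_client.py`:

```python
# Sketch of two-stage retrieval: dense retrieval for recall,
# cross-encoder re-ranking for precision.
import chromadb
from sentence_transformers import SentenceTransformer, CrossEncoder

embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
collection = chromadb.PersistentClient(path="vector_db").get_collection("knowledge_base")

def retrieve_context(query, initial_results=10, final_results=1):
    # Stage 1: embed the query and fetch the top-N candidate chunks
    query_embedding = embedder.encode(query).tolist()
    hits = collection.query(query_embeddings=[query_embedding], n_results=initial_results)
    candidates = hits["documents"][0]

    # Stage 2: score each (query, chunk) pair and keep the top-K
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:final_results]]

print(retrieve_context("What are your business hours?"))
```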
The retrieval pipeline can be tuned in several ways:

1. **Initial Retrieval Count**: adjust `initial_results` in `_retrieve_context()`

   ```python
   # In app/llm_client.py
   context = await self._retrieve_context(query, initial_results=15, final_results=3)
   ```

2. **Final Results Count**: control how many chunks are used for context

   ```python
   # In app/llm_client.py
   context = await self._retrieve_context(query, initial_results=10, final_results=2)
   ```

3. **Embedding Model**

   - Current: `BAAI/bge-large-en-v1.5`
   - Can be replaced with any SentenceTransformer model
   - Update in `llm_client.py`: `model = SentenceTransformer("your-model-name")`

4. **Re-ranker Model**

   - Current: `cross-encoder/ms-marco-MiniLM-L-6-v2`
   - Can be replaced with other cross-encoder models
   - Update in `llm_client.py`: `reranker = CrossEncoder('your-cross-encoder-model')`
You can customize the LLM's behavior by modifying `Data/Modelfile` and creating a custom model with Ollama.

Open `Data/Modelfile` in a text editor and adjust the system prompt and model parameters as needed. For example:
```
# Specify the base model (replace with your preferred model)
FROM llama2

# Set system prompt
SYSTEM """
You are a helpful AI assistant. Provide accurate, helpful responses based on the provided context.
- Keep responses concise and professional
- If you don't know the answer, say so
- Always maintain a helpful and friendly tone
"""

# Set model parameters
PARAMETER temperature 0.7
PARAMETER top_k 50
PARAMETER top_p 0.9
```

Then:

1. Open a terminal in the `Data` directory
2. Run the following command to create and install your custom model:

   ```bash
   cd Data
   ollama create custom-llm -f Modelfile
   ```

3. Pull the base model (if not already downloaded):

   ```bash
   ollama pull llama2  # or your chosen base model
   ```

4. Start using your custom model by referencing it as `custom-llm` in your API calls (see the example below).
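To sanity-check the new model, you can call it directly through Ollama's HTTP API (assuming Ollama is running on its default port 11434):

```python
# Quick check that the custom model responds via Ollama's REST API
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "custom-llm", "prompt": "Hello, who are you?", "stream": False},
)
print(resp.json()["response"])
```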
Update your application's configuration to use the new model:

```env
# In your .env file
LLM_MODEL=custom-llm
```

Clone the repository:

```bash
git clone <repository-url>
cd Conversational_IVR
```

Create and activate a virtual environment:

```bash
# Linux/Mac
python -m venv venv
source venv/bin/activate
```

Install the dependencies:

```bash
# Main application
pip install -r requirements.txt

# FreeSWITCH client (only needed on FreeSWITCH server)
cd freeswitch
pip install -r requirements.txt
cd ..
```

Create a `.env` file in the project root with the following variables:
```env
# Main Application
LLM_API_URL=http://localhost:8000/test/transcription
LOG_LEVEL=INFO

# FreeSWITCH ESL Configuration
FREESWITCH_HOST=localhost
FREESWITCH_PORT=8021
FREESWITCH_PASSWORD=ClueCon

# TTS Configuration
TTS_MODEL=tts_models/en/ljspeech/tacotron2-DDC
TTS_OUTPUT_DIR=./tts_output
```

1. Copy the example environment file:

   ```bash
   cp .env.example .env
   ```

2. Update the configuration in `.env`:

   ```env
   # FreeSWITCH ESL Connection
   FREESWITCH_HOST=your-freeswitch-host
   FREESWITCH_PORT=8021
   FREESWITCH_PASSWORD=your-password

   # LLM API Settings
   LLM_API_URL=your-llm-api-url

   # TTS Settings
   TTS_MODEL=tts_models/en/ljspeech/tacotron2-DDC
   TTS_OUTPUT_DIR=./tts_output

   # Application Settings
   MAX_CONVERSATION_HISTORY=10
   LOG_LEVEL=INFO
   ```

3. Create the TTS output directory:

   ```bash
   # Linux/Mac
   mkdir -p /tmp/tts

   # Windows
   mkdir %TEMP%\tts
   ```
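How the application reads these variables is internal to the code, but a typical pattern, assuming `python-dotenv`, looks like this (a sketch, not the project's actual config module):

```python
# Illustrative config loading with python-dotenv; variable names match .env above
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

FREESWITCH_HOST = os.getenv("FREESWITCH_HOST", "localhost")
FREESWITCH_PORT = int(os.getenv("FREESWITCH_PORT", "8021"))
FREESWITCH_PASSWORD = os.getenv("FREESWITCH_PASSWORD", "ClueCon")
LLM_API_URL = os.getenv("LLM_API_URL")
TTS_OUTPUT_DIR = os.getenv("TTS_OUTPUT_DIR", "./tts_output")
MAX_CONVERSATION_HISTORY = int(os.getenv("MAX_CONVERSATION_HISTORY", "10"))
```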
Start the main application:

```bash
# In the project root directory
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```

Start the FreeSWITCH client:

```bash
# In a separate terminal
cd freeswitch
python ivr_client.py
```

Dial one of the configured extensions (5000, 5001, or 5002) from any SIP phone registered to your FreeSWITCH server.

Check the application logs for real-time debugging:

```bash
tail -f logs/app.log
```

```
+----------------+     +------------------+     +----------------+
|                |     |                  |     |                |
|   FreeSWITCH   |<--->|    IVR Client    |<--->|    Main App    |
|   (SIP/RTP)    |     |  (ivr_client.py) |     |   (FastAPI)    |
|                |     |                  |     |                |
+----------------+     +------------------+     +--------+-------+
                                                         |
                                                         |
                                            +------------v-------------+
                                            |                          |
                                            |         LLM API          |
                                            |  (e.g., Ollama, OpenAI)  |
                                            |                          |
                                            +--------------------------+
```
1. Call arrives at FreeSWITCH
2. The `ivr.lua` script handles the call and sets up audio streaming
3. Audio is streamed to `ivr_client.py` via WebSocket
4. Speech is transcribed using Faster-Whisper (sketched below)
5. The transcription is sent to the main application
6. The main application processes the text with the LLM
7. The response is converted to speech and sent back to the caller
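Step 4 might look roughly like this in the client, assuming faster-whisper's `WhisperModel` API; the model size and file name are illustrative:

```python
# Sketch of transcribing buffered call audio with faster-whisper
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu")  # model size is illustrative

# transcribe() returns a generator of segments plus metadata
segments, _info = model.transcribe("call_audio.wav")
transcription = " ".join(segment.text.strip() for segment in segments)
print(transcription)  # then sent on to the main application (step 5)
```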
```
POST /test/transcription
```

Request body:

```json
{
  "call_uuid": "unique-call-identifier",
  "transcription": "Customer's spoken text"
}
```

Response headers:

- `X-File-Metadata`: JSON string with audio file details
- `X-LLM-Response`: Generated text response from the LLM

Response body:

- Audio file in WAV format (16 kHz, mono)
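As an illustration, a Python client could call this endpoint and save the returned audio like so (the call UUID and transcription are placeholders):

```python
# Call the transcription endpoint and save the WAV response
import json
import requests

resp = requests.post(
    "http://localhost:8000/test/transcription",
    json={"call_uuid": "test123", "transcription": "What are your opening hours?"},
)
resp.raise_for_status()

print("LLM said:", resp.headers.get("X-LLM-Response"))
print("Metadata:", json.loads(resp.headers.get("X-File-Metadata", "{}")))

with open("reply.wav", "wb") as f:
    f.write(resp.content)  # 16 kHz mono WAV per the spec above
```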
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
- **Endpoint**: `ws://localhost:8000/ws`
- **Protocol**: binary WebSocket
- **Input**: binary audio data (16 kHz, mono, PCM)
The server responds with JSON messages:

```json
{
  "type": "connection",
  "status": "connected",
  "message": "Ready to receive audio"
}
```

```json
{
  "type": "partial_transcript",
  "text": "partial transcription..."
}
```

```json
{
  "type": "final_transcript",
  "text": "complete transcription"
}
```

```json
{
  "type": "llm_response",
  "text": "AI generated response",
  "user_message": "original user message"
}
```

```json
{
  "type": "error",
  "message": "error description"
}
```
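A minimal client sketch for this protocol, assuming the `websockets` package and a raw 16 kHz mono PCM file on disk:

```python
# Stream raw PCM to the WebSocket endpoint and print server messages
import asyncio
import json
import websockets

async def stream_audio(path):
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        print(json.loads(await ws.recv()))  # the "connection" message

        with open(path, "rb") as f:
            while chunk := f.read(3200):  # 100 ms of 16 kHz 16-bit mono
                await ws.send(chunk)      # binary frame

        async for message in ws:
            msg = json.loads(message)
            print(msg["type"], msg.get("text", msg.get("message", "")))
            if msg["type"] in ("llm_response", "error"):
                break

asyncio.run(stream_audio("sample.raw"))
```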
Test the transcription endpoint:

```bash
curl -X POST "http://localhost:8000/test/transcription" \
  -H "Content-Type: application/json" \
  -d '{"call_uuid": "test123", "transcription": "Hello, how can I help you today?"}'
```

Audio format:

- Sample Rate: 16000 Hz
- Channels: 1 (Mono)
- Format: PCM 16-bit
- Encoding: Raw audio bytes
- Python 3.8+
- FreeSWITCH 1.10+ with ESL enabled
- LLM API endpoint (compatible with OpenAI-like API)
- Coqui TTS server (for text-to-speech synthesis)
- mod_audio_stream
- mod_event_socket (for ESL)
- mod_vlc (optional, for additional codec support)
- Properly configured audio codecs (PCM, OPUS, etc.)
- CPU: 8+ cores recommended
- RAM: 8GB+ (16GB recommended for production)
- Storage: SSD recommended for better I/O performance