A voice assistant that integrates VideoSDK, a RAG pipeline, and two custom APIs for document handling: one to upload PDFs to a vector database and another to search relevant content.
The system provides a full voice flow:
STT → RAG (docs) → LLM → TTS
with a custom VideoSDK plugin for seamless integration.
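The cascade above can be sketched end to end. Every stage below is a hypothetical stand-in, not the project's actual code: the real pipeline wires Deepgram, Qdrant, an LLM, and ElevenLabs together through the VideoSDK plugin.

```python
# Toy sketch of the cascading pipeline: STT -> RAG -> LLM -> TTS.
# Each stage is a stub standing in for a real service call.

def stt(audio: bytes) -> str:
    # Stand-in for Deepgram speech-to-text.
    return audio.decode("utf-8")

def rag(query: str) -> list[str]:
    # Stand-in for Qdrant retrieval: return matching context chunks.
    corpus = {"heart": "The heart pumps blood."}
    return [text for key, text in corpus.items() if key in query.lower()]

def llm(query: str, context: list[str]) -> str:
    # Stand-in for the LLM call: echo the query plus retrieved context.
    return f"Q: {query} | context: {' '.join(context)}"

def tts(text: str) -> bytes:
    # Stand-in for ElevenLabs text-to-speech.
    return text.encode("utf-8")

def pipeline(audio: bytes) -> bytes:
    # Run the full cascade on one utterance.
    query = stt(audio)
    return tts(llm(query, rag(query)))
```

In the real agent each stage is streaming and interruptible; the sketch only shows the data flow between them.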
- Voice Interaction: Real-time speech recognition using Deepgram STT.
- Document Retrieval: Upload PDFs to a Qdrant vector database and retrieve context for LLM responses.
- RAG Pipeline: Retrieves top-k relevant chunks from uploaded documents for more accurate answers.
- Text-to-Speech: ElevenLabs TTS reads out LLM responses.
- Custom VideoSDK Plugin: Integrates STT, RAG, LLM, and TTS in a single cascading pipeline.
- Interruptible Conversation: User speech interrupts ongoing TTS or LLM generation.
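The top-k retrieval step can be illustrated with plain cosine similarity over pre-embedded chunks. The `top_k` helper and the vectors are illustrative only; the project delegates this work to Qdrant.

```python
# Illustrative top-k retrieval over (text, vector) chunk pairs.
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunks, k=2):
    # chunks: list of (text, vector) pairs; return the k most similar texts.
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```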
- Sample .env

```
OPENAI_API_KEY=<your_openai_api_key>
DEEPGRAM_API_KEY=<your_deepgram_api_key>
ELEVENLABS_API=<your_elevenlabs_api_key>
ROOM_ID=<videosdk_room_id>
AUTH_TOKEN=<videosdk_auth_token>
VECTOR_DB_URL=<qdrant_url>
VECTOR_DB_API_KEY=<qdrant_api_key>
```

- Run Qdrant locally

```
docker pull qdrant/qdrant
docker run -p 6333:6333 qdrant/qdrant
```

- Create and activate a virtual env

```
py -m venv venv
venv\Scripts\activate
```

- Install requirements

```
pip install -r requirements.txt
```

- For document upload and retrieval

```
cd backend/
uvicorn main:app --reload --port 8000
```

- For the Voice Agent (in console)

```
python voice_agent.py console
```

- Example conversation

user_input: Guide to Videosdk integration with rag pipeline?
agent_output: Build an AI Agent with RAG using VideoSDK Agents SDK Goal Your task is to build a voice AI agent using the VideoSDK Agents SDK.T...... (from the uploaded document)
user_input: Explain human heart?
agent_output: The human heart is a muscular organ, roughly the size of a fist, located in the chest that pumps blood throughout the body... (LLM response)
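On the upload side, PDFs must be split into chunks before they are embedded into Qdrant. A minimal sketch of fixed-size chunking with overlap follows; `chunk_size` and `overlap` are illustrative defaults, not the project's actual values.

```python
# Illustrative fixed-size chunking with overlap for the PDF upload path.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    # Slide a window of chunk_size characters, stepping by chunk_size - overlap,
    # so adjacent chunks share `overlap` characters of context.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Overlap keeps a sentence that straddles a chunk boundary retrievable from either side.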