A powerful, local-first interactive voice assistant built for offline capability and complete control. This project orchestrates state-of-the-art local AI models to provide a seamless voice-to-voice experience without relying on third-party cloud APIs.
- Full Voice Interaction: Talk to the agent and hear it speak back naturally.
- Persistent Memory: Remembers your previous conversations across sessions using a robust SQLite database.
- Local Intelligence: Powered by DeepSeek R1 (via Ollama) for reasoning and Faster Whisper for transcription.
- Intelligent Summarization: Automatically summarizes interactions to maintain concise context.
| Advantages | Trade-offs |
|---|---|
| Zero Cost: No recurring API fees; runs entirely on your hardware. | Hardware Dependent: Performance scales with your CPU/GPU power. |
| Offline: Works completely without an internet connection (after initial setup). | Setup: Requires installing and managing local models (Ollama, etc.). |
| Customizable: Full access to modify the graph, prompts, and memory logic. | Model Capability: Local models (e.g., 8B) are powerful but may lag behind massive cloud models (e.g., GPT-4) in complex reasoning. |
| Low Latency: Eliminates network latency constraints. | Resource Usage: Can be memory and compute intensive during inference. |
The system uses LangGraph to manage the conversational flow (a minimal sketch of the graph wiring follows this list):

- Transcribe: Faster Whisper converts your voice to text.
- Context Retrieval: Fetches conversation history and session context from SQLite.
- Process: DeepSeek R1 generates a response and "thinks" through the problem.
- Synthesize: Chatterbox TTS converts the text response back to audio.
- Save & Summarize: The interaction is logged, and a summary is generated for future context.
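Here is a rough idea of how such a pipeline can be wired with LangGraph's `StateGraph`. The node names, state fields, and stub bodies are illustrative assumptions, not the project's actual code in `app/graph.py`, `app/nodes.py`, or `app/state.py`:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class AgentState(TypedDict, total=False):
    # Illustrative fields; the real state lives in app/state.py.
    audio_path: str
    text: str
    history: str
    response: str
    reply_audio_path: str

# Stub nodes standing in for the real logic in app/nodes.py.
def transcribe(state: AgentState) -> dict:
    return {"text": "<Faster Whisper transcript>"}

def retrieve_context(state: AgentState) -> dict:
    return {"history": "<history fetched from SQLite>"}

def process(state: AgentState) -> dict:
    return {"response": "<DeepSeek R1 reply via Ollama>"}

def synthesize(state: AgentState) -> dict:
    return {"reply_audio_path": "reply.wav"}  # Chatterbox TTS output

def save_and_summarize(state: AgentState) -> dict:
    return {}  # log the turn and store a summary for future context

builder = StateGraph(AgentState)
builder.add_node("transcribe", transcribe)
builder.add_node("retrieve_context", retrieve_context)
builder.add_node("process", process)
builder.add_node("synthesize", synthesize)
builder.add_node("save_and_summarize", save_and_summarize)

# Linear flow: voice in -> context -> reasoning -> voice out -> persistence.
builder.add_edge(START, "transcribe")
builder.add_edge("transcribe", "retrieve_context")
builder.add_edge("retrieve_context", "process")
builder.add_edge("process", "synthesize")
builder.add_edge("synthesize", "save_and_summarize")
builder.add_edge("save_and_summarize", END)

graph = builder.compile()
```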
- Python 3.10+
- Ollama running locally with the model pulled:
  ```bash
  ollama pull deepseek-r1:8b
  ```
- FFmpeg (often required for audio processing)
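To verify that Ollama is up and the model is available before starting the server, one option is to query Ollama's local REST API (it listens on port 11434 by default):

```python
import requests

# List the models the local Ollama instance has pulled.
tags = requests.get("http://localhost:11434/api/tags").json()
names = [m["name"] for m in tags.get("models", [])]
print(names)
assert any(n.startswith("deepseek-r1") for n in names), "run: ollama pull deepseek-r1:8b"
```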
- Clone & Setup:
  ```bash
  git clone https://github.com/abrarshahh/PyVoiceAgent.git
  cd PyVoiceAgent
  python -m venv .venv
  .venv\Scripts\activate   # macOS/Ubuntu: source .venv/bin/activate
  ```
- Install Dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Run the Server:
  ```bash
  python -m fastapi run app/main.py
  ```
The API is simple and RESTful. All endpoints accept a `session_id` to track your specific conversation context.
POST /chat/text

```json
{
  "text": "Hello, how are you?",
  "session_id": "optional-custom-session-id"
}
```

Returns: Audio file (.wav)
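For example, calling the text endpoint from Python (assuming the server is running on FastAPI's default `http://localhost:8000`; the session id is just a placeholder):

```python
import requests

# Send a text message and save the spoken reply.
resp = requests.post(
    "http://localhost:8000/chat/text",
    json={"text": "Hello, how are you?", "session_id": "demo-session"},
)
resp.raise_for_status()
with open("reply.wav", "wb") as f:
    f.write(resp.content)
```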
POST /chat/voice

- Form Data:
  - `file`: audio file (e.g., mp3/wav)
  - `session_id`: text, optional

Returns: Audio file (.wav)
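And the voice endpoint, again assuming the default local address and a placeholder recording:

```python
import requests

# Upload a recorded question and save the spoken reply.
with open("question.wav", "rb") as audio:
    resp = requests.post(
        "http://localhost:8000/chat/voice",
        files={"file": audio},
        data={"session_id": "demo-session"},
    )
resp.raise_for_status()
with open("reply.wav", "wb") as f:
    f.write(resp.content)
```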
- `app/main.py`: API entry point.
- `app/graph.py`: LangGraph workflow definition.
- `app/nodes.py`: Core logic for transcription, LLM processing, and TTS.
- `app/database.py`: SQLite storage and context management.
- `app/state.py`: Data structures for the agent's state.
- `conversation_memory.db`: Local database file (auto-created).
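As a rough illustration of what the SQLite layer in `app/database.py` might look like, here is a minimal sketch; the table name, columns, and helper names below are assumptions for the example, not the project's actual schema:

```python
import sqlite3

DB_PATH = "conversation_memory.db"  # auto-created on first use

def _connect() -> sqlite3.Connection:
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS messages (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               session_id TEXT NOT NULL,
               role TEXT NOT NULL,          -- "user" or "assistant"
               content TEXT NOT NULL,
               created_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn

def save_message(session_id: str, role: str, content: str) -> None:
    with _connect() as conn:
        conn.execute(
            "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
            (session_id, role, content),
        )

def get_history(session_id: str, limit: int = 20) -> list[tuple[str, str]]:
    # Fetch the most recent turns, then reverse into chronological order.
    with _connect() as conn:
        rows = conn.execute(
            "SELECT role, content FROM messages "
            "WHERE session_id = ? ORDER BY id DESC LIMIT ?",
            (session_id, limit),
        ).fetchall()
    return list(reversed(rows))
```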