An AI-powered system that listens to sales conversations, tracks script progress, and suggests responses in real time.
Status: Phase 4 (Minimal Script-Only Pivot) complete. See docs/mvp.md for the detailed status and roadmap.
```mermaid
graph TB
subgraph Frontend
Browser["Web Browser (AudioWorklet)"]
end
subgraph Backend
FastAPI[FastAPI Server]
Buffer[DualBufferManager]
end
subgraph Services
Whisper["WhisperLive (Docker)"]
LocalAI["LocalAI (Phi-3.5-mini)"]
end
Browser -- "Audio Stream (WS)" --> FastAPI
FastAPI -- Forward Audio --> Whisper
Whisper -- Transcript --> FastAPI
FastAPI -- Text --> Buffer
Buffer -- Trigger Analysis --> LocalAI
LocalAI -- JSON Response --> FastAPI
FastAPI -- "Script Guidance (WS)" --> Browser
```
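To make the data flow concrete, here is a minimal client sketch that streams audio bytes over the WebSocket and prints the guidance pushed back. The /ws path, the message shapes, and the audio format are illustrative assumptions, not the documented API.

```python
# Hypothetical client sketch: the endpoint path and message format are
# assumptions, not the documented API. Requires: pip install websockets
import asyncio
import json

import websockets


async def stream_audio(frames: list[bytes]) -> None:
    # Assumed WebSocket endpoint on the FastAPI server (port 8080 per the Docker setup).
    async with websockets.connect("ws://localhost:8080/ws") as ws:
        for frame in frames:
            await ws.send(frame)  # raw PCM bytes, as the browser AudioWorklet would send
        # Print guidance messages the server pushes back (assumed to be JSON).
        async for message in ws:
            print(json.loads(message))


if __name__ == "__main__":
    silence = [b"\x00" * 3200 for _ in range(10)]  # ~1 s of 16 kHz 16-bit mono silence
    asyncio.run(stream_audio(silence))
```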
The easiest way to run the application is using Docker. This sets up both the WhisperLive server and the Web UI.
- Docker & Docker Compose
- NVIDIA GPU (Optional, but recommended for speed)
- OpenRouter API Key
1. Clone and configure:

   ```bash
   git clone <repo>
   cd sales-rpg-ai
   echo "OPENROUTER_API_KEY=sk-or-v1-your-key-here" > .env
   ```

2. Run:

   ```bash
   make up
   ```

   This will automatically detect if you have an NVIDIA GPU and configure the containers accordingly.

3. Use: open http://localhost:8080 in your browser.
For a manual (non-Docker) setup, you need:

- Python 3.10+
- WhisperLive Docker server running on port 9090
- OpenRouter API key (free at https://openrouter.ai)
```bash
# 1. Install dependencies
pip install -e .
pip install python-dotenv websockets uvicorn

# 2. Configure API key
echo 'OPENROUTER_API_KEY=sk-or-v1-your-key-here' > .env

# 3. Start WhisperLive server (if not running)
# CPU
docker run -it -p 9090:9090 ghcr.io/collabora/whisperlive-cpu:latest
# GPU
docker run -it --gpus all -p 9090:9090 ghcr.io/collabora/whisperlive-gpu:latest

# 4. Run Web UI
python src/web/run.py
```

```bash
# Real-time analysis (streaming, with WhisperLive)
uv run python src/realtime_transcribe.py path/to/audio.mp4

# Real-time with verbose output (shows all LLM responses)
uv run python src/realtime_transcribe.py path/to/audio.mp4 --verbose

# Microphone input (live)
uv run python src/realtime_transcribe.py --mic

# Test real-time flow without WhisperLive
uv run python src/test_realtime_flow.py

# Batch analysis (original Phase 1 script)
uv run python src/transcribe_and_analyze.py path/to/audio.mp4

# Test objection detection with mock data
uv run python test_objection_detection.py
```

Detects 4 objection types:
| Type | Description |
|---|---|
| PRICE | Cost, budget, expense concerns |
| TIME | Not ready, need to think, timing issues |
| DECISION_MAKER | Need to consult spouse/partner/boss |
| OTHER | Any other objections |
For each detected objection, the report includes:
- Type classification
- Confidence level (HIGH/MEDIUM/LOW)
- Smokescreen detection (genuine vs. hiding concerns)
- 3 suggested responses
Example output:

```text
OBJECTION #1:
Type: PRICE
Confidence: HIGH
Smokescreen: MAYBE
Quote: "That sounds expensive"

Suggested Responses:
1. "I understand budget is a concern. Let me show you the ROI..."
2. "What specific aspect of the pricing concerns you most?"
3. "Many clients initially felt the same way, but found..."
```
```text
sales-rpg-ai/
├── src/
│   ├── transcribe_and_analyze.py    # Batch analysis (Phase 1)
│   ├── realtime_transcribe.py       # Real-time analysis (Phase 2)
│   ├── test_realtime_flow.py        # Test without WhisperLive
│   └── realtime/                    # Real-time module
│       ├── __init__.py              # Module exports
│       ├── buffer_manager.py        # DualBufferManager, BufferConfig
│       └── analysis_orchestrator.py # AnalysisOrchestrator, StreamingAnalyzer
├── test_objection_detection.py      # Mock data test suite
├── docs/
│   ├── mvp.md                       # Phase roadmap
│   └── product/                     # PRDs and PM docs
├── WhisperLive/                     # Transcription submodule
├── pyproject.toml                   # Dependencies
├── CHANGELOG.md                     # Development history
└── .env                             # API keys
```
```mermaid
flowchart TD
A[Audio File] --> B{File Exists?}
B -->|No| C[Exit with Error]
B -->|Yes| D[transcribe_audio]
D --> E[Connect to WhisperLive]
E --> F[Stream Audio via WebSocket]
F --> G[Whisper Model Processing]
G --> H[Callback: Update Transcript]
H --> I{Audio Complete?}
I -->|No| G
I -->|Yes| J[analyze_objections]
J --> K{API Key Present?}
K -->|No| L[Return Transcript Only]
K -->|Yes| M[Call OpenRouter API]
M --> N[Parse LLM Response]
N --> O[Display Results]
```
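Read as code, the flowchart reduces to roughly this skeleton. The function names come from the diagram; the bodies are placeholder stubs, not the real implementation in src/transcribe_and_analyze.py.

```python
# Skeleton of the batch flow above; the bodies are illustrative stubs.
import os
import sys


def transcribe_audio(path: str) -> str:
    """Stream the file to WhisperLive over WebSocket and collect the transcript."""
    return "(transcript placeholder)"  # real logic lives in src/transcribe_and_analyze.py


def analyze_objections(transcript: str) -> str:
    """Send the transcript to the LLM and return the parsed objection report."""
    return "(analysis placeholder)"


def main(path: str) -> None:
    if not os.path.exists(path):                   # B: File Exists?
        sys.exit(f"Audio file not found: {path}")  # C: Exit with Error
    transcript = transcribe_audio(path)            # D-I: stream until complete
    if not os.environ.get("OPENROUTER_API_KEY"):   # K: API Key Present?
        print(transcript)                          # L: Return Transcript Only
        return
    print(analyze_objections(transcript))          # M-O: analyze, parse, display


if __name__ == "__main__":
    main(sys.argv[1])
```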
| Component | Technology |
|---|---|
| Language | Python 3.10+ |
| Transcription | WhisperLive (WebSocket, Whisper base model) |
| Analysis | OpenRouter API (Llama 3.3 70B, free tier) |
| Audio Formats | MP4, MP3, WAV, M4A, FLAC (via FFmpeg) |
| Package Manager | UV |
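Because OpenRouter exposes an OpenAI-compatible API, the analysis call can be as small as the sketch below. The prompt is illustrative only; the project's actual prompt and response parsing are not shown in this README.

```python
# Hedged sketch of the analysis call. OpenRouter is OpenAI-compatible, so the
# openai client works against it; the prompt here is a stand-in.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

completion = client.chat.completions.create(
    model="meta-llama/llama-3.3-70b-instruct:free",
    messages=[
        {"role": "system", "content": "Detect sales objections and suggest responses."},
        {"role": "user", "content": "That sounds expensive."},
    ],
)
print(completion.choices[0].message.content)
```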
| Problem | Solution |
|---|---|
| "No module named 'openai'" | Use uv run python ... or activate venv first |
| "Audio file not found" | Check file path (relative or absolute) |
| "No transcript generated" | Check WhisperLive server: docker ps | grep 9090 |
| "No OpenRouter API key found" | Add to .env or export OPENROUTER_API_KEY |
- Audio transcription (WhisperLive)
- Objection detection (4 types)
- Response suggestions (3 per objection)
- Confidence scoring
- Test suite
- Live transcript streaming (WhisperLive callback)
- Architecture design (dual buffer PRD)
- DualBufferManager implementation
- AnalysisOrchestrator implementation
- Chunked analysis (~1-3s latency)
- Integration script (realtime_transcribe.py)
- Verbose mode for debugging
- Test script without WhisperLive
- Web UI (FastAPI + WebSocket)
- Docker Deployment (CPU/GPU auto-detect)
- "Gatekeeper" Rate Limiting (Cost Optimization)
- LocalAI Integration: Replaced OpenRouter with local LLM (Phi-3.5-mini) running in Docker.
- Conversation State Manager: Track sales phases (Intro, Discovery, Closing).
- Invisible Overlay UI: Desktop widget that floats over Zoom/Teams.
- Ultra-low latency: Optimize local networking for <150ms response.
- Enterprise Deployment: Self-hosted container stack (WhisperLive + LocalAI + App).
The application supports two deployment modes to fit different business models:

Cloud Mode (SaaS):
- Transcription: Hosted WhisperLive (or local).
- Intelligence: OpenRouter API (Llama 3.3 70B).
- Cost Control: "Gatekeeper" logic analyzes only text containing objection keywords (see the sketch after this list).
- Pros: No high-end hardware required for the user.

Local Mode (Private):
- Transcription: Local WhisperLive (GPU).
- Intelligence: LocalAI running a quantized model (Phi-3.5-mini, or e.g. Mistral 7B) inside the Docker network.
- Privacy: 100% local processing. No audio or text leaves the machine.
- Pros: Free (no API costs), offline capable, unlimited analysis, low latency.
- Requirements: NVIDIA GPU (4GB+ VRAM recommended).
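A minimal sketch of the Gatekeeper screen, assuming a hand-picked keyword list (the project's actual keywords are not documented here). Anything that fails the screen never reaches the paid API.

```python
# Minimal Gatekeeper sketch: only text that trips the keyword screen is sent
# to the LLM. The keyword list below is an assumption, not the project's.
OBJECTION_KEYWORDS = {
    "expensive", "price", "cost", "budget",   # PRICE
    "later", "think about", "not ready",      # TIME
    "spouse", "partner", "boss",              # DECISION_MAKER
}


def should_analyze(text: str) -> bool:
    """Return True if the chunk looks like it may contain an objection."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in OBJECTION_KEYWORDS)


assert should_analyze("That sounds expensive")
assert not should_analyze("Nice weather today")
```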
| Model | Size | Speed | Accuracy |
|---|---|---|---|
| tiny | ~39MB | Very Fast | Lower |
| base | ~74MB | Fast | Good |
| small | ~244MB | Medium | Better |
| medium | ~769MB | Slow | Great |
| large | ~1.5GB | Very Slow | Best |
The default is base, balancing speed and accuracy.
```
# WhisperLive
host = "localhost"
port = 9090
model = "base"

# OpenRouter
base_url = "https://openrouter.ai/api/v1"
model = "meta-llama/llama-3.3-70b-instruct:free"
```

```mermaid
flowchart TB
subgraph WhisperLive
A[Transcript Chunks] --> B[transcription_callback]
end
subgraph DualBufferManager
B --> C[Active Buffer]
B --> D[Master Transcript]
C --> E{Trigger Condition?}
E -->|Time/Words/Silence| F[Send to LLM]
E -->|No| C
F --> G[Clear Active Buffer]
G --> C
end
subgraph AnalysisOrchestrator
F --> H[Async LLM Call]
H --> I[Parse Response]
I --> J[Display Suggestions]
end
D --> K[Full Context for LLM]
K --> H
```
| Condition | Default | Rationale |
|---|---|---|
| Time elapsed | 3 seconds | Ensures responsiveness |
| Completed segments | 2 segments | Natural speech boundaries |
| Character count | 150 chars | Handles fast speech |
| Sentence ending | . ? ! | Natural analysis points |
| Silence detected | 1.5 seconds | Speaker pauses |
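Taken together, the trigger logic amounts to something like the sketch below, using the table's defaults. The class names match src/realtime/buffer_manager.py, but this is a simplified illustration, not the real implementation.

```python
# Simplified illustration of the trigger conditions above; the real
# DualBufferManager lives in src/realtime/buffer_manager.py.
import time
from dataclasses import dataclass, field


@dataclass
class BufferConfig:
    max_seconds: float = 3.0       # time elapsed
    max_segments: int = 2          # completed segments
    max_chars: int = 150           # character count
    silence_seconds: float = 1.5   # silence detected


@dataclass
class DualBufferManager:
    config: BufferConfig = field(default_factory=BufferConfig)
    active: list[str] = field(default_factory=list)   # cleared after each analysis
    master: list[str] = field(default_factory=list)   # full context for the LLM
    started_at: float = field(default_factory=time.monotonic)
    last_text_at: float = field(default_factory=time.monotonic)

    def add_segment(self, text: str) -> str | None:
        """Buffer one transcript segment; return a chunk to analyze if triggered."""
        self.active.append(text)
        self.master.append(text)
        self.last_text_at = time.monotonic()
        if self._should_trigger():
            return self._flush()
        return None

    def check_silence(self) -> str | None:
        """Polled by the caller: flush the buffer after a long pause."""
        if self.active and time.monotonic() - self.last_text_at >= self.config.silence_seconds:
            return self._flush()
        return None

    def _flush(self) -> str:
        chunk = " ".join(self.active)
        self.active.clear()
        self.started_at = time.monotonic()
        return chunk

    def _should_trigger(self) -> bool:
        text = " ".join(self.active)
        return (
            time.monotonic() - self.started_at >= self.config.max_seconds
            or len(self.active) >= self.config.max_segments
            or len(text) >= self.config.max_chars
            or text.rstrip().endswith((".", "?", "!"))  # sentence ending
        )
```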
License: MIT