Hands-free, continuous voice transcription powered by Whisper ASR
Listenr provides real-time speech-to-text transcription with automatic speech detection. No more clicking start/stop buttons - just speak naturally and watch your words appear instantly!
- ✨ Continuous Listening: Microphone stays open, transcripts appear automatically
- 🎯 Smart Speech Detection: Silero VAD automatically segments your speech
- 🚀 Real-Time: WebSocket streaming for minimal latency (~500ms-2s)
- 📱 Mobile-First: Beautiful touch-optimized interface for phones
- 🤖 Optional LLM: Post-process with Ollama for improved accuracy
- 💾 Auto-Save: All audio clips and transcripts saved with metadata
- 🌐 JSON API: Consistent structured responses everywhere
# Install dependencies
pip install -r server/requirements.txt
# Install ffmpeg (if not already installed)
# Linux:
sudo apt install ffmpeg
# macOS:
brew install ffmpeg

# Basic usage
python server/app.py
# Open in browser:
# - Computer: http://localhost:5000
# - Phone: http://YOUR_IP:5000

# Install Ollama from https://ollama.ai
curl -fsSL https://ollama.ai/install.sh | sh
# Pull a model
ollama pull gemma2:2b
# Run with LLM enabled
export LISTENR_USE_LLM=true
python server/app.py

- Open http://localhost:5000 in your browser
- Click the big microphone button once
- Grant microphone permissions
- Start talking naturally
- Pause between thoughts
- Watch transcripts appear automatically!
That's it! No more clicking buttons. The system automatically:
- Detects when you start speaking
- Records your speech
- Detects when you pause/finish
- Transcribes the audio
- Displays the transcript
- Saves everything to disk
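For a sense of what happens under the hood, here is a rough, self-contained sketch of the same detect-then-transcribe flow built directly on Silero VAD and faster-whisper. It is illustrative only, not the actual unified_asr.py code, and assumes a local 16 kHz recording.wav:

```python
# Illustrative sketch: detect speech with Silero VAD, then transcribe each
# segment with faster-whisper. Not the unified_asr.py implementation.
import torch
from faster_whisper import WhisperModel

SAMPLE_RATE = 16000

# Load Silero VAD and its helper utilities via torch.hub
vad_model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
get_speech_timestamps, _, read_audio, _, _ = utils

# Load a Whisper model (matches the defaults in config.ini)
whisper = WhisperModel("base", device="cpu", compute_type="int8")

# Read audio and find speech segments
audio = read_audio("recording.wav", sampling_rate=SAMPLE_RATE)
speech = get_speech_timestamps(audio, vad_model,
                               threshold=0.5, sampling_rate=SAMPLE_RATE)

# Transcribe each detected segment and print it with its start time
for ts in speech:
    clip = audio[ts["start"]:ts["end"]].numpy()
    segments, _ = whisper.transcribe(clip)
    text = " ".join(seg.text.strip() for seg in segments)
    print(f"[{ts['start'] / SAMPLE_RATE:.1f}s] {text}")
```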
Use the unified ASR system directly from the terminal:
# CLI mode (continuous terminal transcription)
python unified_asr.py
# With LLM
python unified_asr.py --llm
# Custom storage
python unified_asr.py --storage ~/my_clips

Perfect for hands-free recording on your phone:
- Start the server on your computer
- Find your IP: ip addr show | grep inet
- Open http://YOUR_IP:5000 on your phone
- Tap mic button once
- Put phone in pocket
- Start talking!
All transcripts save automatically with audio clips and metadata.
listenr/
├── unified_asr.py # Core ASR system (Whisper + VAD + JSON)
├── config_manager.py # Configuration management
├── llm_processor.py # Optional LLM post-processing
├── server/
│ ├── app.py # Flask WebSocket server
│ ├── requirements.txt # Python dependencies
│ └── templates/
│ └── index.html # Web interface
└── README.md # This file
Browser Microphone
↓ (continuous audio)
WebSocket Connection
↓ (base64 chunks)
UnifiedASR.process_vad_chunk()
↓ (VAD segmentation)
Whisper Transcription
↓ (optional)
LLM Post-Processing
↓
JSON Response → WebSocket → Browser
↓
Display + Save to Disk
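On the server side, this flow reduces to a small Socket.IO event handler. The sketch below is a simplified illustration using Flask-SocketIO; the event names (audio_chunk, transcript) and payload layout are assumptions, not necessarily what server/app.py actually uses:

```python
# Simplified sketch of the server-side WebSocket flow.
# Event names and payload fields are assumptions for illustration only.
import base64

import numpy as np
from flask import Flask
from flask_socketio import SocketIO, emit

from unified_asr import UnifiedASR

app = Flask(__name__)
socketio = SocketIO(app)
asr = UnifiedASR(mode='web')

@socketio.on('audio_chunk')
def handle_audio_chunk(payload):
    # Assume the browser sends base64-encoded 16-bit PCM plus its sample rate
    pcm = base64.b64decode(payload['audio'])
    audio = np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0
    result = asr.process_audio(audio, payload.get('sample_rate', 16000))
    # Push the JSON result straight back to the same client
    emit('transcript', result)

if __name__ == '__main__':
    socketio.run(app, host='0.0.0.0', port=5000)
```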
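The optional LLM post-processing step in the diagram can be pictured as a single call to Ollama's HTTP API. This is a hedged sketch of the idea, not the code in llm_processor.py; the prompt wording and helper name are made up:

```python
# Illustrative sketch of LLM post-processing via Ollama's HTTP API.
# Not the actual llm_processor.py code; the prompt is an example.
import requests

def correct_transcript(raw_text: str, model: str = "gemma2:2b") -> str:
    """Ask a local Ollama model to fix punctuation and obvious ASR errors."""
    prompt = (
        "Fix punctuation, casing, and obvious transcription errors in the "
        "following text. Return only the corrected text.\n\n" + raw_text
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False,
              "options": {"temperature": 0.1}},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()

print(correct_transcript("so um the meeting is at ten thirty tomorrow"))
```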
All functionality uses a single unified_asr.py implementation that works in three modes:
from unified_asr import UnifiedASR
asr = UnifiedASR(mode='cli')
asr.start_cli()  # Continuous terminal transcription

# Web mode: process audio and get a JSON result
asr = UnifiedASR(mode='web')
result = asr.process_audio(audio_data, sample_rate)
# Returns JSON with transcription + metadata

# Streaming mode: transcripts are delivered to a callback
def callback(result):
    print(result['transcription'])

asr = UnifiedASR(mode='stream')
asr.start_stream(callback=callback)

All modes return consistent JSON:
{
  "success": true,
  "transcription": "raw whisper output",
  "corrected_text": "LLM-corrected version",
  "timestamp": "2025-10-12T10:30:00Z",
  "audio": {
    "path": "/path/to/clip.wav",
    "url": "/audio/2025-10-12/clip_abc123.wav",
    "duration": 3.5,
    "sample_rate": 16000
  },
  "metadata": {
    "date": "2025-10-12",
    "uuid": "abc123",
    "llm_applied": true,
    "language": "en",
    "mode": "stream"
  }
}

Edit config.ini to customize:
[VAD]
speech_threshold = 0.5 # Higher = less sensitive
min_speech_duration_s = 0.3 # Minimum speech length
max_silence_duration_s = 0.8 # Pause before ending segment
[Audio]
sample_rate = 16000
leading_silence_s = 0.3 # Silence before speech
trailing_silence_s = 0.3 # Silence after speech
[Whisper]
model_size = base # tiny, base, small, medium, large
device = cpu # cpu or cuda
compute_type = int8 # int8, float16, float32
[LLM]
enabled = true
model = gemma2:2b
temperature = 0.1

Environment variables override config.ini settings:
export LISTENR_STORAGE=~/my_recordings # Storage directory
export LISTENR_PORT=8080 # Server port
export LISTENR_HOST=0.0.0.0 # Server host
export LISTENR_USE_LLM=true # Enable LLM

All recordings are automatically organized by date:
~/listenr_web/
├── audio/
│ └── 2025-10-12/
│ ├── clip_2025-10-12_abc123.wav
│ └── clip_2025-10-12_def456.wav
└── transcripts/
└── 2025-10-12/
├── transcript_2025-10-12_abc123.json
└── transcript_2025-10-12_def456.json
Each transcript JSON includes:
- Raw transcription
- LLM-corrected text (if enabled)
- Audio file path and URL
- Duration, sample rate
- Timestamp, UUID
- Language detection
- All metadata
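Because everything lands on disk as plain JSON, saved sessions are easy to post-process. A small sketch, assuming the default ~/listenr_web layout and the JSON fields shown earlier:

```python
# Sketch: collect all transcripts for one day from the default storage layout.
import json
from pathlib import Path

day = "2025-10-12"
transcript_dir = Path.home() / "listenr_web" / "transcripts" / day

for path in sorted(transcript_dir.glob("transcript_*.json")):
    record = json.loads(path.read_text())
    # Prefer the LLM-corrected text when it exists
    text = record.get("corrected_text") or record.get("transcription", "")
    print(f"{record.get('timestamp', '?')}  {text}")
```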
- Speak Clearly: Normal conversational pace works best
- Pause Between Thoughts: Helps VAD segment naturally (0.5-1s)
- Quiet Background: Reduces false positives
- Good Microphone: Phone/laptop mics work fine, headset is better
- Local Network: Keep phone and server on same network for best latency
If you get too many or too few segments, tune the VAD settings:
# More sensitive (segments shorter speech)
[VAD]
speech_threshold = 0.3
max_silence_duration_s = 0.5
# Less sensitive (waits longer)
[VAD]
speech_threshold = 0.7
max_silence_duration_s = 1.5

[Whisper]
device = cuda # Use GPU
compute_type = float16 # GPU precision

Much faster transcription on NVIDIA GPUs!
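Under the hood this maps to loading the faster-whisper model on the GPU. A roughly equivalent direct call is sketched below (assuming a CUDA-capable GPU with the libraries faster-whisper needs, and an example recording.wav):

```python
# Rough equivalent of the GPU config above, using faster-whisper directly.
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cuda", compute_type="float16")

segments, info = model.transcribe("recording.wav")
print(f"Detected language: {info.language}")
for seg in segments:
    print(f"[{seg.start:.1f}s -> {seg.end:.1f}s] {seg.text.strip()}")
```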
WebSocket keeps disconnecting:
- Check firewall settings
- Verify network stability
- Check server logs for errors
Microphone not working:
- Grant browser microphone permissions
- Check if another app is using the mic
- Try HTTPS for remote access (required by browsers)
Transcripts are delayed:
- Use smaller Whisper model (base or tiny)
- Enable GPU if available
- Check CPU usage
- Reduce max_silence_duration_s for faster segmentation
Too many or too few segments:
- Adjust speech_threshold in config.ini
- Adjust max_silence_duration_s
- Check background noise levels
from unified_asr import UnifiedASR
# Create ASR instance
asr = UnifiedASR(
    mode='web',
    use_llm=True,
    storage_base='~/my_storage'
)
# Process audio file
import soundfile as sf
audio, sr = sf.read('recording.wav')
result = asr.process_audio(audio, sr, metadata={'user': 'john'})
print(result['transcription'])
print(result['audio']['path'])

# Streaming mode with a custom callback
def my_callback(result):
print(f"[{result['timestamp']}] {result['transcription']}")
# Send to database, API, etc.
asr = UnifiedASR(mode='stream')
asr.start_stream(callback=my_callback)

Further reading:
- LIVE_STREAMING.md - Deep dive on live streaming mode
- config.ini - All configuration options
- prd/ - Original design documents
Mozilla Public License Version 2.0 - see LICENSE
- OpenAI Whisper - Speech recognition
- faster-whisper - Fast inference
- Silero VAD - Voice activity detection
- Ollama - Local LLM inference
Enjoy hands-free, continuous voice transcription! 🎤