Listenr - Live Voice Transcription

Hands-free, continuous voice transcription powered by Whisper ASR

Listenr provides real-time speech-to-text transcription with automatic speech detection. No more clicking start/stop buttons - just speak naturally and watch your words appear instantly!

Features

  • 🎤 Continuous Listening: Microphone stays open, transcripts appear automatically
  • 🎯 Smart Speech Detection: Silero VAD automatically segments your speech
  • 🚀 Real-Time: WebSocket streaming for minimal latency (~500ms-2s)
  • 📱 Mobile-First: Beautiful touch-optimized interface for phones
  • 🤖 Optional LLM: Post-process with Ollama for improved accuracy
  • 💾 Auto-Save: All audio clips and transcripts saved with metadata
  • 🌐 JSON API: Consistent structured responses everywhere

Quick Start

Installation

# Install dependencies
pip install -r server/requirements.txt

# Install ffmpeg (if not already installed)
# Linux:
sudo apt install ffmpeg
# macOS:
brew install ffmpeg

Run the Server

# Basic usage
python server/app.py

# Open in browser:
# - Computer: http://localhost:5000
# - Phone: http://YOUR_IP:5000

With LLM (Optional)

# Install Ollama from https://ollama.ai
curl -fsSL https://ollama.ai/install.sh | sh

# Pull a model
ollama pull gemma2:2b

# Run with LLM enabled
export LISTENR_USE_LLM=true
python server/app.py
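
Under the hood, the LLM pass rewrites the raw Whisper output. Here is a minimal sketch of that step, assuming the ollama Python package; the prompt and helper name are illustrative, and the real logic lives in llm_processor.py:

# Hypothetical post-processing step; see llm_processor.py for the real one.
# Requires: pip install ollama
import ollama

def correct_transcript(text, model='gemma2:2b'):
    response = ollama.chat(
        model=model,
        messages=[
            {'role': 'system',
             'content': 'Fix punctuation and obvious transcription errors. '
                        'Return only the corrected text.'},
            {'role': 'user', 'content': text},
        ],
        options={'temperature': 0.1},  # matches the [LLM] default in config.ini
    )
    return response['message']['content']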

Usage

Web Interface

  1. Open http://localhost:5000 in your browser
  2. Click the big microphone button once
  3. Grant microphone permissions
  4. Start talking naturally
  5. Pause between thoughts
  6. Watch transcripts appear automatically!

That's it! No more clicking buttons. The system automatically:

  • Detects when you start speaking
  • Records your speech
  • Detects when you pause/finish
  • Transcribes the audio
  • Displays the transcript
  • Saves everything to disk

Command Line

Use the unified ASR system directly from the terminal:

# CLI mode (continuous terminal transcription)
python unified_asr.py

# With LLM
python unified_asr.py --llm

# Custom storage
python unified_asr.py --storage ~/my_clips

Mobile Usage

Perfect for hands-free recording on your phone:

  1. Start server on your computer
  2. Find your IP: ip addr show | grep inet
  3. Open http://YOUR_IP:5000 on your phone
  4. Tap mic button once
  5. Put phone in pocket
  6. Start talking!

All transcripts save automatically with audio clips and metadata.

Architecture

Clean Separation

listenr/
├── unified_asr.py          # Core ASR system (Whisper + VAD + JSON)
├── config_manager.py       # Configuration management
├── llm_processor.py        # Optional LLM post-processing
├── server/
│   ├── app.py              # Flask WebSocket server
│   ├── requirements.txt    # Python dependencies
│   └── templates/
│       └── index.html      # Web interface
└── README.md               # This file

Data Flow

Browser Microphone
    ↓ (continuous audio)
WebSocket Connection
    ↓ (base64 chunks)
UnifiedASR.process_vad_chunk()
    ↓ (VAD segmentation)
Whisper Transcription
    ↓ (optional)
LLM Post-Processing
    ↓
JSON Response → WebSocket → Browser
    ↓
Display + Save to Disk
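
A minimal sketch of the server-side half of this flow, assuming Flask-SocketIO; the event name and payload shape are illustrative, and the real handler lives in server/app.py:

# Hypothetical WebSocket handler wiring the diagram together.
import base64

from flask import Flask
from flask_socketio import SocketIO, emit

from unified_asr import UnifiedASR

app = Flask(__name__)
socketio = SocketIO(app)
asr = UnifiedASR(mode='web')

@socketio.on('audio_chunk')                          # event name is illustrative
def handle_audio_chunk(message):
    audio_bytes = base64.b64decode(message['data'])  # browser sends base64 chunks
    result = asr.process_vad_chunk(audio_bytes)      # VAD buffers until a segment ends
    if result is not None:                           # completed segment: JSON response
        emit('transcript', result)

if __name__ == '__main__':
    socketio.run(app, host='0.0.0.0', port=5000)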

Unified ASR System

All functionality uses a single unified_asr.py implementation that works in three modes:

CLI Mode

from unified_asr import UnifiedASR

asr = UnifiedASR(mode='cli')
asr.start_cli()  # Continuous terminal transcription

Web Mode (Single File)

asr = UnifiedASR(mode='web')
result = asr.process_audio(audio_data, sample_rate)
# Returns JSON with transcription + metadata

Stream Mode (Continuous)

def callback(result):
    print(result['transcription'])

asr = UnifiedASR(mode='stream')
asr.start_stream(callback=callback)

All modes return consistent JSON:

{
  "success": true,
  "transcription": "raw whisper output",
  "corrected_text": "LLM-corrected version",
  "timestamp": "2025-10-12T10:30:00Z",
  "audio": {
    "path": "/path/to/clip.wav",
    "url": "/audio/2025-10-12/clip_abc123.wav",
    "duration": 3.5,
    "sample_rate": 16000
  },
  "metadata": {
    "date": "2025-10-12",
    "uuid": "abc123",
    "llm_applied": true,
    "language": "en",
    "mode": "stream"
  }
}
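
One way a client might consume that payload, preferring the LLM-corrected text when it exists (the helper name is illustrative):

def best_text(result):
    """Return the most polished text available from a result dict."""
    if not result.get('success'):
        return None
    # corrected_text is only meaningful when the LLM pass ran
    return result.get('corrected_text') or result.get('transcription')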

Configuration

Edit config.ini to customize:

[VAD]
speech_threshold = 0.5        # Higher = less sensitive
min_speech_duration_s = 0.3   # Minimum speech length
max_silence_duration_s = 0.8  # Pause before ending segment

[Audio]
sample_rate = 16000
leading_silence_s = 0.3       # Silence before speech
trailing_silence_s = 0.3      # Silence after speech

[Whisper]
model_size = base             # tiny, base, small, medium, large
device = cpu                  # cpu or cuda
compute_type = int8           # int8, float16, float32

[LLM]
enabled = true
model = gemma2:2b
temperature = 0.1

Environment variables override config:

export LISTENR_STORAGE=~/my_recordings  # Storage directory
export LISTENR_PORT=8080                # Server port
export LISTENR_HOST=0.0.0.0             # Server host
export LISTENR_USE_LLM=true             # Enable LLM
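
A sketch of that resolution order; the real logic lives in config_manager.py, and the [Server] section and helper below are assumptions for illustration:

# Hypothetical settings lookup: environment variable wins, then config.ini.
import configparser
import os

_config = configparser.ConfigParser()
_config.read('config.ini')

def get_setting(section, key, env_var, default=None):
    """Environment variable wins; fall back to config.ini, then the default."""
    env_value = os.environ.get(env_var)
    if env_value is not None:
        return env_value
    return _config.get(section, key, fallback=default)

port = int(get_setting('Server', 'port', 'LISTENR_PORT', default='5000'))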

Storage

All recordings are automatically organized by date:

~/listenr_web/
├── audio/
│   └── 2025-10-12/
│       ├── clip_2025-10-12_abc123.wav
│       └── clip_2025-10-12_def456.wav
└── transcripts/
    └── 2025-10-12/
        ├── transcript_2025-10-12_abc123.json
        └── transcript_2025-10-12_def456.json
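
The naming scheme can be reproduced in a few lines; this helper is hypothetical, and the actual paths are built inside unified_asr.py:

import uuid
from datetime import date
from pathlib import Path

def clip_paths(base='~/listenr_web'):
    """Build matching audio/transcript paths for a new clip."""
    today = date.today().isoformat()   # e.g. 2025-10-12
    clip_id = uuid.uuid4().hex[:6]     # e.g. abc123
    root = Path(base).expanduser()
    audio = root / 'audio' / today / f'clip_{today}_{clip_id}.wav'
    transcript = root / 'transcripts' / today / f'transcript_{today}_{clip_id}.json'
    return audio, transcript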

Each transcript JSON includes:

  • Raw transcription
  • LLM-corrected text (if enabled)
  • Audio file path and URL
  • Duration, sample rate
  • Timestamp, UUID
  • Language detection
  • All metadata

Tips

For Best Results

  1. Speak Clearly: Normal conversational pace works best
  2. Pause Between Thoughts: Helps VAD segment naturally (0.5-1s)
  3. Quiet Background: Reduces false positives
  4. Good Microphone: Phone/laptop mics work fine; a headset is better
  5. Local Network: Keep your phone and the server on the same network for the best latency

Tuning VAD

If you get too many or too few segments, tune these settings (a sketch of the loop they control follows the examples):

# More sensitive (segments shorter speech)
[VAD]
speech_threshold = 0.3
max_silence_duration_s = 0.5

# Less sensitive (waits longer)
[VAD]
speech_threshold = 0.7
max_silence_duration_s = 1.5
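
To make the two knobs concrete, here is a toy version of the segmentation loop they control; the function and frame size are illustrative, not Listenr's internals:

def segment_frames(frames, speech_probs, threshold=0.5,
                   max_silence_s=0.8, frame_s=0.032):
    """Yield speech segments: start on speech, end after enough silence."""
    segment, silence = [], 0.0
    for frame, prob in zip(frames, speech_probs):
        if prob >= threshold:            # higher threshold = less sensitive
            segment.append(frame)
            silence = 0.0
        elif segment:
            segment.append(frame)
            silence += frame_s
            if silence >= max_silence_s: # longer wait = fewer, longer segments
                yield segment
                segment, silence = [], 0.0
    if segment:                          # flush trailing segment at end of stream
        yield segment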

GPU Acceleration

[Whisper]
device = cuda           # Use GPU
compute_type = float16  # GPU precision

Much faster transcription on NVIDIA GPUs!

Troubleshooting

WebSocket keeps disconnecting:

  • Check firewall settings
  • Verify network stability
  • Check server logs for errors

Microphone not working:

  • Grant browser microphone permissions
  • Check if another app is using the mic
  • Use HTTPS for remote access (browsers require a secure context for microphone access outside localhost)

Transcripts are delayed:

  • Use a smaller Whisper model (base or tiny)
  • Enable GPU if available
  • Check CPU usage
  • Reduce max_silence_duration_s for faster segmentation

Too many or too few segments:

  • Adjust speech_threshold in config.ini
  • Adjust max_silence_duration_s
  • Check background noise levels

Advanced Usage

Custom Integration

import soundfile as sf

from unified_asr import UnifiedASR

# Create ASR instance
asr = UnifiedASR(
    mode='web',
    use_llm=True,
    storage_base='~/my_storage'
)

# Process an audio file
audio, sr = sf.read('recording.wav')
result = asr.process_audio(audio, sr, metadata={'user': 'john'})

print(result['transcription'])
print(result['audio']['path'])

Custom Callback

def my_callback(result):
    print(f"[{result['timestamp']}] {result['transcription']}")
    # Send to database, API, etc.

asr = UnifiedASR(mode='stream')
asr.start_stream(callback=my_callback)

Documentation

  • LIVE_STREAMING.md - Deep dive on live streaming mode
  • config.ini - All configuration options
  • prd/ - Original design documents

License

Mozilla Public License Version 2.0 - see LICENSE

Acknowledgments

  • OpenAI Whisper - speech recognition
  • Silero VAD - voice activity detection
  • Ollama - optional LLM post-processing

Enjoy hands-free, continuous voice transcription! 🎤
