EchoScribe is a high-performance, self-hosted web application for both real-time and batch audio transcription. It combines Faster-Whisper, Silero VAD, and a modern web stack to provide fast, accurate, and private transcription.
The interface allows you to select different Whisper models, choose your compute device (CPU or CUDA-enabled GPU), and fine-tune VAD parameters for optimal performance.
Here's a quick look at how to use EchoScribe's real-time and batch transcription features.
- Real-time Transcription: Speak into your microphone and see the transcription appear live
- Batch Processing: Upload audio files and get the full transcription in the sidebar
- High Performance: Uses `faster-whisper` for optimized CTranslate2-based inference
- Voice Activity Detection (VAD): Smartly chunks audio using Silero VAD to transcribe only when speech is detected, improving accuracy and reducing processing
- Configurable:
  - Choose from various Whisper models (from `tiny` to `large-v3` and `distil-large-v3`)
  - Select compute device (`CPU` or `CUDA`)
  - Adjust VAD parameters like silence duration and speech probability threshold
  - Configure API key authentication for secure access
  - Set rate limiting for API endpoints and file uploads
  - Customize session TTL and cleanup intervals
- Security Features:
  - Optional API key authentication with environment variable support
  - Configurable rate limiting for API, uploads, and WebSocket connections
  - Trusted-proxy aware client IP resolution
  - Redacted settings responses and one-time WebSocket auth tokens
  - File validation and path traversal protection
  - Secure constant-time string comparison for authentication
- Download Recordings: After a real-time session, download your recording as an MP3 file
- Export Transcripts: Easily copy the transcript or download it as a `.txt` file
- Modern UI: Clean and intuitive interface built with Tailwind CSS
- Well-Tested: Comprehensive test suite with 170+ tests covering core runtime and security paths
EchoScribe's architecture is designed for low-latency real-time processing. It uses a multi-stage, asynchronous pipeline on the backend.
The real-time transcription process involves a continuous flow of data from the client's microphone to the server, through a processing pipeline, and back to the client's screen.
```mermaid
graph TD
    subgraph "Client-Side (Browser)"
        A[Microphone] --> B{AudioWorklet}
        B --> |16-bit PCM chunks| C[WebSocket Connection]
        C --> K[UI Update]
        K --> L[Display Transcript]
    end
    subgraph "Server-Side (FastAPI Backend)"
        C --> D{WebSocket Ingestion}
        D --> E[Raw Audio Queue]
        E --> F{VAD Chunking Task}
        F -- Speech Utterance --> G[Transcription Queue]
        G --> H{Whisper Worker Task}
        H -- Transcribed Text --> I[Results Queue]
        I --> J{WebSocket Emitter}
        J --> C
    end
```
- Client (Browser): The AudioWorklet captures audio from the microphone, downsamples it to 16 kHz, and converts it to 16-bit PCM audio chunks.
- WebSocket Connection: These raw audio chunks are sent to the backend over a persistent WebSocket connection.
- Backend Pipeline:
  - Ingestion: The server receives the audio chunks and places them into a raw audio queue.
  - VAD Chunking: A dedicated task pulls from this queue and uses the Silero VAD model to detect speech. It buffers audio until it detects a pause (end of an utterance).
  - Transcription: Once a complete utterance is buffered, it is sent to the transcription queue. A worker task picks it up and transcribes it using the selected `faster-whisper` model.
  - Emitter: The resulting text is placed in a results queue. Another task sends this text back to the client over the same WebSocket.
- UI Update: The client receives the transcribed text and updates the user interface in real time.
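
The queue-based stages described above can be sketched with `asyncio`. This is a minimal illustration, not EchoScribe's actual code: `is_speech` and `transcribe` are hypothetical stand-ins for Silero VAD and faster-whisper, a `None` sentinel marks the end of the stream, and the emitter appends to a list where the real backend would send over the WebSocket.

```python
import asyncio


def is_speech(chunk: bytes) -> bool:
    """Toy VAD (assumption): any non-zero byte counts as speech."""
    return any(chunk)


def transcribe(utterance: bytes) -> str:
    """Toy transcriber (assumption): report the utterance size, not real text."""
    return f"<{len(utterance)} bytes of speech>"


async def vad_chunker(raw_q: asyncio.Queue, utt_q: asyncio.Queue) -> None:
    """Buffer speech chunks and flush them as one utterance on the first silent chunk."""
    buffer = bytearray()
    while True:
        chunk = await raw_q.get()
        if chunk is None:  # end-of-stream sentinel
            if buffer:
                await utt_q.put(bytes(buffer))
            await utt_q.put(None)
            return
        if is_speech(chunk):
            buffer.extend(chunk)
        elif buffer:
            await utt_q.put(bytes(buffer))
            buffer.clear()


async def whisper_worker(utt_q: asyncio.Queue, results_q: asyncio.Queue) -> None:
    """Transcribe each utterance off the event loop so other stages keep running."""
    while True:
        utterance = await utt_q.get()
        if utterance is None:
            await results_q.put(None)
            return
        await results_q.put(await asyncio.to_thread(transcribe, utterance))


async def emitter(results_q: asyncio.Queue, sent: list[str]) -> None:
    """Forward results to the client; the list stands in for a WebSocket send."""
    while True:
        text = await results_q.get()
        if text is None:
            return
        sent.append(text)


async def run_pipeline(chunks: list[bytes]) -> list[str]:
    raw_q, utt_q, results_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    sent: list[str] = []
    tasks = [
        asyncio.create_task(vad_chunker(raw_q, utt_q)),
        asyncio.create_task(whisper_worker(utt_q, results_q)),
        asyncio.create_task(emitter(results_q, sent)),
    ]
    for chunk in chunks:  # ingestion stage
        await raw_q.put(chunk)
    await raw_q.put(None)
    await asyncio.gather(*tasks)
    return sent
```

Because each stage only talks to its neighbours through a queue, a slow transcription does not block ingestion or VAD chunking; audio simply accumulates in the upstream queue until the worker catches up.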
The batch transcription process offloads the work to a background task, allowing you to upload large files without blocking the server. You can poll the status of the job to get the result when it's ready.
```mermaid
sequenceDiagram
    participant User as User's Browser
    participant API as FastAPI Backend
    participant Worker as Background Task
    User->>+API: POST /api/transcribe (audio file)
    API-->>-User: { "job_id": "..." }
    API->>Worker: Run transcription(job_id, file)
    loop Polling every 2s
        User->>+API: GET /api/transcribe/status/{job_id}
        API-->>-User: { "status": "processing" }
    end
    Note right of Worker: Transcription in progress...
    Worker-->>API: Transcription complete
    Note right of API: Update job status to 'completed'
    User->>+API: GET /api/transcribe/status/{job_id}
    API-->>-User: { "status": "completed", "result": "..." }
```
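
A client-side polling loop for this flow might look like the sketch below. It assumes only the two response shapes shown in the diagram (`"processing"` and `"completed"` with a `"result"` field); any other status is treated as "keep waiting", which may not match how the server reports failures. The function names and the timeout are illustrative, and `http_fetch_status` sends no `X-API-Key` header, so it only applies when authentication is disabled.

```python
import json
import time
import urllib.request
from typing import Callable


def http_fetch_status(job_id: str, base_url: str = "http://localhost:8000") -> dict:
    """Fetch the job status over HTTP (no auth header; assumes auth is disabled)."""
    url = f"{base_url}/api/transcribe/status/{job_id}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def poll_job(
    fetch_status: Callable[[str], dict],
    job_id: str,
    interval: float = 2.0,   # matches the 2s polling interval in the diagram
    timeout: float = 600.0,  # illustrative upper bound, not an EchoScribe setting
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reports 'completed', then return its result."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        payload = fetch_status(job_id)
        if payload.get("status") == "completed":
            return payload["result"]
        sleep(interval)
    raise TimeoutError(f"job {job_id} did not complete within {timeout}s")
```

Typical use would be `poll_job(http_fetch_status, job_id)` with the `job_id` returned by `POST /api/transcribe`. Passing the fetcher as a parameter keeps the loop easy to test without a running server.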
The application exposes several RESTful and WebSocket endpoints to power the frontend.
| Method | Path | Description |
|---|---|---|
| GET | `/` | Serves the main HTML frontend. |
| GET | `/api/config` | Provides available models, compute devices, and languages to the client. |
| GET | `/api/settings` | Retrieves settings from `config.yaml` (API key redacted; authenticated when auth is enabled). |
| POST | `/api/settings` | Updates and saves new settings to `config.yaml`. |
| POST | `/api/ws-auth-token` | Issues a short-lived WebSocket auth token (requires `X-API-Key` when auth is enabled). |
| POST | `/api/transcribe` | Uploads an audio file for batch transcription. Returns a `job_id`. |
| GET | `/api/transcribe/status/{job_id}` | Polls the status and result of a batch transcription job. |
| DELETE | `/api/transcribe/{job_id}` | Cancels a pending or running batch transcription job. |
| GET | `/api/model/status` | Returns cache and download status for a specific model. |
| WEBSOCKET | `/ws/{session_id}` | Real-time connection; first config message must include `auth_token` from `/api/ws-auth-token`. |
| GET | `/download/{session_id}` | Downloads the complete audio recording of a real-time session as an MP3. |
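
The short-lived token issued by `/api/ws-auth-token` and checked on the first WebSocket message suggests a store along the following lines. This is a sketch under stated assumptions rather than the project's implementation: the class name, the 30-second lifetime, and the in-memory dictionary are all hypothetical.

```python
import secrets
import time
from typing import Callable


class OneTimeTokenStore:
    """In-memory store for tokens that are valid once and only for a short time."""

    def __init__(self, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds  # assumed lifetime; the real value is not documented here
        self._clock = clock
        self._expiry: dict[str, float] = {}

    def issue(self) -> str:
        """Create an unguessable token and remember when it expires."""
        token = secrets.token_urlsafe(32)
        self._expiry[token] = self._clock() + self._ttl
        return token

    def consume(self, token: str) -> bool:
        """Return True at most once per token, and only before it expires."""
        expires_at = self._expiry.pop(token, None)
        return expires_at is not None and self._clock() < expires_at
```

Removing the token on the first lookup, whether or not it is still valid, is what makes it single-use: a replayed token is rejected even if it has not yet expired.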
Apple Silicon (M1/M2/M3) is NOT supported for GPU acceleration.
Mac users should select the cpu device. The application will default to CPU on macOS.
- Python: Version 3.11+ is recommended.
- Git: To clone the repository.
- uv: A fast Python package installer and resolver. Install uv.
- FFmpeg: This is a system dependency and must be installed separately.
  - Ubuntu/Debian: `sudo apt update && sudo apt install ffmpeg`
  - macOS (with Homebrew): `brew install ffmpeg`
  - Windows: Download from the official site and add the `bin` directory to your system's PATH.
- Clone the repository:
  ```bash
  git clone https://github.com/bhattaraiprayag/echoscribe.git
  cd echoscribe
  ```
- Sync dependencies: EchoScribe uses `uv` for dependency management. This command creates a virtual environment and installs all dependencies (including hardware-optimized PyTorch versions).
  ```bash
  uv sync
  ```
- Download the VAD model: The Silero VAD model is loaded through the pinned `silero-vad` package (no runtime `torch.hub` master-zip fetch).
  ```bash
  uv run python backend/get_vad.py
  ```
- Start the server:
  - For development (with auto-reloading):
    ```bash
    uv run uvicorn backend.main:app --host 0.0.0.0 --port 8000 --reload
    ```
  - For production:
    ```bash
    uv run uvicorn backend.main:app --host 0.0.0.0 --port 8000
    ```
- Open the web interface: Open your browser and navigate to http://localhost:8000.
EchoScribe can be deployed using Docker for easier setup and isolation.
- Docker: Version 20.10+ recommended
- Docker Compose: Version 2.0+ recommended
- NVIDIA Container Toolkit (optional): Required for GPU acceleration in Docker
- Clone and navigate to the repository:
  ```bash
  git clone https://github.com/bhattaraiprayag/echoscribe.git
  cd echoscribe
  ```
- Start the application:
  ```bash
  docker-compose up -d
  ```
- Access the application: Open your browser and navigate to http://localhost:8000.
- View logs:
  ```bash
  docker-compose logs -f
  ```
- Stop the application:
  ```bash
  docker-compose down
  ```
To enable GPU acceleration in Docker:
- Install NVIDIA Container Toolkit:
  ```bash
  # Ubuntu/Debian
  distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
  curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
  curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
  sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
  sudo systemctl restart docker
  ```
- Modify `docker-compose.yml` to enable GPU:
  ```yaml
  services:
    echoscribe:
      # ... existing config ...
      deploy:
        resources:
          reservations:
            devices:
              - driver: nvidia
                count: 1
                capabilities: [gpu]
  ```
- Verify GPU access:
  ```bash
  docker-compose exec echoscribe nvidia-smi
  ```
Configure the application using environment variables:
| Variable | Description | Default |
|---|---|---|
| `ECHOSCRIBE_API_KEY` | API key for authentication (overrides config) | (empty) |
| `MODELS_CACHE_DIR` | Directory for model cache | `./models_cache` |
Example:
```bash
ECHOSCRIBE_API_KEY=your-secret-key docker-compose up -d
```
The Docker setup uses volumes for persistent storage:
- `models_cache`: Stores downloaded Whisper models (prevents re-download)
On container startup, the entrypoint ensures the mounted models_cache directory is
owned by the non-root application user so Docker bind mounts do not block model
downloads or cache repair.
The make docker-run target reuses the same tagged image as make docker-build
(echoscribe:test by default) before starting the Compose stack.
To ensure everything is working correctly, run the test and quality checks from the repository root:
```bash
make sync          # Install dependencies
make lint          # Ruff lint checks
make format-check  # Formatting gate
make pre-commit    # Run all hooks
make test          # Full pytest suite
make coverage      # Coverage report + fail-under gate
make smoke         # Startup smoke test (backend.main:app)
make docker-build  # Docker image build verification
make docker-run    # Build and start the Docker stack
make docker-up     # Alias for docker-run
make clean         # Safely remove pycache/test/build artifacts
```
Test Coverage:
- API endpoint testing
- Authentication and authorization
- Rate limiting and security
- File validation and sanitization
- Session and job cleanup
- Configuration management
- Model caching and concurrency
- Real-time transcription pipeline
- VAD chunking and batching
You can adjust the default application behavior by editing the backend/config.yaml file or using the /api/settings endpoint. This is particularly useful for fine-tuning the Voice Activity Detection (VAD) for your specific microphone or environment.
VAD Parameters:
- `prob_threshold` (0.1-0.9): Speech probability threshold (higher values are stricter, default: 0.6)
- `silence_duration` (0.1-5.0s): Seconds of silence to trigger end of utterance (default: 0.7)
- `min_speech_duration` (0.1-2.0s): Minimum speech segment length for transcription (default: 0.3)

Audio Parameters:
- `channels` (1-2): Number of audio channels (default: 1)
- `sample_rate` (8000-48000Hz): Audio sample rate (default: 16000)
- `sample_width` (1-4 bytes): Bytes per sample (default: 2)

Transcription Parameters:
- `context_max_length` (0-500): Maximum context length for Whisper to maintain continuity (default: 224)

Cleanup Parameters:
- `session_ttl_minutes` (1-1440): Session time-to-live in minutes (default: 60)
- `job_retention_minutes` (1-1440): Completed job retention time (default: 120)
- `cleanup_interval_seconds` (60-3600): Cleanup task interval (default: 300)

Authentication (Optional):
- `enabled` (true/false): Enable API key authentication (default: false)
- `api_key`: Your API key (can be overridden with `ECHOSCRIBE_API_KEY` environment variable)

Rate Limiting:
- `enabled` (true/false): Enable rate limiting (default: true)
- `requests_per_minute`: API requests per IP per minute (default: 100)
- `uploads_per_minute`: File uploads per IP per minute (default: 10)
- `websocket_connections_per_ip`: Concurrent WebSocket sessions per client IP (default: 5)
- `trusted_proxies`: CIDRs/IPs allowed to supply `X-Forwarded-For` (default: `[]`)

Upload Parameters:
- `max_file_size_mb`: Maximum accepted upload size in megabytes (default: 100)
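
The numeric ranges listed above lend themselves to a simple bounds check before settings are saved. The sketch below is illustrative only: the flat key layout and the function name are assumptions, and the actual structure of `config.yaml` and the validation behind `/api/settings` may differ.

```python
# Ranges copied from the parameter lists above; the flat key names are a simplification.
RANGES: dict[str, tuple[float, float]] = {
    "prob_threshold": (0.1, 0.9),
    "silence_duration": (0.1, 5.0),
    "min_speech_duration": (0.1, 2.0),
    "context_max_length": (0, 500),
    "session_ttl_minutes": (1, 1440),
    "job_retention_minutes": (1, 1440),
    "cleanup_interval_seconds": (60, 3600),
}


def validate_settings(settings: dict[str, float]) -> list[str]:
    """Return one human-readable error per unknown or out-of-range setting."""
    errors: list[str] = []
    for key, value in settings.items():
        if key not in RANGES:
            errors.append(f"{key}: unknown setting")
            continue
        low, high = RANGES[key]
        if not low <= value <= high:
            errors.append(f"{key}: {value} is outside [{low}, {high}]")
    return errors
```

Collecting every error instead of stopping at the first one lets a client fix all rejected fields in a single round trip.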
EchoScribe includes several security features:
- API Key Authentication: Optional authentication via `X-API-Key` header with environment variable support
- Rate Limiting: Configurable per-IP limits for API endpoints, uploads, and WebSockets
- Trusted Proxy Enforcement: `X-Forwarded-For` is honored only for configured proxy CIDRs
- Secrets Hygiene: `/api/settings` redacts `auth.api_key` in responses
- WebSocket Hardening: One-time, short-lived auth tokens replace query-string API keys
- Input Validation: Comprehensive validation for file uploads and settings updates
- Path Traversal Protection: Filename sanitization to prevent directory traversal attacks
- Secure Comparisons: Constant-time string comparison for API keys to prevent timing attacks
To enable authentication, set auth.enabled: true in config.yaml and provide an API key either in the config file or via the ECHOSCRIBE_API_KEY environment variable.
Code Quality:
- PEP8 compliant codebase
- Type annotations throughout
- Comprehensive docstrings
- 170+ automated tests
- Double-check locking for model caching
- Async/await for non-blocking I/O
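
Double-check locking for a model cache generally follows the pattern below. This is a generic sketch, not the project's code: `ModelCache` and the `loader` callable are hypothetical, and a single lock guards all models for simplicity.

```python
import threading
from typing import Callable


class ModelCache:
    """Load each model at most once, even when many threads ask for it at the same time."""

    def __init__(self, loader: Callable[[str], object]) -> None:
        self._loader = loader  # hypothetical stand-in for constructing a Whisper model
        self._models: dict[str, object] = {}
        self._lock = threading.Lock()

    def get(self, name: str) -> object:
        model = self._models.get(name)  # first check: fast path without the lock
        if model is not None:
            return model
        with self._lock:
            model = self._models.get(name)  # second check: another thread may have loaded it
            if model is None:
                model = self._loader(name)
                self._models[name] = model
            return model
```

The first check keeps cache hits lock-free, while the second check under the lock prevents two threads that both missed from loading the same model twice.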
Pipeline Architecture:
- Multi-stage async pipeline for real-time processing
- Queue-based communication between stages
- VAD-based intelligent audio chunking
- Transcription context management for accuracy
- Graceful shutdown handling
Contributions are welcome! Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests. Pull requests are expected to follow our DevOps hygiene and quality standards.
This project is licensed under the MIT License - see the LICENSE file for details.
