---
title: Open-ASR Model Explorer
emoji: 🎙
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
A hybrid ASR testbed for evaluating open-source speech recognition models (Whisper, Cohere Transcribe, Qwen3-ASR, IBM Granite) with server-side faster-whisper inference via an async Valkey queue and client-side WebGPU inference via transformers.js.
Create a local environment file from the template:

```shell
cp .env.example .env
```

Set required values in `.env`:

- `HF_TOKEN` — required for gated server-side models (for example Cohere and Granite)
- `GPU_MEMORY_UTILIZATION=0.65` — start conservative; tune upward once stable
- `WHISPER_MODEL=large-v3` — worker model size: `large-v3`, `medium`, `small`, `base`
- `NUM_WORKERS=4` — parallel GPU inference threads in the worker
- `CORS_ALLOWED_ORIGIN=*` — lock to your domain in production
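Putting the variables above together, a minimal `.env` might look like this (all values are illustrative; the token is a placeholder):

```env
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
GPU_MEMORY_UTILIZATION=0.65
WHISPER_MODEL=large-v3
NUM_WORKERS=4
CORS_ALLOWED_ORIGIN=*
```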
```shell
docker compose up --build
```

This launches five services:
| Service | Role | Port |
|---|---|---|
| gateway | Caddy reverse-proxy, TLS termination | :80 / :443 |
| frontend | React SPA + Nginx API proxy | :3000 → Nginx :80 |
| backend | FastAPI — job enqueue, audio normalisation, REST API | :8000 |
| valkey | Valkey (Redis-compatible) job queue + hash store | :6379 |
| worker | faster-whisper GPU inference (BLPOP consumer) | — |
Frontend: http://localhost:3000 · Backend API: http://localhost:8000 · Swagger: http://localhost:8000/docs
```shell
# Backend + Valkey + Worker only (headless API mode)
docker compose up --build backend valkey worker

# Frontend only (requires healthy backend)
docker compose up --build frontend
```

```
┌─────────────────────────────────────────────────────────────────────┐
│                       Browser (React / Vite)                        │
│                                                                     │
│   ┌──────────────┐   Strategy Router   ┌──────────────┐             │
│   │ Model Select │────────────────────►│ InferRouter  │             │
│   └──────────────┘                     └──────┬───────┘             │
│                                               │                     │
│                         ┌─────────────────────┴──────────────────┐  │
│                         │ "WebGPU" in name?            else      │  │
│                         └─────────┬─────────────────────┬────────┘  │
│                                   │                     │           │
│                         ┌─────────▼─────────┐   ┌───────▼──────┐    │
│                         │     WebWorker     │   │ p-limit pool │    │
│                         │ (ONNX + LocalAg2) │   │ (8) + upload │    │
│                         └─────────┬─────────┘   └───────┬──────┘    │
│                                   │                     │           │
│  ┌────────────────────────────────▼─────────────────────▼───────┐   │
│  │   Metrics Dashboard: TTFT · ITL · RTFx · Upload Progress     │   │
│  │   Jobs Sidebar: Progress Bar · Status · Playback             │   │
│  └──────────────────────────────────────────────────────────────┘   │
└──────────────────────────────────┬──────────────────────────────────┘
                                   │  HTTPS (Caddy TLS)
                    ┌──────────────▼──────────────┐
                    │       Caddy Gateway         │
                    │   TLS · HTTP/3 · Routing    │
                    └──────────────┬──────────────┘
                                   │
              ┌────────────────────┴────────────────────┐
              │ /api/*                             /*   │
    ┌─────────▼──────────┐                    ┌─────────▼──────────┐
    │    Nginx Proxy     │                    │    Nginx Static    │
    │   (strips /api/)   │                    │     (React SPA)    │
    └─────────┬──────────┘                    └────────────────────┘
              │
    ┌─────────▼──────────┐     ┌─────────────────┐     ┌──────────────────┐
    │  FastAPI Backend   │────►│      Valkey     │◄────│   Worker (GPU)   │
    │                    │     │  (job queue +   │     │  faster-whisper  │
    │ • POST /transcribe │     │   hash store)   │     │     large-v3     │
    │      /batch        │     │                 │     │                  │
    │ • Audio normalize  │     │  LPUSH → BLPOP  │     │   ThreadPool(4)  │
    │ • Job enqueue      │     │ HSET / HGETALL  │     │  CUDA inference  │
    │ • /jobs CRUD       │     │   EXPIRE 24h    │     │                  │
    │ • /audio playback  │     └─────────────────┘     │  Spool cleanup   │
    └─────────┬──────────┘                             └──────────────────┘
              │
    ┌─────────▼──────────┐
    │ /data/audio_spool  │  (Docker named volume, shared)
    └────────────────────┘
```
Source: Layered architecture drawing JSON
Source: Full-flow architecture drawing JSON
To replay a saved drawing in MCP chat, copy the elements array from one of the JSON files and pass it to the Excalidraw create_view tool.
- Browser splits files into micro-batches of 5 via `p-limit(8)` — max 8 concurrent HTTP requests
- Nginx proxies each `POST /api/transcribe/batch` → FastAPI (strips the `/api/` prefix)
- FastAPI normalises audio to 16 kHz mono WAV, writes it to `/data/audio_spool`, creates a Valkey hash with a 24 h TTL, and `LPUSH`es the job ID to the queue
- Worker `BLPOP`s job IDs, runs faster-whisper inference in a thread pool, and writes the transcript + metrics back to the Valkey hash
- Frontend polls `GET /jobs/{id}` for status updates and displays a progress bar
- Cleanup: spool files are deleted after successful transcription; Valkey hashes expire after 24 h; an orphan sweep runs on `DELETE /jobs`
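The enqueue/consume handshake above can be sketched end to end. This is a toy stand-in, not the actual `app.py`/`worker.py`: an in-memory dict and deque play the role of Valkey, and `fake_transcribe` stands in for faster-whisper, but the command sequence (HSET the job hash, LPUSH the job ID, BLPOP in the worker, HSET the result) mirrors the flow described above.

```python
import json
import uuid
from collections import deque

# Toy stand-ins for the Valkey structures: a hash store and a job-ID queue.
hashes: dict[str, dict] = {}
queue: deque[str] = deque()

def enqueue(filename: str) -> str:
    """Backend side: create the job hash, then push the job ID (HSET + LPUSH)."""
    job_id = uuid.uuid4().hex
    hashes[job_id] = {"status": "queued", "filename": filename}  # HSET
    queue.appendleft(job_id)                                     # LPUSH
    return job_id

def fake_transcribe(filename: str) -> str:
    return f"transcript of {filename}"   # placeholder for faster-whisper

def work_one() -> None:
    """Worker side: pop a job ID (BLPOP, blocking in the real system),
    run inference, and write the result back into the job hash."""
    job_id = queue.pop()
    job = hashes[job_id]
    job["status"] = "processing"
    job["transcript"] = fake_transcribe(job["filename"])
    job["status"] = "completed"          # HSET transcript + final status

job_id = enqueue("clip.wav")
work_one()
print(json.dumps(hashes[job_id]))
```

In the real pipeline the same hash also carries the timing metrics, and `EXPIRE` gives it the 24 h TTL.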
| Model | Execution Mode | Description |
|---|---|---|
| `openai/whisper-base` | Server (HF-GPU / HF-CPU) | Standard HF transformers pipeline |
| `CohereLabs/cohere-transcribe-03-2026` | Server (HF-GPU / HF-CPU) | Custom model wrapper (`trust_remote_code`) |
| `Qwen3-ASR-1.7B` | Server (HF-GPU) | Via `qwen-asr` package (architecture not in transformers) |
| `ibm-granite/granite-4.0-1b-speech` | Server (HF-GPU) | Chat-template + `<\|audio\|>` multimodal wrapper |
| `Xenova/whisper-tiny` / `Xenova/whisper-base` | Client (WebGPU) | Runs in-browser via transformers.js ONNX |
| `onnx-community/cohere-transcribe-03-2026-ONNX` | Client (WebGPU) | Cohere in-browser via transformers.js ONNX |
- Zero-server transcription – audio never leaves the browser
- Persistent ONNX cache – model weights cached via the browser Cache API
- LocalAgreement-2 streaming – unconfirmed partial hypotheses are held back until two consecutive passes agree on a prefix
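The LocalAgreement-2 hold-back rule reduces to a prefix comparison: a partial hypothesis is committed only up to the longest prefix on which the two most recent decoding passes agree. A simplified token-level illustration (not the actual transformers.js worker code):

```python
def agreed_prefix(prev: list[str], curr: list[str]) -> list[str]:
    """Longest common prefix of two consecutive hypotheses (LocalAgreement-2)."""
    out = []
    for a, b in zip(prev, curr):
        if a != b:
            break
        out.append(a)
    return out

# Pass N and pass N+1 over a growing audio buffer:
h1 = "the quick brown fax".split()
h2 = "the quick brown fox jumps".split()

confirmed = agreed_prefix(h1, h2)   # only the stable prefix is emitted
print(" ".join(confirmed))          # → "the quick brown"
```

The unstable tail ("fax" vs "fox jumps") stays unconfirmed until a later pass agrees with it, which is what keeps the streaming transcript from flickering.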
- faster-whisper (CTranslate2) – quantized INT8 inference, up to 4× faster than native PyTorch Whisper
- Pre-batching normalisation – all audio normalised to 16 kHz mono WAV before entering the worker
- Resource constraints – configurable `NUM_WORKERS`, `WHISPER_MODEL`, and `GPU_MEMORY_UTILIZATION` via environment
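The pre-batching normalisation step can be illustrated by the kind of ffmpeg invocation a backend would issue for it (a sketch only — the actual normalisation code in `app.py` may use a different tool or library; the paths are hypothetical):

```python
def ffmpeg_normalize_cmd(src: str, dst: str) -> list[str]:
    """Build an ffmpeg argv that downmixes to mono and resamples to 16 kHz WAV."""
    return [
        "ffmpeg", "-y",    # -y: overwrite output if it exists
        "-i", src,
        "-ac", "1",        # 1 audio channel (mono)
        "-ar", "16000",    # 16 kHz sample rate
        "-f", "wav",       # force WAV container
        dst,
    ]

cmd = ffmpeg_normalize_cmd("upload.m4a", "/data/audio_spool/job123.wav")
# In a real backend: subprocess.run(cmd, check=True)
print(" ".join(cmd))
```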
Affects: Local development on NVIDIA RTX 5070 Ti and other SM 12.0 (Blackwell) GPUs.
The faster-whisper worker uses CTranslate2 with native CUDA kernels and runs reliably on all CUDA 12.x hardware including SM 12.0 Blackwell. No Triton compiler dependency is involved.
For the architectural decision record on engine selection, see ADR-001: Migrate ASR Inference from vLLM to faster-whisper.
model-explorer-open-asr/
├── docker-compose.yml # 5-service orchestration (gateway, frontend, backend, valkey, worker)
├── Caddyfile # Caddy reverse-proxy: TLS + API/UI routing
├── backend/
│ ├── app.py # FastAPI: job enqueue, audio normalisation, CRUD
│ ├── worker.py # BLPOP consumer: faster-whisper GPU inference
│ ├── requirements.txt
│ └── Dockerfile # CUDA 12.8 (Blackwell-compatible)
└── frontend/
├── Dockerfile # Multi-stage Vite build → Nginx serve
├── nginx.conf # API proxy + SSE streaming + SPA fallback
├── index.html
├── package.json
├── vite.config.js
└── src/
├── main.jsx
├── App.jsx
├── index.css
├── components/
│ ├── ModelSelector.jsx
│ ├── AudioRecorder.jsx
│ ├── StagedFiles.jsx # Staging area + upload progress bar
│ ├── JobsList.jsx # Jobs sidebar + completion progress bar
│ ├── MetricsDashboard.jsx
│ └── TranscriptDisplay.jsx
├── services/
│ └── inferenceRouter.js # Strategy routing + p-limit batch upload
└── workers/
└── webgpu.worker.js # transformers.js + LocalAgreement-2
| Metric | Definition |
|---|---|
| TTFT (Time-to-First-Token) | Time from request submission to the first generated token |
| ITL (Inter-Token Latency) | Mean time between successive tokens during the decode phase |
| RTFx (Real-Time Factor) | `audio_duration / processing_time` – values > 1 indicate faster-than-real-time |
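Given raw timestamps, the three metrics in the table reduce to simple arithmetic. The sketch below uses invented timing values and is not the backend's actual metrics code:

```python
def compute_metrics(submit_t: float, token_times: list[float],
                    audio_duration_s: float, done_t: float) -> dict[str, float]:
    """TTFT, mean ITL, and RTFx from raw timestamps (all in seconds)."""
    ttft = token_times[0] - submit_t                          # time to first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps)                               # mean inter-token gap
    rtfx = audio_duration_s / (done_t - submit_t)             # > 1 = faster than real time
    return {"ttft_s": ttft, "itl_s": itl, "rtfx": rtfx}

# Hypothetical run: 30 s of audio transcribed in 2 s wall-clock.
m = compute_metrics(submit_t=0.0, token_times=[0.4, 0.5, 0.6, 0.7],
                    audio_duration_s=30.0, done_t=2.0)
print(m)   # RTFx = 15.0 → 15× faster than real time
```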
- Adding a New ASR Model — step-by-step for HF Transformers, Batch Worker, and WebGPU models
- ADR-001: Migrate ASR Inference Engine from vLLM to faster-whisper — architectural decision and technical rationale
- Production Readiness Roadmap — roadmap for transitioning Open-ASR to a resilient enterprise-grade service
- Production Architecture Diagram — spoolless ingestion, HA queueing, multi-AZ workers, and control-plane observability
- Node.js ≥ 18
- Python 3.11
- (Optional) NVIDIA GPU with CUDA 12.x for server-side faster-whisper inference
```shell
cd backend
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn app:app --reload --host 0.0.0.0 --port 8000
```

The API will be available at http://localhost:8000.
Swagger docs: http://localhost:8000/docs
```shell
cd frontend
npm install
npm run dev
```

The UI will be available at http://localhost:5173.
API requests are proxied to http://localhost:8000 via Vite's dev proxy.
```shell
cd backend
docker build -t open-asr-backend .
docker run --gpus all -p 8000:8000 \
  -v $HOME/.cache/huggingface:/hf_cache \
  open-asr-backend
```

```shell
# From repo root
docker compose up --build
```

The UI will be available at http://localhost:3000 and the API at http://localhost:8000.
- Fork this repository.
- Create a new Space at huggingface.co/new-space with the Docker SDK.
- Set the following Space secrets:
  - `HF_TOKEN` – your Hugging Face token (for gated model downloads)
- The Space Docker container exposes port `7860` (the Hugging Face default).
- Push your fork – Spaces will automatically build the Docker image and start the server.
Synchronous full transcription (in-process HF pipeline).
Same as above but returns Server-Sent Events with token-by-token streaming.
Enqueue a single file for background transcription. Returns { job_id, status: "accepted" }.
Enqueue multiple files. Returns { jobs: [{ id, filename }, ...] }.
List all jobs scoped to a session.
Poll job status. Returns full hash: status, transcript, segments, ttft_ms, itl_ms, processing_time_s, ...
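Client-side polling of `GET /jobs/{id}` can be wrapped in a small helper. In the sketch below the HTTP call is injected as a function, so the loop logic is independent of any particular client library; `fetch_status` here is a stub, not the real API client:

```python
import time
from typing import Callable

def poll_job(fetch_status: Callable[[str], dict], job_id: str,
             interval_s: float = 1.0, max_polls: int = 100) -> dict:
    """Poll until the job reaches a terminal status, then return the full hash."""
    for _ in range(max_polls):
        job = fetch_status(job_id)          # e.g. an HTTP GET /jobs/{job_id}
        if job.get("status") in ("completed", "failed"):
            return job
        time.sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish within {max_polls} polls")

# Stub backend: pretends the job completes on the third poll.
responses = iter([{"status": "queued"}, {"status": "processing"},
                  {"status": "completed", "transcript": "hello world"}])
result = poll_job(lambda _id: next(responses), "job123", interval_s=0.0)
print(result["transcript"])   # → hello world
```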
Delete all session jobs + sweep orphaned spool files.
Delete a single job and its spool file.
Re-enqueue a completed/failed job.
List available models with metadata.
Serve audio from the spool directory (browser playback).
Returns readiness details such as status, mode, loaded_models, and failed_models.
status=ok means at least one real model is ready; status=degraded means no real model is currently serving.
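An illustrative readiness response, using only the fields named above (the values, including `mode`, are invented for illustration):

```json
{
  "status": "ok",
  "mode": "gpu",
  "loaded_models": ["openai/whisper-base", "CohereLabs/cohere-transcribe-03-2026"],
  "failed_models": []
}
```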
Returns the list of available server-side model keys.