Open-ASR Model Explorer

title	Open-ASR Model Explorer
emoji	🎙
colorFrom	indigo
colorTo	purple
sdk	docker
app_port	7860
pinned	false

Open-ASR Model Explorer

A hybrid ASR testbed for evaluating open-source speech recognition models (Whisper, Cohere Transcribe, Qwen3-ASR, IBM Granite) with server-side faster-whisper inference via an async Valkey queue and client-side WebGPU inference via transformers.js.

Quick Start

1) Environment Setup

Create a local environment file from the template:

cp .env.example .env

Set required values in .env:

HF_TOKEN — required for gated server-side models (for example Cohere and Granite)
GPU_MEMORY_UTILIZATION=0.65 (start conservative; tune upward once stable)
WHISPER_MODEL=large-v3 (worker model size: large-v3, medium, small, base)
NUM_WORKERS=4 (parallel GPU inference threads in the worker)
CORS_ALLOWED_ORIGIN=* (lock to domain in production)

2) Start the Full Stack

docker compose up --build

This launches five services:

Service	Role	Port
gateway	Caddy reverse-proxy, TLS termination	`:80` / `:443`
frontend	React SPA + Nginx API proxy	`:3000` → Nginx `:80`
backend	FastAPI — job enqueue, audio normalisation, REST API	`:8000`
valkey	Valkey (Redis-compatible) job queue + hash store	`:6379`
worker	faster-whisper GPU inference (BLPOP consumer)	—

Frontend: http://localhost:3000 · Backend API: http://localhost:8000 · Swagger: http://localhost:8000/docs

3) Running Services Individually

# Backend + Valkey + Worker only (headless API mode)
docker compose up --build backend valkey worker

# Frontend only (requires healthy backend)
docker compose up --build frontend

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                     Browser (React / Vite)                          │
│                                                                     │
│  ┌──────────────┐   Strategy Router   ┌──────────────┐              │
│  │ Model Select │──────────────────►  │ InferRouter  │              │
│  └──────────────┘                     └──────┬───────┘              │
│                                              │                      │
│                        ┌─────────────────────┴──────────────────┐   │
│                        │       "WebGPU" in name?     else       │   │
│                        └──────────────┬───────────────────┬─────┘   │
│                                       │                   │         │
│                             ┌─────────▼─────────┐  ┌──────▼──────┐  │
│                             │    WebWorker      │  │ p-limit pool │ │
│                             │ (ONNX + LocalAg2) │  │ (8) + upload │ │
│                             └─────────┬─────────┘  └──────┬──────┘  │
│                                       │                   │         │
│  ┌────────────────────────────────────▼───────────────────▼──────┐  │
│  │ Metrics Dashboard: TTFT · ITL · RTFx · Upload Progress        │  │
│  │ Jobs Sidebar: Progress Bar · Status · Playback                │  │
│  └───────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────┬──────────────────────────────────┘
                   │ HTTPS (Caddy TLS)
          ┌──────────────▼──────────────┐
          │        Caddy Gateway        │
          │      TLS · HTTP/3 · Routing │
          └──────────────┬──────────────┘
                   │
        ┌────────────────────┴────────────────────┐
        │                /api/*        /*         │
  ┌─────────▼──────────┐                   ┌──────────▼─────────┐
  │     Nginx Proxy    │                   │   Nginx Static     │
  │   (strips /api/)   │                   │    (React SPA)     │
  └─────────┬──────────┘                   └────────────────────┘
        │
  ┌─────────▼──────────┐     ┌─────────────────┐     ┌──────────────────┐
  │   FastAPI Backend  │────►│      Valkey     │◄────│   Worker (GPU)   │
  │                    │     │  (job queue +   │     │  faster-whisper  │
  │  • POST /transcribe│     │   hash store)   │     │    large-v3      │
  │    /batch          │     │                 │     │                  │
  │  • Audio normalize │     │  LPUSH → BLPOP  │     │  ThreadPool(4)   │
  │  • Job enqueue     │     │  HSET / HGETALL │     │  CUDA inference  │
  │  • /jobs CRUD      │     │  EXPIRE 24h     │     │                  │
  │  • /audio playback │     └─────────────────┘     │  Spool cleanup   │
  └────────────────────┘                             └──────────────────┘
        │
  ┌─────────▼──────────┐
  │  /data/audio_spool │  (Docker named volume, shared)
  └────────────────────┘

Saved Drawings (Excalidraw MCP)

Layered Architecture (PNG)

Source: Layered architecture drawing JSON

Full-Flow Architecture (PNG)

Source: Full-flow architecture drawing JSON

To replay a saved drawing in MCP chat, copy the elements array from one of the JSON files and pass it to the Excalidraw create_view tool.

Data Flow: Batch Upload (945 files)

Browser splits files into micro-batches of 5 via p-limit(8) — max 8 concurrent HTTP requests
Nginx proxies each POST /api/transcribe/batch → FastAPI (strips /api/ prefix)
FastAPI normalises audio to 16 kHz mono WAV, writes to /data/audio_spool, creates Valkey hash with 24h TTL, LPUSH job ID to queue
Worker BLPOPs job IDs, runs faster-whisper inference in a thread pool, writes transcript + metrics back to Valkey hash
Frontend polls GET /jobs/{id} for status updates, displays progress bar
Cleanup: Spool files are deleted after successful transcription; Valkey hashes expire after 24h; orphan sweep runs on DELETE /jobs

Strategy-Based Routing

Model	Execution Mode	Description
`openai/whisper-base`	Server (HF-GPU / HF-CPU)	Standard HF transformers pipeline
`CohereLabs/cohere-transcribe-03-2026`	Server (HF-GPU / HF-CPU)	Custom model wrapper (`trust_remote_code`)
`Qwen3-ASR-1.7B`	Server (HF-GPU)	Via `qwen-asr` package (architecture not in transformers)
`ibm-granite/granite-4.0-1b-speech`	Server (HF-GPU)	Chat-template + `<\|audio\|>` multimodal wrapper
`Xenova/whisper-tiny` / `Xenova/whisper-base`	Client (WebGPU)	Runs in-browser via transformers.js ONNX
`onnx-community/cohere-transcribe-03-2026-ONNX`	Client (WebGPU)	Cohere in-browser via transformers.js ONNX

WebGPU Client-Side Features

Zero-server transcription – audio never leaves the browser
Persistent ONNX cache – model weights cached via the browser Cache API
LocalAgreement-2 streaming – unconfirmed partial hypotheses are held back until two consecutive passes agree on a prefix

Backend Inference Configuration

faster-whisper (CTranslate2) – quantized INT8 inference, up to 4× faster than native PyTorch Whisper
Pre-batching normalisation – all audio normalised to 16 kHz mono WAV before entering the worker
Resource constraints – configurable NUM_WORKERS, WHISPER_MODEL, and GPU_MEMORY_UTILIZATION via environment

Hardware Notes: SM 12.0 (Blackwell) GPUs

Affects: Local development on NVIDIA RTX 5070 Ti and other SM 12.0 (Blackwell) GPUs.

The faster-whisper worker uses CTranslate2 with native CUDA kernels and runs reliably on all CUDA 12.x hardware including SM 12.0 Blackwell. No Triton compiler dependency is involved.

For the architectural decision record on engine selection, see ADR-001: Migrate ASR Inference from vLLM to faster-whisper.

Project Structure

model-explorer-open-asr/
├── docker-compose.yml        # 5-service orchestration (gateway, frontend, backend, valkey, worker)
├── Caddyfile                 # Caddy reverse-proxy: TLS + API/UI routing
├── backend/
│   ├── app.py                # FastAPI: job enqueue, audio normalisation, CRUD
│   ├── worker.py             # BLPOP consumer: faster-whisper GPU inference
│   ├── requirements.txt
│   └── Dockerfile            # CUDA 12.8 (Blackwell-compatible)
└── frontend/
    ├── Dockerfile            # Multi-stage Vite build → Nginx serve
    ├── nginx.conf            # API proxy + SSE streaming + SPA fallback
    ├── index.html
    ├── package.json
    ├── vite.config.js
    └── src/
        ├── main.jsx
        ├── App.jsx
        ├── index.css
        ├── components/
        │   ├── ModelSelector.jsx
        │   ├── AudioRecorder.jsx
        │   ├── StagedFiles.jsx       # Staging area + upload progress bar
        │   ├── JobsList.jsx           # Jobs sidebar + completion progress bar
        │   ├── MetricsDashboard.jsx
        │   └── TranscriptDisplay.jsx
        ├── services/
        │   └── inferenceRouter.js    # Strategy routing + p-limit batch upload
        └── workers/
            └── webgpu.worker.js      # transformers.js + LocalAgreement-2

Metrics

Metric	Definition
TTFT (Time-to-First-Token)	Time from request submission to the first generated token
ITL (Inter-Token Latency)	Mean time between successive tokens during the decode phase
RTFx (Real-Time Factor)	`audio_duration / processing_time` – values > 1 indicate faster-than-real-time

Guides

Adding a New ASR Model — step-by-step for HF Transformers, Batch Worker, and WebGPU models
ADR-001: Migrate ASR Inference Engine from vLLM to faster-whisper — architectural decision and technical rationale
Production Readiness Roadmap — roadmap for transitioning Open-ASR to a resilient enterprise-grade service
Production Architecture Diagram — spoolless ingestion, HA queueing, multi-AZ workers, and control-plane observability

Local Development

Prerequisites

Node.js ≥ 18
Python 3.11
(Optional) NVIDIA GPU with CUDA 12.x for server-side faster-whisper inference

Backend

cd backend
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn app:app --reload --host 0.0.0.0 --port 8000

The API will be available at http://localhost:8000.
Swagger docs: http://localhost:8000/docs

Frontend

cd frontend
npm install
npm run dev

The UI will be available at http://localhost:5173.
API requests are proxied to http://localhost:8000 via Vite's dev proxy.

Docker Setup

Backend only

cd backend
docker build -t open-asr-backend .
docker run --gpus all -p 8000:8000 \
  -v $HOME/.cache/huggingface:/hf_cache \
  open-asr-backend

Full stack with Docker Compose

# From repo root
docker compose up --build

The UI will be available at http://localhost:3000 and the API at http://localhost:8000.

Deploy to Hugging Face Spaces

Fork this repository.
Create a new Space at huggingface.co/new-space with the Docker SDK.
Set the following Space secrets:
- HF_TOKEN – your Hugging Face token (for gated model downloads)
The Space Docker container exposes port 7860 (Hugging Face default).
Push your fork – Spaces will automatically build the Docker image and start the server.

API Reference

`POST /transcribe`

Synchronous full transcription (in-process HF pipeline).

`POST /transcribe/stream`

Same as above but returns Server-Sent Events with token-by-token streaming.

`POST /transcribe/async`

Enqueue a single file for background transcription. Returns { job_id, status: "accepted" }.

`POST /transcribe/batch`

Enqueue multiple files. Returns { jobs: [{ id, filename }, ...] }.

`GET /jobs?session_id=...`

List all jobs scoped to a session.

`GET /jobs/{job_id}`

Poll job status. Returns full hash: status, transcript, segments, ttft_ms, itl_ms, processing_time_s, ...

`DELETE /jobs?session_id=...`

Delete all session jobs + sweep orphaned spool files.

`DELETE /jobs/{job_id}`

Delete a single job and its spool file.

`POST /jobs/{job_id}/resubmit`

Re-enqueue a completed/failed job.

`GET /models`

List available models with metadata.

`GET /audio/{filename}`

Serve audio from the spool directory (browser playback).

`GET /health`

Returns readiness details such as status, mode, loaded_models, and failed_models. status=ok means at least one real model is ready; status=degraded means no real model is currently serving.

`GET /models`

Returns the list of available server-side model keys.

Name		Name	Last commit message	Last commit date
Latest commit History 44 Commits
.github		.github
.vscode		.vscode
backend		backend
docs		docs
frontend		frontend
samples		samples
scribe		scribe
.env.aws		.env.aws
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
ADDING_MODELS.md		ADDING_MODELS.md
Caddyfile		Caddyfile
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
deploy-aws.sh		deploy-aws.sh
docker-compose.aws.yml		docker-compose.aws.yml
docker-compose.scribe.yml		docker-compose.scribe.yml
docker-compose.yml		docker-compose.yml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Open-ASR Model Explorer

Quick Start

1) Environment Setup

2) Start the Full Stack

3) Running Services Individually

Architecture

Saved Drawings (Excalidraw MCP)

Layered Architecture (PNG)

Full-Flow Architecture (PNG)

Data Flow: Batch Upload (945 files)

Strategy-Based Routing

WebGPU Client-Side Features

Backend Inference Configuration

Hardware Notes: SM 12.0 (Blackwell) GPUs

Project Structure

Metrics

Guides

Local Development

Prerequisites

Backend

Frontend

Docker Setup

Backend only

Full stack with Docker Compose

Deploy to Hugging Face Spaces

API Reference

POST /transcribe

POST /transcribe/stream

POST /transcribe/async

POST /transcribe/batch

GET /jobs?session_id=...

GET /jobs/{job_id}

DELETE /jobs?session_id=...

DELETE /jobs/{job_id}

POST /jobs/{job_id}/resubmit

GET /models

GET /audio/{filename}

GET /health

GET /models

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`POST /transcribe`

`POST /transcribe/stream`

`POST /transcribe/async`

`POST /transcribe/batch`

`GET /jobs?session_id=...`

`GET /jobs/{job_id}`

`DELETE /jobs?session_id=...`

`DELETE /jobs/{job_id}`

`POST /jobs/{job_id}/resubmit`

`GET /models`

`GET /audio/{filename}`

`GET /health`

`GET /models`

Packages