Skip to content

SiliconLanguage/model-explorer-open-asr

Repository files navigation

title Open-ASR Model Explorer
emoji 🎙
colorFrom indigo
colorTo purple
sdk docker
app_port 7860
pinned false

Open-ASR Model Explorer

Deploy to Hugging Face Spaces Python FastAPI React faster-whisper

A hybrid ASR testbed for evaluating open-source speech recognition models (Whisper, Cohere Transcribe, Qwen3-ASR, IBM Granite) with server-side faster-whisper inference via an async Valkey queue and client-side WebGPU inference via transformers.js.

Open-ASR Model Explorer UI

Quick Start

1) Environment Setup

Create a local environment file from the template:

cp .env.example .env

Set required values in .env:

  • HF_TOKEN — required for gated server-side models (for example Cohere and Granite)
  • GPU_MEMORY_UTILIZATION=0.65 (start conservative; tune upward once stable)
  • WHISPER_MODEL=large-v3 (worker model size: large-v3, medium, small, base)
  • NUM_WORKERS=4 (parallel GPU inference threads in the worker)
  • CORS_ALLOWED_ORIGIN=* (lock to domain in production)

2) Start the Full Stack

docker compose up --build

This launches five services:

Service Role Port
gateway Caddy reverse-proxy, TLS termination :80 / :443
frontend React SPA + Nginx API proxy :3000 → Nginx :80
backend FastAPI — job enqueue, audio normalisation, REST API :8000
valkey Valkey (Redis-compatible) job queue + hash store :6379
worker faster-whisper GPU inference (BLPOP consumer)

Frontend: http://localhost:3000 · Backend API: http://localhost:8000 · Swagger: http://localhost:8000/docs

3) Running Services Individually

# Backend + Valkey + Worker only (headless API mode)
docker compose up --build backend valkey worker

# Frontend only (requires healthy backend)
docker compose up --build frontend

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                     Browser (React / Vite)                          │
│                                                                     │
│  ┌──────────────┐   Strategy Router   ┌──────────────┐              │
│  │ Model Select │──────────────────►  │ InferRouter  │              │
│  └──────────────┘                     └──────┬───────┘              │
│                                              │                      │
│                        ┌─────────────────────┴──────────────────┐   │
│                        │       "WebGPU" in name?     else       │   │
│                        └──────────────┬───────────────────┬─────┘   │
│                                       │                   │         │
│                             ┌─────────▼─────────┐  ┌──────▼──────┐  │
│                             │    WebWorker      │  │ p-limit pool │ │
│                             │ (ONNX + LocalAg2) │  │ (8) + upload │ │
│                             └─────────┬─────────┘  └──────┬──────┘  │
│                                       │                   │         │
│  ┌────────────────────────────────────▼───────────────────▼──────┐  │
│  │ Metrics Dashboard: TTFT · ITL · RTFx · Upload Progress        │  │
│  │ Jobs Sidebar: Progress Bar · Status · Playback                │  │
│  └───────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────┬──────────────────────────────────┘
                   │ HTTPS (Caddy TLS)
          ┌──────────────▼──────────────┐
          │        Caddy Gateway        │
          │      TLS · HTTP/3 · Routing │
          └──────────────┬──────────────┘
                   │
        ┌────────────────────┴────────────────────┐
        │                /api/*        /*         │
  ┌─────────▼──────────┐                   ┌──────────▼─────────┐
  │     Nginx Proxy    │                   │   Nginx Static     │
  │   (strips /api/)   │                   │    (React SPA)     │
  └─────────┬──────────┘                   └────────────────────┘
        │
  ┌─────────▼──────────┐     ┌─────────────────┐     ┌──────────────────┐
  │   FastAPI Backend  │────►│      Valkey     │◄────│   Worker (GPU)   │
  │                    │     │  (job queue +   │     │  faster-whisper  │
  │  • POST /transcribe│     │   hash store)   │     │    large-v3      │
  │    /batch          │     │                 │     │                  │
  │  • Audio normalize │     │  LPUSH → BLPOP  │     │  ThreadPool(4)   │
  │  • Job enqueue     │     │  HSET / HGETALL │     │  CUDA inference  │
  │  • /jobs CRUD      │     │  EXPIRE 24h     │     │                  │
  │  • /audio playback │     └─────────────────┘     │  Spool cleanup   │
  └────────────────────┘                             └──────────────────┘
        │
  ┌─────────▼──────────┐
  │  /data/audio_spool │  (Docker named volume, shared)
  └────────────────────┘

Saved Drawings (Excalidraw MCP)

Layered Architecture (PNG)

Open-ASR Layered Architecture

Source: Layered architecture drawing JSON

Full-Flow Architecture (PNG)

Open-ASR Full-Flow Architecture

Source: Full-flow architecture drawing JSON

To replay a saved drawing in MCP chat, copy the elements array from one of the JSON files and pass it to the Excalidraw create_view tool.

Data Flow: Batch Upload (945 files)

  1. Browser splits files into micro-batches of 5 via p-limit(8) — max 8 concurrent HTTP requests
  2. Nginx proxies each POST /api/transcribe/batch → FastAPI (strips /api/ prefix)
  3. FastAPI normalises audio to 16 kHz mono WAV, writes to /data/audio_spool, creates Valkey hash with 24h TTL, LPUSH job ID to queue
  4. Worker BLPOPs job IDs, runs faster-whisper inference in a thread pool, writes transcript + metrics back to Valkey hash
  5. Frontend polls GET /jobs/{id} for status updates, displays progress bar
  6. Cleanup: Spool files are deleted after successful transcription; Valkey hashes expire after 24h; orphan sweep runs on DELETE /jobs

Strategy-Based Routing

Model Execution Mode Description
openai/whisper-base Server (HF-GPU / HF-CPU) Standard HF transformers pipeline
CohereLabs/cohere-transcribe-03-2026 Server (HF-GPU / HF-CPU) Custom model wrapper (trust_remote_code)
Qwen3-ASR-1.7B Server (HF-GPU) Via qwen-asr package (architecture not in transformers)
ibm-granite/granite-4.0-1b-speech Server (HF-GPU) Chat-template + <|audio|> multimodal wrapper
Xenova/whisper-tiny / Xenova/whisper-base Client (WebGPU) Runs in-browser via transformers.js ONNX
onnx-community/cohere-transcribe-03-2026-ONNX Client (WebGPU) Cohere in-browser via transformers.js ONNX

WebGPU Client-Side Features

  • Zero-server transcription – audio never leaves the browser
  • Persistent ONNX cache – model weights cached via the browser Cache API
  • LocalAgreement-2 streaming – unconfirmed partial hypotheses are held back until two consecutive passes agree on a prefix

Backend Inference Configuration

  • faster-whisper (CTranslate2) – quantized INT8 inference, up to 4× faster than native PyTorch Whisper
  • Pre-batching normalisation – all audio normalised to 16 kHz mono WAV before entering the worker
  • Resource constraints – configurable NUM_WORKERS, WHISPER_MODEL, and GPU_MEMORY_UTILIZATION via environment

Hardware Notes: SM 12.0 (Blackwell) GPUs

Affects: Local development on NVIDIA RTX 5070 Ti and other SM 12.0 (Blackwell) GPUs.

The faster-whisper worker uses CTranslate2 with native CUDA kernels and runs reliably on all CUDA 12.x hardware including SM 12.0 Blackwell. No Triton compiler dependency is involved.

For the architectural decision record on engine selection, see ADR-001: Migrate ASR Inference from vLLM to faster-whisper.


Project Structure

model-explorer-open-asr/
├── docker-compose.yml        # 5-service orchestration (gateway, frontend, backend, valkey, worker)
├── Caddyfile                 # Caddy reverse-proxy: TLS + API/UI routing
├── backend/
│   ├── app.py                # FastAPI: job enqueue, audio normalisation, CRUD
│   ├── worker.py             # BLPOP consumer: faster-whisper GPU inference
│   ├── requirements.txt
│   └── Dockerfile            # CUDA 12.8 (Blackwell-compatible)
└── frontend/
    ├── Dockerfile            # Multi-stage Vite build → Nginx serve
    ├── nginx.conf            # API proxy + SSE streaming + SPA fallback
    ├── index.html
    ├── package.json
    ├── vite.config.js
    └── src/
        ├── main.jsx
        ├── App.jsx
        ├── index.css
        ├── components/
        │   ├── ModelSelector.jsx
        │   ├── AudioRecorder.jsx
        │   ├── StagedFiles.jsx       # Staging area + upload progress bar
        │   ├── JobsList.jsx           # Jobs sidebar + completion progress bar
        │   ├── MetricsDashboard.jsx
        │   └── TranscriptDisplay.jsx
        ├── services/
        │   └── inferenceRouter.js    # Strategy routing + p-limit batch upload
        └── workers/
            └── webgpu.worker.js      # transformers.js + LocalAgreement-2

Metrics

Metric Definition
TTFT (Time-to-First-Token) Time from request submission to the first generated token
ITL (Inter-Token Latency) Mean time between successive tokens during the decode phase
RTFx (Real-Time Factor) audio_duration / processing_time – values > 1 indicate faster-than-real-time

Guides


Local Development

Prerequisites

  • Node.js ≥ 18
  • Python 3.11
  • (Optional) NVIDIA GPU with CUDA 12.x for server-side faster-whisper inference

Backend

cd backend
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn app:app --reload --host 0.0.0.0 --port 8000

The API will be available at http://localhost:8000.
Swagger docs: http://localhost:8000/docs

Frontend

cd frontend
npm install
npm run dev

The UI will be available at http://localhost:5173.
API requests are proxied to http://localhost:8000 via Vite's dev proxy.


Docker Setup

Backend only

cd backend
docker build -t open-asr-backend .
docker run --gpus all -p 8000:8000 \
  -v $HOME/.cache/huggingface:/hf_cache \
  open-asr-backend

Full stack with Docker Compose

# From repo root
docker compose up --build

The UI will be available at http://localhost:3000 and the API at http://localhost:8000.


Deploy to Hugging Face Spaces

  1. Fork this repository.
  2. Create a new Space at huggingface.co/new-space with the Docker SDK.
  3. Set the following Space secrets:
    • HF_TOKEN – your Hugging Face token (for gated model downloads)
  4. The Space Docker container exposes port 7860 (Hugging Face default).
  5. Push your fork – Spaces will automatically build the Docker image and start the server.

Deploy to Hugging Face Spaces


API Reference

POST /transcribe

Synchronous full transcription (in-process HF pipeline).

POST /transcribe/stream

Same as above but returns Server-Sent Events with token-by-token streaming.

POST /transcribe/async

Enqueue a single file for background transcription. Returns { job_id, status: "accepted" }.

POST /transcribe/batch

Enqueue multiple files. Returns { jobs: [{ id, filename }, ...] }.

GET /jobs?session_id=...

List all jobs scoped to a session.

GET /jobs/{job_id}

Poll job status. Returns full hash: status, transcript, segments, ttft_ms, itl_ms, processing_time_s, ...

DELETE /jobs?session_id=...

Delete all session jobs + sweep orphaned spool files.

DELETE /jobs/{job_id}

Delete a single job and its spool file.

POST /jobs/{job_id}/resubmit

Re-enqueue a completed/failed job.

GET /models

List available models with metadata.

GET /audio/{filename}

Serve audio from the spool directory (browser playback).

GET /health

Returns readiness details such as status, mode, loaded_models, and failed_models. status=ok means at least one real model is ready; status=degraded means no real model is currently serving.

GET /models

Returns the list of available server-side model keys.

About

A hybrid testbed for evaluating top open-source ASR models (Cohere, Qwen3-ASR, IBM Granite) featuring vLLM server-side routing and transformers.js WebGPU client-side inference.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors