A TypeScript + Python web UI for benchmarking open-source speech-to-text models on CPU.
| Model | Variants | Notes |
|---|---|---|
| faster-whisper | tiny · base · small · medium | CTranslate2-optimised Whisper |
| whisper.cpp | base.en (configurable) | Pure C++ inference |
| Vosk | small-en (configurable) | Kaldi-based, fully offline |
| WhisperX | base (configurable) | Whisper + forced alignment |
| Tool | Minimum version | Check |
|---|---|---|
| Node.js | 18 LTS | `node -v` |
| Python | 3.10 | `python3 --version` |
| ffmpeg | any recent | `ffmpeg -version` |
| cmake | any recent | `cmake --version` |
| whisper.cpp binary | — | see §4 |
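If you'd rather check everything in one go, a small script along these lines works (a sketch; the commands and minimum versions mirror the table above, while the script itself and its output format are illustrative):

```python
import re
import shutil
import subprocess

def tool_first_line(cmd: list[str]) -> str:
    """Run a version command and return the first line of output, or '' if the tool is missing."""
    if shutil.which(cmd[0]) is None:
        return ""
    out = subprocess.run(cmd, capture_output=True, text=True)
    lines = (out.stdout or out.stderr).splitlines()
    return lines[0] if lines else ""

def version_tuple(text: str) -> tuple[int, int]:
    """Extract (major, minor) from e.g. 'v18.19.0' or 'Python 3.10.12'."""
    m = re.search(r"(\d+)\.(\d+)", text)
    return (int(m.group(1)), int(m.group(2))) if m else (0, 0)

if __name__ == "__main__":
    # Minimums from the table above; (0, 0) means "any recent version".
    checks = {
        "node -v": (18, 0),
        "python3 --version": (3, 10),
        "ffmpeg -version": (0, 0),
        "cmake --version": (0, 0),
    }
    for cmd, minimum in checks.items():
        line = tool_first_line(cmd.split())
        ok = bool(line) and version_tuple(line) >= minimum
        print(f"{cmd.split()[0]:10s} {'OK: ' + line if ok else 'MISSING or too old'}")
```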
**Ubuntu / Debian**

```bash
sudo apt update && sudo apt install -y ffmpeg python3-pip python3-venv cmake
```

**macOS (Homebrew)**

```bash
brew install ffmpeg node python cmake
```

**Windows**

- Install Node 18 LTS from https://nodejs.org
- Install Python 3.10+ from https://python.org
- Install ffmpeg: `winget install ffmpeg`, or download from https://ffmpeg.org/download.html
```bash
git clone <repo-url> stt-benchmark
cd stt-benchmark
npm install
```

It is strongly recommended to use a virtual environment.

```bash
python3 -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

# CPU-only PyTorch (saves ~2 GB vs the CUDA wheel)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu

# All other deps
pip install -r requirements.txt
```

Note: `whisperx` has a transitive dependency on `ctranslate2`. If the pip install fails, try upgrading pip first: `pip install --upgrade pip`.
```bash
mkdir -p models
cd models

# Small English model (~40 MB) — fastest
wget https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
unzip vosk-model-small-en-us-0.15.zip

# Full English model (~1.8 GB) — more accurate
# wget https://alphacephei.com/vosk/models/vosk-model-en-us-0.22.zip
# unzip vosk-model-en-us-0.22.zip
```

By default the server looks for:

```
models/vosk-model-small-en-us-0.15/
```

Override with an environment variable:

```bash
export VOSK_MODEL_PATH=/absolute/path/to/vosk-model
```

**Option A — build from source (recommended)**
```bash
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
cmake -B build && cmake --build build --config Release -j$(nproc)

# Download a GGML model
bash models/download-ggml-model.sh base.en

# Copy binary and model into the project
cp build/bin/whisper-cli ../stt-benchmark/whisper-cli
cp models/ggml-base.en.bin ../stt-benchmark/models/
cd ..
```

**Option B — pre-built binary (Linux x86_64)**

```bash
wget https://github.com/ggerganov/whisper.cpp/releases/latest/download/whisper.cpp-linux-x64.tar.gz
tar -xzf whisper.cpp-linux-x64.tar.gz
```

Environment variables:

```bash
export WHISPER_CPP_BIN=/path/to/whisper-cli        # default: "whisper-cli" in project root
export WHISPER_CPP_MODEL=/path/to/ggml-base.en.bin # default: models/ggml-base.en.bin
```

The binary produced by recent whisper.cpp builds is named `whisper-cli`, not `main`.
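The "env override, else project-relative default" pattern used for these variables can be sketched as follows (an illustration, not the server's actual code; it assumes the process runs from the project root):

```python
import os
from pathlib import Path

# Assumption: paths are resolved relative to the current working directory,
# i.e. the project root. The real server.ts may differ in detail.
PROJECT_ROOT = Path.cwd()

def resolve(env_var: str, default: str) -> str:
    """Return the environment override if set, else the project-relative default."""
    return os.environ.get(env_var) or str(PROJECT_ROOT / default)

whisper_bin = resolve("WHISPER_CPP_BIN", "whisper-cli")
whisper_model = resolve("WHISPER_CPP_MODEL", "models/ggml-base.en.bin")
vosk_model = resolve("VOSK_MODEL_PATH", "models/vosk-model-small-en-us-0.15")
```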
Copy the example env file and set your credentials:

```bash
cp .env.example .env
```

Edit `.env`:

```
AUTH_USER=your_username
AUTH_PASS=your_password
PORT=3001
```

The server requires HTTP Basic Auth on all API routes. The frontend login screen reads the same credentials. `.env` is gitignored and never committed.
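The `.env` format is simple `KEY=VALUE` lines, so it is easy to read from your own tooling as well. A minimal parser sketch (the server itself presumably uses a dotenv-style loader; this is just an equivalent for illustration):

```python
from pathlib import Path

def parse_env(path: str) -> dict[str, str]:
    """Minimal .env parser: KEY=VALUE lines; blank lines and '#' comments are ignored."""
    env: dict[str, str] = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env
```

For example, `parse_env(".env")["PORT"]` would return the configured port as a string.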
```bash
# Compile app.ts → frontend/dist/app.js, then copy next to index.html
npm run build:frontend
cp frontend/dist/app.js frontend/app.js
```

Use the provided scripts (they handle venv activation, port cleanup, and health checking):

```bash
# Start (waits up to 20 s for the server to be ready)
./start.sh

# Stop
./stop.sh
```

Or run manually:

```bash
source .venv/bin/activate
npm run dev
```

The server starts on http://localhost:3001 (or the PORT set in `.env`). When accessed over a network (e.g. a VM's public IP), use the machine's IP directly — e.g. http://34.x.x.x:3001. All API calls use relative URLs so they follow the page origin.
All scripts accept an audio file path and output JSON. You can test them directly:

**faster-whisper**

```bash
python3 backend/scripts/run_faster_whisper.py sample.wav tiny
python3 backend/scripts/run_faster_whisper.py sample.wav base
python3 backend/scripts/run_faster_whisper.py sample.wav small
python3 backend/scripts/run_faster_whisper.py sample.wav medium
```

**WhisperX**

```bash
python3 backend/scripts/run_whisperx.py sample.wav
```

**Vosk**

```bash
python3 backend/scripts/vosk_transcribe.py sample.wav models/vosk-model-small-en-us-0.15
```

**whisper.cpp (via curl with auth)**

```bash
curl -u your_username:your_password -X POST http://localhost:3001/transcribe/whisper-cpp \
  -F "audio=@sample.wav"
```

**All at once via the API**

```bash
AUTH="-u your_username:your_password"

# List available models
curl $AUTH http://localhost:3001/models

# Transcribe with faster-whisper base
curl $AUTH -X POST http://localhost:3001/transcribe/faster-whisper-base \
  -F "audio=@sample.wav"

# View saved benchmark results
curl $AUTH http://localhost:3001/benchmark
```

All endpoints require HTTP Basic Auth (`Authorization: Basic <base64>`) or the `X-API-Key: <base64(user:pass)>` header. Use the credentials you set in `.env`.
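Both accepted header styles carry the same value, base64 of `user:pass`. A small Python sketch for building them (substitute your real credentials from `.env`):

```python
import base64

def auth_headers(user: str, password: str) -> dict[str, str]:
    """Build both header styles this API accepts; the token is base64('user:pass')."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return {
        "Authorization": f"Basic {token}",
        # Alternative accepted by this server:
        "X-API-Key": token,
    }

headers = auth_headers("your_username", "your_password")
```

Either header can then be passed to `urllib.request`, `fetch`, or `curl -H`.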
| Method | Endpoint | Description |
|---|---|---|
| GET | `/models` | List all available model IDs |
| POST | `/transcribe/:modelId` | Upload audio (multipart `audio` field) and transcribe |
| GET | `/benchmark` | Return all saved single-file results |
| DELETE | `/benchmark` | Clear all single-file results |
| DELETE | `/benchmark/:id` | Delete one result by ID |
| Method | Endpoint | Description |
|---|---|---|
| POST | `/benchmark-batch` | Upload a ZIP containing audio files + `mapping.json` |
| GET | `/benchmark-batch` | Return all past batch analyses |
| DELETE | `/benchmark-batch` | Clear all batch results |
| DELETE | `/benchmark-batch/:id` | Delete one batch result by ID |
ZIP structure (flexible — files may be at root or inside a single folder):

```
my-test.zip
├── mapping.json   ← required: maps filename → reference transcript
├── audio1.wav
└── audio2.wav
```

`mapping.json` format:

```json
{
  "audio1.wav": "The reference transcript for audio one.",
  "audio2.wav": "Another reference transcript."
}
```

The batch endpoint runs every selected model against every audio file and returns per-file WER (Word Error Rate) and CER (Character Error Rate) alongside transcription time and audio duration.
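Both metrics are edit-distance ratios: WER over words, CER over characters. The server's exact text normalization may differ, but a common formulation looks like this (a sketch, not the project's actual implementation):

```python
def levenshtein(ref: list, hyp: list) -> int:
    """Edit distance (substitutions + insertions + deletions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    return levenshtein(ref, hypothesis.lower().split()) / max(len(ref), 1)

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance / reference length."""
    ref = list(reference.lower())
    return levenshtein(ref, list(hypothesis.lower())) / max(len(ref), 1)
```

Note that WER can exceed 1.0 when the hypothesis is much longer than the reference.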
| Method | Endpoint | Description |
|---|---|---|
| GET | `/backfill-durations/scan` | Scan `uploads/` and backfill durations for existing results |
| POST | `/backfill-durations/zip` | Upload a ZIP to backfill durations for results that lost their audio |
| ID | Description |
|---|---|
| `faster-whisper-tiny` | faster-whisper, tiny variant |
| `faster-whisper-base` | faster-whisper, base variant |
| `faster-whisper-small` | faster-whisper, small variant |
| `faster-whisper-medium` | faster-whisper, medium variant |
| `whisper-cpp` | whisper.cpp, ggml-base.en |
| `vosk` | Vosk small-en |
| `whisperx` | WhisperX base |
```json
{
  "id": "uuid",
  "model": "faster-whisper-base",
  "variant": "base",
  "transcription": "Hello world this is a test.",
  "timeTakenMs": 1842,
  "cpuPercent": 98.5,
  "audioFile": "test.wav",
  "audioDurationMs": 4200,
  "timestamp": "2026-04-16T12:34:56.000Z",
  "error": null
}
```

```
stt-benchmark/
├── backend/
│   ├── server.ts                 ← Express REST API (ts-node)
│   └── scripts/
│       ├── run_faster_whisper.py ← faster-whisper inference
│       ├── run_whisperx.py       ← WhisperX inference
│       └── vosk_transcribe.py    ← Vosk inference
├── frontend/
│   ├── index.html                ← Single-page UI + login overlay
│   ├── app.ts                    ← TypeScript source
│   └── app.js                    ← Compiled output (copy from dist/ after build)
├── models/                       ← Vosk model dir + GGML .bin (not committed)
├── uploads/                      ← Temporary audio uploads (runtime, not committed)
├── .env                          ← Credentials & port (not committed — see .env.example)
├── .env.example                  ← Template for .env
├── benchmark_results.json        ← Persistent single-file results (runtime)
├── batch_results.json            ← Persistent batch results (runtime)
├── whisper-cli                   ← whisper.cpp binary (not committed)
├── start.sh                      ← Start server with health check
├── stop.sh                       ← Stop server cleanly
├── package.json
├── tsconfig.json                 ← Backend TS config
├── tsconfig.frontend.json        ← Frontend TS config
└── requirements.txt
```
- **Login screen shown / 401 errors** — ensure `.env` exists with correct `AUTH_USER`/`AUTH_PASS`. The frontend uses HTTP Basic Auth; credentials are stored in `sessionStorage` for the browser session only.
- **CORS / private-network error in browser** — if accessing via a VM's public IP, open the app at `http://<public-ip>:3001` directly, not via a proxy that rewrites the origin to localhost.
- **faster-whisper first run is slow** — it downloads model weights on first use (~150 MB for tiny). Subsequent runs use the local cache.
- **whisperx ImportError on ctranslate2** — try `pip install ctranslate2==4.1.0` explicitly.
- **Vosk "model not found"** — ensure `VOSK_MODEL_PATH` points to the extracted directory (not the zip).
- **whisper.cpp "command not found"** — set `WHISPER_CPP_BIN` to the full path of the `whisper-cli` binary, or place it in the project root.
- **Audio format errors** — the backend accepts WAV, MP3, FLAC, OGG, and M4A. Vosk converts internally via ffmpeg; ensure ffmpeg is in `PATH`.
- **High memory on the medium model** — faster-whisper medium requires ~1.5 GB RAM. Use tiny or base in constrained environments.
- **EADDRINUSE on start** — `start.sh` uses `fuser` to kill any existing process on the port before starting. If `fuser` is not installed: `sudo apt install psmisc`.