STT Benchmark

A TypeScript + Python web UI for benchmarking open-source speech-to-text models on CPU.

| Model | Variants | Notes |
| --- | --- | --- |
| faster-whisper | tiny · base · small · medium | CTranslate2-optimised Whisper |
| whisper.cpp | base.en (configurable) | Pure C++ inference |
| Vosk | small-en (configurable) | Kaldi-based, fully offline |
| WhisperX | base (configurable) | Whisper + forced alignment |

Prerequisites

| Tool | Minimum version | Check |
| --- | --- | --- |
| Node.js | 18 LTS | node -v |
| Python | 3.10 | python3 --version |
| ffmpeg | any recent | ffmpeg -version |
| cmake | any recent | cmake --version |
| whisper.cpp binary | see §4 | n/a |

Install system dependencies

Ubuntu / Debian

sudo apt update && sudo apt install -y ffmpeg python3-pip python3-venv cmake

macOS (Homebrew)

brew install ffmpeg node python cmake

Windows
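
One option is to install the same tools with the Chocolatey package manager:

choco install -y ffmpeg nodejs python cmake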


1. Clone & install Node dependencies

git clone <repo-url> stt-benchmark
cd stt-benchmark
npm install

2. Install Python dependencies

It is strongly recommended to use a virtual environment.

python3 -m venv .venv
source .venv/bin/activate          # Windows: .venv\Scripts\activate

# CPU-only PyTorch (saves ~2 GB vs the CUDA wheel)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu

# All other deps
pip install -r requirements.txt

Note: whisperx has a transitive dependency on ctranslate2. If the pip install fails, try upgrading pip first: pip install --upgrade pip.


3. Install Vosk model

mkdir -p models
cd models

# Small English model (~40 MB) — fastest
wget https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
unzip vosk-model-small-en-us-0.15.zip

# Full English model (~1.8 GB) — more accurate
# wget https://alphacephei.com/vosk/models/vosk-model-en-us-0.22.zip
# unzip vosk-model-en-us-0.22.zip

By default the server looks for:

models/vosk-model-small-en-us-0.15/

Override with an environment variable:

export VOSK_MODEL_PATH=/absolute/path/to/vosk-model

4. Install whisper.cpp

Option A — build from source (recommended)

git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
cmake -B build && cmake --build build --config Release -j$(nproc)

# Download a GGML model
bash models/download-ggml-model.sh base.en

# Copy binary and model into the project
cp build/bin/whisper-cli ../stt-benchmark/whisper-cli
cp models/ggml-base.en.bin ../stt-benchmark/models/
cd ../stt-benchmark

Option B — pre-built binary (Linux x86_64)

wget https://github.com/ggerganov/whisper.cpp/releases/latest/download/whisper.cpp-linux-x64.tar.gz
tar -xzf whisper.cpp-linux-x64.tar.gz

Environment variables:

export WHISPER_CPP_BIN=/path/to/whisper-cli      # default: "whisper-cli" in project root
export WHISPER_CPP_MODEL=/path/to/ggml-base.en.bin   # default: models/ggml-base.en.bin

The binary produced by recent whisper.cpp builds is named whisper-cli, not main.
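
As a quick sanity check, you can run the binary directly (-m selects the GGML model, -f the input audio):

./whisper-cli -m models/ggml-base.en.bin -f sample.wav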


5. Configure credentials

Copy the example env file and set your credentials:

cp .env.example .env

Edit .env:

AUTH_USER=your_username
AUTH_PASS=your_password
PORT=3001

The server requires HTTP Basic Auth on all API routes. The frontend login screen reads the same credentials. .env is gitignored and never committed.


6. Build the TypeScript frontend

# Compile app.ts → frontend/dist/app.js, then copy next to index.html
npm run build:frontend
cp frontend/dist/app.js frontend/app.js

7. Start and stop the server

Use the provided scripts (they handle venv activation, port cleanup, and health checking):

# Start (waits up to 20 s for the server to be ready)
./start.sh

# Stop
./stop.sh

Or run manually:

source .venv/bin/activate
npm run dev

The server starts on http://localhost:3001 (or the PORT set in .env).

When accessed over a network (e.g. a VM's public IP), use the machine's IP directly — e.g. http://34.x.x.x:3001. All API calls use relative URLs so they follow the page origin.


8. Test each model individually

All scripts accept an audio file path and output JSON. You can test them directly:

faster-whisper

python3 backend/scripts/run_faster_whisper.py sample.wav tiny
python3 backend/scripts/run_faster_whisper.py sample.wav base
python3 backend/scripts/run_faster_whisper.py sample.wav small
python3 backend/scripts/run_faster_whisper.py sample.wav medium
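
Each script prints a single JSON result to stdout, so you can also drive one from Python. A minimal sketch:

import json
import subprocess

# Run a backend script and pretty-print whatever JSON it emits.
out = subprocess.run(
    ["python3", "backend/scripts/run_faster_whisper.py", "sample.wav", "base"],
    capture_output=True, text=True, check=True,
).stdout
print(json.dumps(json.loads(out), indent=2))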

WhisperX

python3 backend/scripts/run_whisperx.py sample.wav

Vosk

python3 backend/scripts/vosk_transcribe.py sample.wav models/vosk-model-small-en-us-0.15

whisper.cpp (via curl with auth)

curl -u your_username:your_password -X POST http://localhost:3001/transcribe/whisper-cpp \
  -F "audio=@sample.wav"

All at once via the API

AUTH="-u your_username:your_password"

# List available models
curl $AUTH http://localhost:3001/models

# Transcribe with faster-whisper base
curl $AUTH -X POST http://localhost:3001/transcribe/faster-whisper-base \
  -F "audio=@sample.wav"

# View saved benchmark results
curl $AUTH http://localhost:3001/benchmark

API Reference

All endpoints require HTTP Basic Auth (Authorization: Basic <base64>) or the X-API-Key: <base64(user:pass)> header.
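
For example, you can build the header by hand with Python's standard library. A minimal sketch (your_username / your_password stand in for whatever is set in .env):

import base64
import urllib.request

# X-API-Key carries base64("user:pass"), the same value Basic Auth sends.
key = base64.b64encode(b"your_username:your_password").decode()
req = urllib.request.Request("http://localhost:3001/models",
                             headers={"X-API-Key": key})
print(urllib.request.urlopen(req).read().decode())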

Single-file benchmarking

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /models | List all available model IDs |
| POST | /transcribe/:modelId | Upload audio (multipart audio field) and transcribe |
| GET | /benchmark | Return all saved single-file results |
| DELETE | /benchmark | Clear all single-file results |
| DELETE | /benchmark/:id | Delete one result by ID |

Batch / ZIP benchmarking

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /benchmark-batch | Upload a ZIP containing audio files + mapping.json |
| GET | /benchmark-batch | Return all past batch analyses |
| DELETE | /benchmark-batch | Clear all batch results |
| DELETE | /benchmark-batch/:id | Delete one batch result by ID |

ZIP structure (flexible — files may be at root or inside a single folder):

my-test.zip
├── mapping.json          ← required: maps filename → reference transcript
├── audio1.wav
└── audio2.wav

mapping.json format:

{
  "audio1.wav": "The reference transcript for audio one.",
  "audio2.wav": "Another reference transcript."
}
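
To assemble such a ZIP programmatically, here is a minimal Python sketch (the file names match the example above and are assumed to exist on disk):

import json
import zipfile

mapping = {
    "audio1.wav": "The reference transcript for audio one.",
    "audio2.wav": "Another reference transcript.",
}

with zipfile.ZipFile("my-test.zip", "w") as zf:
    # mapping.json at the ZIP root, audio files alongside it
    zf.writestr("mapping.json", json.dumps(mapping, indent=2))
    for name in mapping:
        zf.write(name)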

The batch endpoint runs every selected model against every audio file and returns per-file WER (Word Error Rate) and CER (Character Error Rate) alongside transcription time and audio duration.
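
For reference, WER is the word-level Levenshtein (edit) distance between hypothesis and reference divided by the reference length, and CER is the same computation over characters. A sketch of the standard formula (not necessarily the backend's exact implementation):

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)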

Audio duration backfill

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /backfill-durations/scan | Scan uploads/ and backfill durations for existing results |
| POST | /backfill-durations/zip | Upload a ZIP to backfill durations for results that lost their audio |
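
For example, to rescan existing uploads (same auth as in the examples above):

curl $AUTH http://localhost:3001/backfill-durations/scan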

Model IDs

| ID | Description |
| --- | --- |
| faster-whisper-tiny | faster-whisper, tiny variant |
| faster-whisper-base | faster-whisper, base variant |
| faster-whisper-small | faster-whisper, small variant |
| faster-whisper-medium | faster-whisper, medium variant |
| whisper-cpp | whisper.cpp, ggml-base.en |
| vosk | Vosk small-en |
| whisperx | WhisperX base |

Single-file response shape

{
  "id": "uuid",
  "model": "faster-whisper-base",
  "variant": "base",
  "transcription": "Hello world this is a test.",
  "timeTakenMs": 1842,
  "cpuPercent": 98.5,
  "audioFile": "test.wav",
  "audioDurationMs": 4200,
  "timestamp": "2026-04-16T12:34:56.000Z",
  "error": null
}
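
A useful metric derived from these fields is the real-time factor, i.e. processing time over audio duration (a hypothetical helper, not part of the API response):

def real_time_factor(result: dict) -> float:
    # < 1.0 means the model transcribes faster than real time
    return result["timeTakenMs"] / result["audioDurationMs"]

real_time_factor({"timeTakenMs": 1842, "audioDurationMs": 4200})  # ~0.44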

Project Structure

stt-benchmark/
├── backend/
│   ├── server.ts                  ← Express REST API (ts-node)
│   └── scripts/
│       ├── run_faster_whisper.py  ← faster-whisper inference
│       ├── run_whisperx.py        ← WhisperX inference
│       └── vosk_transcribe.py     ← Vosk inference
├── frontend/
│   ├── index.html                 ← Single-page UI + login overlay
│   ├── app.ts                     ← TypeScript source
│   └── app.js                     ← Compiled output (copy from dist/ after build)
├── models/                        ← Vosk model dir + GGML .bin (not committed)
├── uploads/                       ← Temporary audio uploads (runtime, not committed)
├── .env                           ← Credentials & port (not committed — see .env.example)
├── .env.example                   ← Template for .env
├── benchmark_results.json         ← Persistent single-file results (runtime)
├── batch_results.json             ← Persistent batch results (runtime)
├── whisper-cli                    ← whisper.cpp binary (not committed)
├── start.sh                       ← Start server with health check
├── stop.sh                        ← Stop server cleanly
├── package.json
├── tsconfig.json                  ← Backend TS config
├── tsconfig.frontend.json         ← Frontend TS config
└── requirements.txt

Troubleshooting

Login screen shown / 401 errors — ensure .env exists with correct AUTH_USER/AUTH_PASS. The frontend uses HTTP Basic Auth; credentials are stored in sessionStorage for the browser session only.

CORS / private network error in browser — if accessing via a VM's public IP, open the app at http://<public-ip>:3001 directly, not via a proxy that rewrites the origin to localhost.

faster-whisper first run is slow — it downloads model weights on first use (~150 MB for tiny). Subsequent runs use the local cache.

whisperx ImportError on ctranslate2 — try pip install ctranslate2==4.1.0 explicitly.

Vosk "model not found" — ensure VOSK_MODEL_PATH points to the extracted directory (not the zip).

whisper.cpp "command not found" — set WHISPER_CPP_BIN to the full path of the whisper-cli binary, or place it in the project root.

Audio format errors — the backend accepts WAV, MP3, FLAC, OGG, M4A. Vosk converts internally via ffmpeg; ensure ffmpeg is in PATH.

High memory on medium model — faster-whisper medium requires ~1.5 GB RAM. Use tiny or base for constrained environments.

EADDRINUSE on start — start.sh uses fuser to kill any existing process on the port before starting. If fuser is not installed: sudo apt install psmisc.
