Real-time speech-to-text transcription server using whisper.cpp with WebSocket streaming. Optimized for Apple Silicon with Metal GPU acceleration.
- Real-time streaming transcription via WebSocket
- Batch transcription via HTTP WAV upload (`POST /v1/transcribe`)
- Voice Activity Detection (VAD) - automatically detects speech end and emits final transcripts
- Metal GPU acceleration on Apple Silicon (M1/M2/M3/M4)
- Multi-client support with context pooling (configurable parallel sessions)
- Low latency (~300-500ms inference on M-series chips)
- 24/7 operation - designed for always-on deployments via launchd
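The batch endpoint accepts WAV uploads. If a client already has raw 16-bit mono PCM (for example, audio captured for the streaming path), it can wrap the samples in a minimal 44-byte WAV header before posting. The helper below is a sketch, not part of this repo:

```typescript
// Hypothetical client-side helper: wrap raw 16-bit mono PCM in a
// minimal canonical WAV (RIFF) header so it can be uploaded to the
// batch endpoint. Assumes 16 kHz mono, matching the streaming format.
function pcmToWav(pcm: Int16Array, sampleRate = 16000): Uint8Array {
  const dataLen = pcm.length * 2; // 2 bytes per 16-bit sample
  const buf = new ArrayBuffer(44 + dataLen);
  const view = new DataView(buf);
  const writeStr = (off: number, s: string) => {
    for (let i = 0; i < s.length; i++) view.setUint8(off + i, s.charCodeAt(i));
  };
  writeStr(0, "RIFF");
  view.setUint32(4, 36 + dataLen, true);    // total chunk size
  writeStr(8, "WAVE");
  writeStr(12, "fmt ");
  view.setUint32(16, 16, true);             // fmt subchunk size
  view.setUint16(20, 1, true);              // audio format: PCM
  view.setUint16(22, 1, true);              // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate = rate * blockAlign
  view.setUint16(32, 2, true);              // block align
  view.setUint16(34, 16, true);             // bits per sample
  writeStr(36, "data");
  view.setUint32(40, dataLen, true);        // data subchunk size
  new Int16Array(buf, 44).set(pcm);
  return new Uint8Array(buf);
}
```

The resulting bytes can be sent as the request body of the batch upload.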
- macOS with Apple Silicon (M1/M2/M3/M4)
- CMake 3.14+
- Xcode Command Line Tools
```bash
# Install build tools
brew install cmake
xcode-select --install

# Clone this repo
git clone https://github.com/kingbootoshi/local-mac-audio-transcription.git
cd local-mac-audio-transcription
```

```bash
./scripts/build.sh
```

The build script uses CMake FetchContent to automatically download whisper.cpp. No manual cloning required.
```bash
./scripts/run.sh
```

Or with custom options:

```bash
./build/whisper-stream-server \
  --model models/ggml-base.en.bin \
  --vad-model models/ggml-silero-vad.bin \
  --port 9090 \
  --contexts 2
```

```bash
cd examples/web-client
npm install
npm run dev
```

Open http://localhost:5173, click Connect, then Start Recording.
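Browsers typically capture microphone audio at 44.1 or 48 kHz, while the server expects 16 kHz PCM, so a browser client has to resample before sending. As a sketch of the idea (simple linear interpolation; this is an assumption, not necessarily how the bundled web client does it):

```typescript
// Naive linear-interpolation resampler: Float32 samples at srcRate
// are mapped onto a dstRate grid. Good enough for speech demos;
// a production client would low-pass filter first to avoid aliasing.
function resample(input: Float32Array, srcRate: number, dstRate: number): Float32Array {
  const ratio = srcRate / dstRate;
  const outLen = Math.floor(input.length / ratio);
  const out = new Float32Array(outLen);
  for (let i = 0; i < outLen; i++) {
    const pos = i * ratio;              // fractional position in input
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}
```

For example, a 48 kHz capture buffer resampled to 16 kHz yields one output sample per three input samples.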
```
local-mac-audio-transcription/
├── src/                      # C++ server source
│   ├── main.cpp              # Entry point, WebSocket handlers
│   ├── whisper_server.cpp    # Core transcription + VAD logic
│   ├── whisper_server.hpp
│   ├── audio_buffer.cpp      # Thread-safe audio buffer
│   ├── audio_buffer.hpp
│   └── json.hpp              # nlohmann/json (auto-downloaded)
│
├── scripts/
│   ├── build.sh              # Build script
│   └── run.sh                # Run script
│
├── examples/                 # Example clients
│   └── web-client/           # TypeScript browser client
│
├── install/                  # Mac deployment
│   ├── install.sh            # Install as launchd service
│   ├── uninstall.sh
│   └── com.whisper.stream-server.plist
│
├── docs/
│   ├── ARCHITECTURE.md       # Server design deep-dive
│   ├── API.md                # WebSocket protocol reference
│   └── CPP_GUIDE.md          # C++ learning guide
│
├── CMakeLists.txt
└── README.md
```
| Option | Default | Description |
|---|---|---|
| `--model` | (required) | Path to whisper model |
| `--vad-model` | (required) | Path to VAD model |
| `--port` | `9090` | WebSocket server port |
| `--host` | `0.0.0.0` | Bind address |
| `--token` | (none) | Authentication token for WebSocket connections |
| `--contexts` | `2` | Number of parallel transcription contexts |
| `--threads` | `4` | CPU threads per inference |
| `--step` | `500` | Inference interval (ms) |
| `--length` | `5000` | Audio context window (ms) |
| `--keep` | `200` | Overlap between windows (ms) |
| `--vad-threshold` | `0.5` | Voice activity detection threshold |
| `--vad-silence` | `1000` | Silence duration to trigger final (ms) |
| `--language` | `en` | Language code |
| `--no-gpu` | - | Disable Metal GPU |
| `--max-upload-size` | `25` | Max batch upload size in MB |
| `--max-audio-duration` | `300000` | Max batch duration in ms |
| `--batch-timeout` | `10000` | Batch context wait timeout (ms) |
Set `INFO_LEVEL=debug` to enable verbose logs (batch inference timing, resample info).
The default is `info`, but `./scripts/run.sh` sets `INFO_LEVEL=debug` unless overridden.
- Binary frames: 16-bit signed PCM audio at 16kHz mono
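To produce binary frames in that format, a client has to convert Web Audio's Float32 samples (range -1.0 to 1.0) to 16-bit signed integers. A minimal sketch (hypothetical helper, not part of this repo):

```typescript
// Convert Float32 audio samples in [-1, 1] to 16-bit signed PCM,
// clamping out-of-range values. The resulting Int16Array's buffer
// can be sent directly as a binary WebSocket frame.
function floatTo16BitPCM(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]));
    // Asymmetric scaling: int16 range is [-32768, 32767]
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```

A client would then call something like `ws.send(floatTo16BitPCM(chunk).buffer)` for each captured chunk.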
```json
{ "type": "ready", "model": "base.en", "contexts": 2 }
{ "type": "partial", "text": "Hello how are" }
{ "type": "final", "text": "Hello, how are you?" }
{ "type": "error", "message": "..." }
```

For any deployment accessible over a network, use token authentication.

Server:

```bash
./build/whisper-stream-server --token YOUR_SECRET_TOKEN ...
```

Client:

```typescript
const ws = new WebSocket('ws://host:port?token=YOUR_SECRET_TOKEN');
```

Connections without a valid token are rejected with HTTP 401.
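A client-side handler for the four server message types shown above might look like the following sketch (the `onFinal` callback and return values are illustrative, not part of the repo's API):

```typescript
// Discriminated union mirroring the server's JSON messages.
type ServerMessage =
  | { type: "ready"; model: string; contexts: number }
  | { type: "partial"; text: string }
  | { type: "final"; text: string }
  | { type: "error"; message: string };

// Parse one WebSocket text frame and dispatch on its type.
// Partial text is provisional and may be revised; final text is
// emitted once VAD detects the end of a speech segment.
function handleMessage(raw: string, onFinal: (text: string) => void): string {
  const msg = JSON.parse(raw) as ServerMessage;
  switch (msg.type) {
    case "ready":
      return `model ${msg.model} ready (${msg.contexts} contexts)`;
    case "partial":
      return msg.text;
    case "final":
      onFinal(msg.text);
      return msg.text;
    case "error":
      throw new Error(msg.message);
  }
}
```

Wiring it up is then just `ws.onmessage = (e) => handleMessage(e.data, commitTranscript)` for some application-defined `commitTranscript`.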
```bash
# Build first
./scripts/build.sh

# Install (downloads models automatically)
sudo ./install/install.sh --token YOUR_SECRET

# With all options
sudo ./install/install.sh --token YOUR_SECRET --port 9090 --contexts 4
```

The install script will:

- Build the binary if not already built
- Download the whisper model (~148 MB) from Hugging Face
- Download the VAD model (~1 MB) from Hugging Face
- Install the binary to `/usr/local/bin/`
- Install models to `/usr/local/share/whisper/`
- Create and start a launchd service
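The installed launchd service is defined by `com.whisper.stream-server.plist`. The actual file ships in `install/`; as a rough sketch of the shape such a plist takes (paths and model filenames here are assumptions based on the install locations above):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.whisper.stream-server</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/whisper-stream-server</string>
    <string>--model</string>
    <string>/usr/local/share/whisper/ggml-base.en.bin</string>
    <string>--vad-model</string>
    <string>/usr/local/share/whisper/ggml-silero-vad.bin</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
  <key>StandardOutPath</key>
  <string>/usr/local/var/log/whisper-stream-server.log</string>
  <key>StandardErrorPath</key>
  <string>/usr/local/var/log/whisper-stream-server.log</string>
</dict>
</plist>
```

`KeepAlive` is what makes launchd restart the server if it crashes, which is the mechanism behind the 24/7 operation mentioned above.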
```bash
# Check status
sudo launchctl list | grep whisper

# View logs
tail -f /usr/local/var/log/whisper-stream-server.log

# Stop
sudo launchctl unload /Library/LaunchDaemons/com.whisper.stream-server.plist

# Start
sudo launchctl load /Library/LaunchDaemons/com.whisper.stream-server.plist

# Verify port is listening
lsof -nP -iTCP:9090 -sTCP:LISTEN
```

```bash
# If token auth enabled
wscat -c 'ws://localhost:9090?token=YOUR_SECRET'

# If no token
wscat -c ws://localhost:9090
```

| Contexts | Model | RAM |
|---|---|---|
| 2 | base.en | ~850 MB |
| 4 | base.en | ~1.6 GB |
| 2 | small.en | ~1.8 GB |
| 4 | small.en | ~3.5 GB |
- ARCHITECTURE.md - Threading, context pooling, VAD
- API.md - WebSocket protocol, client examples
- CPP_GUIDE.md - C++ learning guide
MIT