Offline speech-to-text transcription tool for long audio files. Built with whisper.cpp and ten-vad for voice activity detection.
- MP3 input with automatic resampling to 16kHz mono
- VAD-based segmentation — splits audio by silence, transcribes each chunk
- Accurate timestamps per segment
- Runs fully offline, no API keys required
- Metal GPU acceleration on macOS (Apple Silicon)
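As a rough illustration of what "resampling to 16kHz mono" involves, here is a minimal sketch in Go. It is not the tool's actual code: the function names `downmix` and `resampleLinear` are hypothetical, and linear interpolation is used only because it is short. A production resampler would normally low-pass filter before downsampling to avoid aliasing.

```go
package main

import "fmt"

// downmix averages interleaved stereo samples (L, R, L, R, ...) into mono.
// Hypothetical helper, not part of whisper-ihm.
func downmix(stereo []float32) []float32 {
	mono := make([]float32, len(stereo)/2)
	for i := range mono {
		mono[i] = (stereo[2*i] + stereo[2*i+1]) / 2
	}
	return mono
}

// resampleLinear converts mono samples from srcRate to dstRate using
// linear interpolation between neighbouring input samples.
// Hypothetical helper; no anti-aliasing filter is applied.
func resampleLinear(in []float32, srcRate, dstRate int) []float32 {
	if len(in) == 0 || srcRate == dstRate {
		return in
	}
	n := len(in) * dstRate / srcRate
	out := make([]float32, n)
	step := float64(srcRate) / float64(dstRate)
	for i := range out {
		pos := float64(i) * step
		j := int(pos)
		frac := float32(pos - float64(j))
		next := j + 1
		if next >= len(in) {
			next = len(in) - 1 // clamp at the last input sample
		}
		out[i] = in[j]*(1-frac) + in[next]*frac
	}
	return out
}

func main() {
	stereo := []float32{0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5} // 6 stereo frames
	mono := downmix(stereo)
	fmt.Println(resampleLinear(mono, 48000, 16000))
}
```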
brew install tggo/tap/whisper-ihm

Requires Go 1.23+, CMake, Git.
git clone https://github.com/tggo/whisper.ihm.git && cd whisper.ihm
make setup # clones deps, builds whisper.cpp, downloads model (~3 GB)
make build # compiles the binary
./whisper-ihm recording.mp3

Usage: whisper-ihm [flags] <input.mp3>
Flags:
-model string Path to GGML model (default "models/ggml-large-v3.bin")
-lang string Language code (default "auto")
-threads int Number of threads (default: all CPUs)
-help Show help
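
For readers curious how a flag set like this is typically wired up in Go, the following sketch uses the standard `flag` package with the documented names and defaults. The `options` struct and `parseArgs` function are hypothetical and are not taken from the whisper-ihm source; `-help` is handled by the `flag` package itself.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"runtime"
)

// options mirrors the documented flags. The struct and its field names
// are hypothetical, chosen for this example only.
type options struct {
	model   string
	lang    string
	threads int
	input   string
}

// parseArgs parses command-line arguments (without the program name).
// It expects exactly one positional argument: the input MP3 path.
func parseArgs(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("whisper-ihm", flag.ContinueOnError)
	fs.StringVar(&o.model, "model", "models/ggml-large-v3.bin", "Path to GGML model")
	fs.StringVar(&o.lang, "lang", "auto", "Language code")
	fs.IntVar(&o.threads, "threads", runtime.NumCPU(), "Number of threads")
	if err := fs.Parse(args); err != nil {
		return o, err // includes flag.ErrHelp when -help is passed
	}
	if fs.NArg() != 1 {
		return o, fmt.Errorf("expected exactly one input file, got %d", fs.NArg())
	}
	o.input = fs.Arg(0)
	return o, nil
}

func main() {
	o, err := parseArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("model=%s lang=%s threads=%d input=%s\n", o.model, o.lang, o.threads, o.input)
}
```

Note that Go's `flag` package requires flags to come before the positional argument, which matches the `whisper-ihm [flags] <input.mp3>` form above.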
[00:00:01.200 -> 00:00:05.800] Hello, how are you today?
[00:00:06.100 -> 00:00:09.400] I'm doing well, thank you.
Download a pre-built binary from Releases:
# macOS (Apple Silicon)
curl -L https://github.com/tggo/whisper.ihm/releases/latest/download/whisper-ihm-darwin-arm64.tar.gz | tar xz
# Linux (amd64)
curl -L https://github.com/tggo/whisper.ihm/releases/latest/download/whisper-ihm-linux-amd64.tar.gz | tar xz
# Download the whisper model (~3 GB)
mkdir -p models
curl -L -o models/ggml-large-v3.bin \
https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin
./whisper-ihm recording.mp3

# Build the image (Linux, CPU-only)
docker build -t whisper-ihm .
# Transcribe (the model and the recording must already be in ./data)
docker run -v "$(pwd)/data:/data" whisper-ihm -model /data/ggml-large-v3.bin /data/recording.mp3

The Dockerfile uses a multi-stage build: golang:1.23-bookworm for building (clones whisper.cpp + ten-vad, compiles with CGO) and debian:bookworm-slim for runtime.
- `-trimpath` strips local filesystem paths from the binary
- macOS builds include Metal GPU acceleration
- Linux/Docker builds are CPU-only
- Decode MP3 to PCM, resample to 16kHz mono
- Run VAD (ten-vad) to detect speech segments, split on ~500ms silence gaps
- Feed each segment to whisper.cpp with timestamp offsets
- Print `[start -> end] text` for each whisper segment
MIT