whisper.ihm

Offline speech-to-text transcription tool for long audio files. Built on whisper.cpp for transcription and ten-vad for voice activity detection (VAD).

Features

MP3 input with automatic resampling to 16 kHz mono
VAD-based segmentation: splits audio at silences and transcribes each chunk separately
Accurate timestamps per segment
Runs fully offline, no API keys required
Metal GPU acceleration on macOS (Apple Silicon)

Install

Homebrew (macOS)

brew install tggo/tap/whisper-ihm

From source

Requires Go 1.23+, CMake, Git.

git clone https://github.com/tggo/whisper.ihm.git && cd whisper.ihm
make setup   # clones deps, builds whisper.cpp, downloads model (~3 GB)
make build   # compiles the binary
./whisper-ihm recording.mp3

Usage

Usage: whisper-ihm [flags] <input.mp3>

Flags:
  -model string    Path to GGML model (default "models/ggml-large-v3.bin")
  -lang string     Language code (default "auto")
  -threads int     Number of threads (default: all CPUs)
  -help            Show help
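
The flag listing above has the shape produced by Go's standard flag package. As a minimal sketch, assuming the tool parses its arguments that way, the defaults could be wired up as follows. The config struct and the parseArgs helper are hypothetical names used for illustration, not taken from main.go.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"runtime"
)

// config mirrors the flags listed in the usage text (hypothetical type).
type config struct {
	model   string
	lang    string
	threads int
	input   string
}

// parseArgs parses args (without the program name) into a config.
// It returns an error if a flag is invalid or the input file is missing.
func parseArgs(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("whisper-ihm", flag.ContinueOnError)
	fs.StringVar(&c.model, "model", "models/ggml-large-v3.bin", "Path to GGML model")
	fs.StringVar(&c.lang, "lang", "auto", "Language code")
	// Assumption: "all CPUs" corresponds to runtime.NumCPU().
	fs.IntVar(&c.threads, "threads", runtime.NumCPU(), "Number of threads")
	if err := fs.Parse(args); err != nil {
		return c, err
	}
	if fs.NArg() != 1 {
		return c, fmt.Errorf("expected exactly one input file, got %d", fs.NArg())
	}
	c.input = fs.Arg(0)
	return c, nil
}

func main() {
	c, err := parseArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("model=%s lang=%s threads=%d input=%s\n", c.model, c.lang, c.threads, c.input)
}
```

The flag package stops parsing at the first non-flag argument, so flags must come before the input file, which matches the usage line.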

Output format

[00:00:01.200 -> 00:00:05.800] Hello, how are you today?
[00:00:06.100 -> 00:00:09.400] I'm doing well, thank you.
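
For post-processing or testing, a line in this format can be produced from a pair of durations. The sketch below is illustrative only: formatTimestamp and formatLine are hypothetical helpers, not functions from this repository.

```go
package main

import (
	"fmt"
	"time"
)

// formatTimestamp renders d as HH:MM:SS.mmm, truncating below one millisecond.
func formatTimestamp(d time.Duration) string {
	ms := d.Milliseconds()
	h := ms / 3_600_000
	m := ms / 60_000 % 60
	s := ms / 1000 % 60
	return fmt.Sprintf("%02d:%02d:%02d.%03d", h, m, s, ms%1000)
}

// formatLine renders one transcript line as "[start -> end] text".
func formatLine(start, end time.Duration, text string) string {
	return fmt.Sprintf("[%s -> %s] %s", formatTimestamp(start), formatTimestamp(end), text)
}

func main() {
	fmt.Println(formatLine(1200*time.Millisecond, 5800*time.Millisecond, "Hello, how are you today?"))
}
```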

Install from release

Download a pre-built binary from Releases:

# macOS (Apple Silicon)
curl -L https://github.com/tggo/whisper.ihm/releases/latest/download/whisper-ihm-darwin-arm64.tar.gz | tar xz

# Linux (amd64)
curl -L https://github.com/tggo/whisper.ihm/releases/latest/download/whisper-ihm-linux-amd64.tar.gz | tar xz

# Download the whisper model (~3 GB)
mkdir -p models
curl -L -o models/ggml-large-v3.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin

./whisper-ihm recording.mp3

Docker

# Build the image (Linux, CPU-only)
docker build -t whisper-ihm .

# Transcribe (expects ggml-large-v3.bin and recording.mp3 in ./data)
docker run -v "$(pwd)/data":/data whisper-ihm -model /data/ggml-large-v3.bin /data/recording.mp3

The Dockerfile uses a multi-stage build: golang:1.23-bookworm for building (clones whisper.cpp + ten-vad, compiles with CGO), debian:bookworm-slim for runtime.

Build details

-trimpath strips local filesystem paths from the binary
macOS builds include Metal GPU acceleration
Linux/Docker builds are CPU-only

How it works

1. Decode MP3 to PCM and resample to 16 kHz mono
2. Run VAD (ten-vad) to detect speech segments, splitting on silence gaps of ~500 ms
3. Feed each segment to whisper.cpp with timestamp offsets
4. Print [start -> end] text for each whisper segment
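
The segmentation and offset steps above can be sketched as follows. This is not the code in vad.go: it assumes the VAD yields one speech/non-speech flag per fixed-length frame, and the span type, splitOnSilence, offset, and the 100 ms frame length in the demo are all hypothetical choices for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// span is a half-open time range [Start, End).
type span struct {
	Start, End time.Duration
}

// splitOnSilence groups per-frame speech flags into spans. A span is closed
// only once silence has lasted at least minGap; shorter pauses stay inside it.
func splitOnSilence(speech []bool, frame, minGap time.Duration) []span {
	var out []span
	inSpeech := false
	var start, lastSpeechEnd time.Duration
	for i, s := range speech {
		t := time.Duration(i) * frame
		if s {
			if !inSpeech {
				inSpeech = true
				start = t
			}
			lastSpeechEnd = t + frame
		} else if inSpeech && t+frame-lastSpeechEnd >= minGap {
			out = append(out, span{start, lastSpeechEnd})
			inSpeech = false
		}
	}
	if inSpeech {
		// The recording ended while speech was still open.
		out = append(out, span{start, lastSpeechEnd})
	}
	return out
}

// offset shifts a chunk-relative span to absolute time in the recording.
func offset(chunkStart time.Duration, rel span) span {
	return span{chunkStart + rel.Start, chunkStart + rel.End}
}

func main() {
	// Hypothetical VAD output: one flag per 100 ms frame.
	flags := []bool{true, true, false, false, true, false, false, false, false, false, true, true}
	for _, seg := range splitOnSilence(flags, 100*time.Millisecond, 500*time.Millisecond) {
		// Pretend the transcriber returned one segment covering the whole chunk.
		abs := offset(seg.Start, span{0, seg.End - seg.Start})
		fmt.Printf("[%v -> %v]\n", abs.Start, abs.End)
	}
}
```

Because each chunk is transcribed on its own, the transcriber reports times relative to the chunk start. Adding the chunk's start time back is what keeps the printed timestamps aligned with the original recording.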

License

MIT
