Voice input for CLI tools, powered by Voxtral Mini 4B transcription. Everything runs locally on your machine.
Record a voice prompt, transcribe it, and launch a CLI tool with the result:
# Transcribe and pass to claude as the initial prompt
speak-to claude
# Same, with extra arguments forwarded to the client
speak-to claude -- --model sonnet
# Transcribe to stdout (for piping)
speak-to
The flow: load model, record until you press Enter, transcribe, exec the client with the transcribed text as the first argument.
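The last step of that flow can be sketched in Rust as follows. This is a minimal illustration, not the project's actual code: the helper names `client_args` and `exec_client` are hypothetical, and both the trimming of the transcript and the placement of forwarded arguments after it are assumptions based on the "first argument" wording above.

```rust
use std::os::unix::process::CommandExt;
use std::process::Command;

/// Build the client's argument list: the transcript first, then any
/// arguments that followed `--` on the speak-to command line.
/// (Hypothetical helper; ordering and trimming are assumptions.)
fn client_args(transcript: &str, forwarded: &[String]) -> Vec<String> {
    let mut args = Vec::with_capacity(1 + forwarded.len());
    // Drop leading/trailing whitespace the transcriber may have emitted.
    args.push(transcript.trim().to_string());
    args.extend(forwarded.iter().cloned());
    args
}

/// Replace the current process with the client (Unix only).
/// `exec` returns only if launching the client failed, so the
/// return value is always the error.
fn exec_client(client: &str, transcript: &str, forwarded: &[String]) -> std::io::Error {
    Command::new(client)
        .args(client_args(transcript, forwarded))
        .exec()
}
```

Using `exec` rather than spawning a child means the client takes over the terminal directly, with no wrapper process left in between.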
Wrap any interactive CLI tool in a PTY and trigger voice input at any time with Ctrl+\:
speak-to -i claude
speak-to -i kiro-cli
speak-to -i claude -- --model sonnet
The client runs normally in your terminal. Press Ctrl+\ to start recording, speak, then press Enter. The transcribed text is injected into the client's input as a paste. Press Ctrl+C during recording to cancel.
The model stays loaded in memory between recordings so subsequent transcriptions are fast.
Requires a working Rust toolchain.
cargo install --path .
On first run, the Q4 model (~4GB) is downloaded from Hugging Face and cached locally.
speak-to --download
macOS will prompt for microphone access on first use. Grant it to your terminal emulator (iTerm2, Terminal.app, etc.).
The default is Voxtral Mini Q4 (quantized, ~4GB download, ~700MB memory). For lower word error rate at the cost of ~9GB memory:
speak-to --f32 claude
speak-to --download --f32
You can also point to a local model:
# Q4 GGUF file
speak-to --model-path /path/to/voxtral-q4.gguf claude
# F32 SafeTensors directory
speak-to --model-path /path/to/model-dir/ claude
One-shot mode records from your mic, transcribes locally on the GPU via burn, and execs the target CLI with the result.
Interactive mode wraps the client in a PTY so speak-to sits between you and the client. It intercepts the trigger keystroke, records and transcribes, then pastes the text into the client's prompt. The client has no idea anything special happened -- it just sees text appear as if you typed it.
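The interception and paste steps might look roughly like the sketch below. It assumes the terminal is in raw mode, where Ctrl+\ arrives as the byte 0x1C instead of raising SIGQUIT, and that the text is injected with the standard bracketed-paste escape sequences. Whether speak-to does exactly this is an assumption, and `Action`, `classify`, and `as_paste` are hypothetical names used only for illustration.

```rust
/// Ctrl+\ on a raw-mode terminal is the ASCII FS control byte.
const TRIGGER: u8 = 0x1c;

/// What the proxy does with a chunk of bytes read from the user's terminal.
#[derive(Debug, PartialEq)]
enum Action {
    /// No trigger present: pass the bytes to the client unchanged.
    Forward(Vec<u8>),
    /// Trigger seen: pass along the bytes typed before it, then record.
    /// This sketch discards anything after the trigger in the same chunk.
    Record { before: Vec<u8> },
}

/// Scan one input chunk for the trigger byte.
fn classify(input: &[u8]) -> Action {
    match input.iter().position(|&b| b == TRIGGER) {
        Some(i) => Action::Record { before: input[..i].to_vec() },
        None => Action::Forward(input.to_vec()),
    }
}

/// Wrap text in bracketed-paste markers (ESC[200~ ... ESC[201~) so a
/// client that enables bracketed paste treats it as a single paste.
fn as_paste(text: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(text.len() + 12);
    out.extend_from_slice(b"\x1b[200~");
    out.extend_from_slice(text.as_bytes());
    out.extend_from_slice(b"\x1b[201~");
    out
}
```

In a real proxy, the bytes returned by `as_paste` would be written to the PTY master, which is why the client cannot tell the text apart from a paste made by the user.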
- macOS (uses CoreAudio for recording and system sounds)
- GPU with wgpu support (Metal on macOS)
- ~4GB disk for the Q4 model (~9GB for F32)
Transcription is powered by voxtral-mini-realtime-rs by TrevorS, a Rust implementation of Mistral's Voxtral Mini model using the burn framework.
Apache-2.0