A self-hosted speech-to-text service powered by Kyutai STT. Streams words in real time from any device on your network: Wayland desktops via push-to-talk, AR glasses via a WebSocket relay. Runs entirely on your hardware.
speak → words appear live ✦ no cloud, no subscription, minimal latency
```
┌─────────────────────────────────────────────────┐
│  GPU workstation (always on)                    │
│                                                 │
│  stt-anywhere.py (one systemd service)          │
│  ├── push-to-talk keyboard → wtype (local)      │
│  ├── :8099 WebSocket relay (remote)             │
│  │                                              │
│  │   both send audio to:                        │
│  │        ▼                                     │
│  │   moshi-server :8098 (CUDA)                  │
│  │   audio → text (~500ms latency)              │
│  │   model in VRAM (~2.4 GB)                    │
│  └──────────────────────────────────────────────│
└────────────────────────┬────────────────────────┘
                         │
          Tailscale / LAN (your network)
                         │
       ┌─────────────────┼────────────────┐
       │                 │                │
       ▼                 ▼                ▼
┌─────────────┐  ┌──────────────┐  ┌─────────────┐
│   Desktop   │  │  AR Glasses  │  │   Laptop    │
│  (Wayland)  │  │  (Even G2)   │  │  (Wayland)  │
│             │  │              │  │             │
│  Mod+Space  │  │  deltaclaw   │  │  Mod+Space  │
│  to record  │  │  web app     │  │  to record  │
│  wtype to   │  │  tap to talk │  │  wtype to   │
│  cursor     │  │  voice msgs  │  │  cursor     │
└─────────────┘  └──────────────┘  └─────────────┘
```
Press a key, speak, release. Words stream into whatever text field has focus via wtype. Works on niri, Hyprland, Sway, COSMIC, and any other compositor that supports the virtual-keyboard protocol wtype relies on.
The relay server accepts WebSocket audio from remote clients (e.g. deltaclaw on Even Realities G2 glasses) and returns transcribed words as JSON.
- Real-time: words appear while you’re still talking
- Private: audio never leaves your machine or network
- Multi-device: one GPU serves desktops, laptops, glasses, anything
- Set and forget: systemd services start with your session
Add the flake input and import the Home Manager module:
```nix
# flake.nix
{
  inputs.stt-anywhere.url = "github:mwlaboratories/stt-anywhere";

  # in your home-manager config:
  imports = [ inputs.stt-anywhere.homeManagerModules.default ];

  services.stt-anywhere = {
    enable = true;
    cudaCapability = "8.6"; # RTX 3060-3090
  };
}
```

Rebuild, bind a key to toggle recording, done:
```kdl
// niri
binds {
    // toggle recording
    Mod+Space { spawn "systemctl" "--user" "kill" "-s" "USR1" "stt-anywhere.service"; }

    // start/stop the service
    Mod+T { spawn "sh" "-c" "if systemctl --user is-active --quiet stt-anywhere.service; then systemctl --user stop stt-anywhere.service && notify-send 'stt-anywhere' 'Stopped'; else systemctl --user start stt-anywhere.service && notify-send 'stt-anywhere' 'Started'; fi"; }
}
```

The model (~2.4 GB) downloads from HuggingFace on first run.
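The Mod+Space binding works because the service toggles recording when it receives SIGUSR1, which is what `systemctl --user kill -s USR1` delivers. A minimal sketch of that signal-toggle pattern, not the service's actual implementation:

```python
import os
import signal

recording = False

def toggle(signum, frame):
    # Flip the recording state whenever SIGUSR1 arrives.
    global recording
    recording = not recording

# Install the handler, as the service would at startup.
signal.signal(signal.SIGUSR1, toggle)

os.kill(os.getpid(), signal.SIGUSR1)  # simulate the keybinding firing once
print(recording)  # True
```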
Minimal local setup (push-to-talk only):

```nix
services.stt-anywhere = {
  enable = true;
  cudaCapability = "8.6";
};
```

Also serve remote clients over Tailscale or your LAN:

```nix
services.stt-anywhere = {
  enable = true;
  cudaCapability = "8.6";
  serverAddr = "0.0.0.0"; # accept moshi connections over Tailscale
  relayPort = 8099;       # WebSocket relay for remote clients
};
```

Client-only mode, connecting to a workstation that runs the server:

```nix
services.stt-anywhere = {
  enable = true;
  enableServer = false;
  serverUrl = "ws://workstation:8098";
};
```

Server requirements:

- NVIDIA GPU with CUDA
- NixOS (flake-based)
- Wayland compositor + PipeWire

Client requirements:

- NixOS (flake-based)
- No GPU needed when connecting to a remote server
| GPU | Capability |
|---|---|
| RTX 3060-3090 | "8.6" |
| RTX 4060-4090 | "8.9" |
| RTX 5070-5090 | "12.0" |