Krill is a lightweight wrapper around the NVIDIA NeMo framework, specifically around the Parakeet models, that makes it simple to transcribe live audio streams in real time. It can be extended to transcribe meetings, calls, interviews, and other spoken-word audio, with speaker diarisation running alongside transcription.
The main motivation behind this wrapper is to generate real-time live transcripts by running all processing locally or on a server in a private network, without third-party APIs. Audio never leaves your device or infrastructure, which makes it a suitable, fully transparent tool for transcribing sensitive conversations.
It runs on a MacBook (tested on an M3) without a GPU, and can scale up to a GPU server for lower latency.
The client captures microphone audio in small chunks and streams it over a WebSocket. The server runs inside Docker, feeds each chunk through a NeMo streaming ASR model (and optionally a speaker diarisation model), and sends back a JSON response with the transcribed text, timestamps, and speaker segments. The client prints results incrementally — new words appear inline on the same line, and a new paragraph starts only when the next speech segment begins. Both the server and the client can run on the same machine (server in Docker, client in a local Python environment) or on separate machines within the same network.
At the moment, it is mainly intended for a live microphone stream, but you can extend the client to stream pre-recorded audio files or other live audio sources in small chunks over the WebSocket in the same way.
- Docker — to run the inference server.
- Python 3.8+ — for the local client (a lightweight `venv` with `sounddevice` and `websocket-client`).
- Microphone — for live audio capture.
Tested on a MacBook Pro M3. No GPU required for CPU inference (Apple Silicon performance cores are used automatically).
The server runs inside Docker to isolate the heavy NeMo / PyTorch / Vosk dependencies.
- Build the image:

  ```bash
  make build
  ```

- Run the container (drops you into a shell inside it):

  ```bash
  make run NETWORK="-p 9000:9000"
  ```

  To also expose Jupyter for experimentation:

  ```bash
  make run NETWORK="-p 9000:9000 -p 8888:8888"
  ```

- Start the inference server inside the container:

  ```bash
  python inference/inference.py --model-name "nvidia/parakeet-tdt-0.6b-v3" --diarisation
  ```

  The server will load the model, print context sizes and theoretical latency, then listen on port `9000`. Add `--verbose` if you want to print incremental transcriptions in the server console as well.

- Stop the server with `Ctrl+C`. On exit, a full session summary is printed — each segment with its timestamp and active speakers (if diarisation was enabled).
- Set up the virtual environment (first time only):

  ```bash
  python3 -m venv venv_client
  source venv_client/bin/activate
  pip install -r client/requirements.txt
  ```

- Run the client:

  ```bash
  source venv_client/bin/activate
  python -m client.capture_and_send --blocksize 15360 --host localhost --port 9000
  ```

  Transcriptions appear incrementally in the terminal — words are appended inline within a segment, and a new line starts when a new segment begins.
- `buffer` — The default choice; works with the `nvidia/parakeet-tdt-0.6b-v3` model. It provides the main benefits of that model: high accuracy, multilingual support with automatic language detection, punctuation, and more. It uses buffered streaming with explicit left/chunk/right context windows. Optionally, it can also run together with speaker diarisation, specifically with the `nvidia/diar_streaming_sortformer_4spk-v2.1` diarisation model.
- `cache` — Based on NVIDIA's demo for FastConformer hybrid cache-aware streaming; the model maintains its own internal state across chunks. It works with `stt_en_fastconformer_hybrid_large_streaming_multi` and similar streaming FastConformer models. This mode uses chunk-based streaming with internal state caching. No diarisation support yet.
- `vosk` — Lightweight CPU-only fallback using Vosk models. No NeMo dependency is needed for this path.
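For the `buffer` strategy, the context windows translate directly into sample counts at the 16 kHz sample rate. A rough theoretical latency bound for buffered streaming is the chunk length plus the right context, since a chunk cannot be finalised until its right context has arrived; how the server computes the figure it prints is an assumption, and the helper below is only a sketch.

```python
SAMPLE_RATE = 16000  # Hz, the default client --samplerate


def buffer_windows(chunk_secs=1.0, left_secs=6.0, right_secs=1.0):
    """Convert the buffer-strategy context windows into sample counts."""
    return {
        "left": int(left_secs * SAMPLE_RATE),
        "chunk": int(chunk_secs * SAMPLE_RATE),
        "right": int(right_secs * SAMPLE_RATE),
        # A chunk is only finalised once its right context has been
        # received, so latency is roughly chunk + right context.
        "theoretical_latency_secs": chunk_secs + right_secs,
    }
```

With the server defaults this gives a 96 000-sample left window, a 16 000-sample chunk, and a ~2 s theoretical latency, consistent with the constant 1.5-2 s latency mentioned for the Parakeet model.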
Every processed chunk returns a dict (serialised as JSON over the WebSocket):
```json
{
  "segment_start_time": 1710000000.123,
  "segment_start_time_formatted": "2026-03-19 14:23:00",
  "text": "hello this is a test",
  "word_timestamps": [...],
  "segment_timestamps": [...],
  "speaker_segments": {
    "spk_0": [[0.0, 1.4], [3.2, 5.1]],
    "spk_1": [[1.6, 3.1]]
  }
}
```

`word_timestamps` and `segment_timestamps` are populated only in `buffer` mode. `speaker_segments` is populated only in `buffer` mode when diarisation is enabled.
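A client-side handler for this payload might look like the following sketch. Only the JSON field names come from the schema above; the helper functions are hypothetical and not part of the project's client.

```python
import json


def format_result(message):
    """Turn one server JSON message into a printable line."""
    result = json.loads(message)
    # speaker_segments may be absent or empty when diarisation is off.
    speakers = sorted(result.get("speaker_segments") or {})
    prefix = "[" + ", ".join(speakers) + "] " if speakers else ""
    return f"{result['segment_start_time_formatted']} {prefix}{result['text']}"


def print_incremental(line, segment_done=False):
    """Rewrite the current terminal line so new words appear inline;
    emit a newline only when the segment is finished."""
    end = "\n" if segment_done else ""
    print("\r" + line, end=end, flush=True)
```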
- `nvidia/parakeet-tdt-0.6b-v3` (multilingual model; buffered streaming). With this model, you may want to adjust `--chunk-secs`, `--left-context-secs`, and `--right-context-secs` for optimal performance. However, the default values should work reasonably well for general use, especially when a constant 1.5-2 s latency is acceptable. The client `--blocksize` should be set to `15360` (0.96 s at 16 kHz) for smooth streaming with the default `--chunk-secs` value. Supported languages: Bulgarian (bg), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Finnish (fi), French (fr), German (de), Greek (el), Hungarian (hu), Italian (it), Latvian (lv), Lithuanian (lt), Maltese (mt), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Slovak (sk), Slovenian (sl), Spanish (es), Swedish (sv), Ukrainian (uk).
- `stt_en_fastconformer_hybrid_large_streaming_multi` (English-only model with cache-aware streaming). With this model, you may want to adjust only `--lookahead-size`, since `--encoder-step-length` will most likely be 80 ms for most FastConformer models. Supported lookahead sizes are 0, 80, 480, and 1040 ms. Other streaming FastConformer models from NVIDIA NeMo can also be used, including models for other languages. With default settings, it expects chunks of `17920` samples (1.12 s at 16 kHz) from the client.
- `vosk-model-en-us-0.22` (English-only, lightweight). Any Vosk model can be used here as well, so other languages may be supported depending on the model. See Vosk Models for more details. In the Vosk case, nothing needs to be adjusted on the server side. It can work with any chunk size from the client, but it was also tested with `17920` samples (1.12 s at 16 kHz).
- `--host` (default: `0.0.0.0`) — Bind address.
- `--port` (default: `9000`) — WebSocket port.
- `--device` (default: `cpu`) — `cpu` or `cuda`.
- `--verbose` — Print incremental transcriptions to the server console as they arrive.
ASR model:
- `--model-name` (default: `nvidia/parakeet-tdt-0.6b-v3`) — ASR model to load.
- `--chunk-secs` (default: `1.0`) — Chunk size in seconds (`buffer` strategy).
- `--left-context-secs` (default: `6.0`) — Left context window in seconds (`buffer` strategy).
- `--right-context-secs` (default: `1.0`) — Right context window in seconds (`buffer` strategy).
- `--lookahead-size` (default: `1040`) — Lookahead in ms (`cache` strategy).
- `--encoder-step-length` (default: `80`) — Encoder step length in ms (`cache` strategy).
- `--use-transcribe-method` — Use NeMo's `.transcribe()` instead of streaming; not suitable for live audio, but useful for quick testing.
Diarisation:
- `--diarisation` — Enable speaker diarisation (only supported with the `buffer` strategy).
- `--diarisation-model-name` (default: `nvidia/diar_streaming_sortformer_4spk-v2.1`) — Diarisation model to load.
- `--diarisation-threshold` (default: `0.5`) — Probability threshold for marking a speaker as active.
- `--max-diarisation-preds` (default: `1000`) — Maximum number of frame-level predictions to keep in memory.
- `--spkcache-update-period` (default: `340`) — How often (in encoder steps) the speaker cache is updated.
- `--spkcache-len` (default: `340`) — Length of the speaker cache in encoder steps.
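The `--diarisation-threshold` flag corresponds to the familiar post-processing step of thresholding per-frame speaker probabilities into active segments. The sketch below illustrates that idea only; it is not the server's actual implementation, and the 80 ms frame step is an assumption based on the encoder step length above.

```python
FRAME_SECS = 0.08  # assumed frame step, matching the 80 ms encoder step


def probs_to_segments(frame_probs, threshold=0.5):
    """Collapse per-frame probabilities for one speaker into segments.

    frame_probs: list of per-frame activity probabilities.
    Returns merged [start_sec, end_sec] pairs where prob >= threshold.
    """
    segments = []
    start = None
    for i, p in enumerate(frame_probs):
        if p >= threshold and start is None:
            start = i * FRAME_SECS          # speaker becomes active
        elif p < threshold and start is not None:
            segments.append([start, i * FRAME_SECS])  # speaker goes quiet
            start = None
    if start is not None:                   # still active at the end
        segments.append([start, len(frame_probs) * FRAME_SECS])
    return segments
```

Raising the threshold makes the diariser more conservative (fewer, shorter active segments); lowering it does the opposite.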
- `--host` (default: `localhost`) — Server hostname or IP address.
- `--port` (default: `9000`) — Server port.
- `--samplerate` (default: `16000`) — Microphone sample rate in Hz.
- `--channels` (default: `1`) — Number of audio channels.
- `--blocksize` (default: `8000`) — Number of samples per chunk sent to the server. It should match the model's expected chunk size (for example, `15360` for Parakeet and `17920` for FastConformer).
- `--device` — Input device ID. Omit to use the system default.
- `--list-devices` — Print all available audio input devices and exit.
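Because `--blocksize` is given in samples, it is easy to sanity-check the chunk duration it implies before starting a session. A trivial helper (not part of the client):

```python
def block_duration_secs(blocksize, samplerate=16000):
    """Duration of one client audio block in seconds."""
    return blocksize / samplerate


# The blocksizes recommended above:
#   Parakeet (buffer strategy):      15360 samples -> 0.96 s
#   FastConformer (cache strategy):  17920 samples -> 1.12 s
```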
- Port conflicts: Make sure port `9000` is free on your host machine. Configure a different port on both the server and the client if needed.
- Wrong microphone: Run `python client/capture_and_send.py --list-devices` to list available inputs and pass the correct ID with `--device`.
- Slow / high latency: Try reducing `--left-context-secs`, and optionally `--chunk-secs` as well, especially when processing on CPU. These suggestions apply mainly to the `buffer` strategy.
- With the current implementation of the `buffer` strategy and the `nvidia/parakeet-tdt-0.6b-v3` model, some words or larger speech segments may occasionally be missed, causing the Stream Manager to reset its state. This happens more frequently during long periods of uninterrupted speech. Buffer parameters such as the chunk size and the left/right context sizes may need additional tuning in this case.
- Overall ASR quality is not ideal, and there are occasional misrecognitions, especially with quiet speech or in noisy environments. Additional tuning of the encoding or decoding parameters, or more appropriate chunk sizes, may help.
- Diarisation quality is not ideal either, especially when speakers have similar voices or there is a lot of cross-talk; sometimes even a single speaker's speech is split up as if multiple speakers were present. It may be improved by adjusting the speaker cache length and update period parameters.
Licensed under the Apache License, Version 2.0. See LICENSE for details.