Krill — Live Audio Transcription Using NVIDIA NeMo Models

Krill is a lightweight wrapper around the NVIDIA NeMo framework, specifically the Parakeet models, that makes it simple to transcribe live audio streams in real time. It can be extended to transcribe meetings, calls, interviews, and other spoken-word audio, with speaker diarisation running alongside transcription.

The main motivation behind this wrapper is to generate real-time live transcripts by running processing locally or on a server in a private network, without using third-party APIs. Audio never leaves your device or infrastructure, which makes it a suitable, fully transparent tool for transcribing sensitive conversations.

It runs on a MacBook (tested on M3) without a GPU, and can scale up to a GPU server for lower latency.


How It Works

The client captures microphone audio in small chunks and streams it over a WebSocket. The server runs inside Docker, feeds each chunk through a NeMo streaming ASR model (and optionally a speaker diarisation model), and sends back a JSON response with the transcribed text, timestamps, and speaker segments. The client prints results incrementally — new words appear inline on the same line, and a new paragraph starts only when the next speech segment begins. Both the server and the client can run on the same machine (server in Docker, client in a local Python environment) or on separate machines within the same network.

At the moment, it is mainly intended for a live microphone stream, but you can extend the client to stream pre-recorded audio files or other live audio sources in small chunks over the WebSocket in the same way.
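As one way to extend the client for pre-recorded files, the audio can be sliced into the same fixed-size blocks the live client sends. The sketch below assumes the server accepts raw 16-bit mono PCM frames as binary WebSocket messages (the function names here are hypothetical, not part of Krill's API):

```python
import wave

def iter_blocks(pcm: bytes, blocksize: int = 15360):
    """Yield fixed-size blocks of 16-bit PCM, zero-padding the final block.

    Each block would become one WebSocket message; at 16 kHz mono,
    15360 samples (30720 bytes) is 0.96 s of audio.
    """
    nbytes = blocksize * 2  # 2 bytes per 16-bit sample
    for start in range(0, len(pcm), nbytes):
        yield pcm[start:start + nbytes].ljust(nbytes, b"\x00")

def stream_file(path: str, send):
    """Read a 16 kHz mono WAV file and pass each block to `send`."""
    with wave.open(path, "rb") as wav:
        pcm = wav.readframes(wav.getnframes())
    for block in iter_blocks(pcm):
        # With websocket-client, `send` could be:
        #   lambda b: ws.send(b, opcode=websocket.ABNF.OPCODE_BINARY)
        send(block)
```

Sending at roughly real-time pace (one block per 0.96 s) would mimic the live microphone client most closely.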

Prerequisites

  • Docker — to run the inference server.
  • Python 3.8+ — for the local client (a lightweight venv with sounddevice and websocket-client).
  • Microphone — for live audio capture.

Tested on a MacBook Pro M3. No GPU is required for CPU inference (Apple Silicon performance cores are used automatically).

Setup & Running

1. Server (Docker)

The server runs inside Docker to isolate the heavy NeMo / PyTorch / Vosk dependencies.

  1. Build the image:

    make build
  2. Run the container (drops you into a shell inside it):

    make run NETWORK="-p 9000:9000"

    To also expose Jupyter for experimentation:

    make run NETWORK="-p 9000:9000 -p 8888:8888"
  3. Start the inference server inside the container:

    python inference/inference.py --model-name "nvidia/parakeet-tdt-0.6b-v3" --diarisation

    The server will load the model, print context sizes and theoretical latency, then listen on port 9000.

    Add --verbose if you want to print incremental transcriptions in the server console as well.

  4. Stop the server with Ctrl+C. On exit, a full session summary is printed — each segment with its timestamp and active speakers (if diarisation was enabled).

2. Client (Local)

  1. Set up the virtual environment (first time only):

    python3 -m venv venv_client
    source venv_client/bin/activate
    pip install -r client/requirements.txt
  2. Run the client:

    source venv_client/bin/activate
    python -m client.capture_and_send --blocksize 15360 --host localhost --port 9000

    Transcriptions appear incrementally in the terminal — words are appended inline within a segment and a new line starts when a new segment begins.

Configuration

Streaming Strategies

  1. buffer - This is the default choice and works with the nvidia/parakeet-tdt-0.6b-v3 model. It provides the main benefits of that model: high accuracy, multilingual support with automatic language detection, punctuation, and more. It uses buffered streaming with explicit left/chunk/right context windows. Optionally, it can also run together with speaker diarisation, specifically with the nvidia/diar_streaming_sortformer_4spk-v2.1 diarisation model.
  2. cache - Based on NVIDIA's demo for FastConformer hybrid cache-aware streaming; the model maintains its own internal state across chunks. It works with stt_en_fastconformer_hybrid_large_streaming_multi and similar streaming FastConformer models. This mode uses chunk-based streaming with internal state caching. No diarisation support yet.
  3. vosk - Lightweight CPU-only fallback using Vosk models. No NeMo dependency needed for this path.
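How a model name might map onto one of these three strategies can be illustrated with a small helper. This is a hypothetical sketch for orientation, not Krill's actual selection logic:

```python
def pick_strategy(model_name: str) -> str:
    """Guess the streaming strategy from the model name (illustrative only)."""
    name = model_name.lower()
    if name.startswith("vosk"):
        return "vosk"    # lightweight CPU-only fallback, no NeMo dependency
    if "streaming" in name:
        return "cache"   # cache-aware FastConformer streaming
    return "buffer"      # default: buffered left/chunk/right context windows
```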

Result Format

Every processed chunk returns a dict (serialised as JSON over the WebSocket):

{
  "segment_start_time": 1710000000.123,
  "segment_start_time_formatted": "2026-03-19 14:23:00",
  "text": "hello this is a test",
  "word_timestamps": [...],
  "segment_timestamps": [...],
  "speaker_segments": {
    "spk_0": [[0.0, 1.4], [3.2, 5.1]],
    "spk_1": [[1.6, 3.1]]
  }
}

word_timestamps and segment_timestamps are populated only in buffer mode. speaker_segments is populated only in buffer mode when diarisation is enabled.
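A client could consume these responses roughly as follows. The field names match the format above; the rendering helpers are a sketch, not part of the shipped client:

```python
import json

def active_speakers(result):
    """Return the sorted speaker labels present in a result dict, if any."""
    return sorted(result.get("speaker_segments", {}))

def render(result):
    """Format one chunk's result as a single display line."""
    speakers = active_speakers(result)
    prefix = f"[{', '.join(speakers)}] " if speakers else ""
    return prefix + result["text"]

# Example message, mirroring the format documented above
message = json.loads("""
{
  "segment_start_time": 1710000000.123,
  "text": "hello this is a test",
  "speaker_segments": {"spk_0": [[0.0, 1.4], [3.2, 5.1]], "spk_1": [[1.6, 3.1]]}
}
""")
print(render(message))  # [spk_0, spk_1] hello this is a test
```

Since `speaker_segments` is absent outside buffer mode with diarisation, the helpers above fall back to plain text when it is missing.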


Supported Models (--model-name)

  • nvidia/parakeet-tdt-0.6b-v3 (multilingual; buffered streaming; see the model card for more details). With this model, you may want to adjust --chunk-secs, --left-context-secs, and --right-context-secs for optimal performance. However, the default values should work reasonably well for general use, especially when a constant 1.5-2 s latency is acceptable. The client --blocksize should be set to 15360 (0.96 s at 16 kHz) for smooth streaming with the default --chunk-secs value. Supported languages: Bulgarian (bg), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Finnish (fi), French (fr), German (de), Greek (el), Hungarian (hu), Italian (it), Latvian (lv), Lithuanian (lt), Maltese (mt), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Slovak (sk), Slovenian (sl), Spanish (es), Swedish (sv), Ukrainian (uk).

  • stt_en_fastconformer_hybrid_large_streaming_multi (English-only model with cache-aware streaming; see the model card for more details). With this model, you may want to adjust only --lookahead-size, since --encoder-step-length will most likely be 80 ms for most FastConformer models. Supported lookahead sizes are 0, 80, 480, and 1040 ms. Other streaming FastConformer models from NVIDIA NeMo can also be used, including models for other languages. With default settings, it expects chunks of 17920 samples (1.12 s at 16 kHz) from the client.

  • vosk-model-en-us-0.22 (English-only, lightweight). Any Vosk model can be used here as well, so other languages may be supported depending on the model. See Vosk Models for more details. In the Vosk case, nothing needs to be adjusted on the server side. It can work with any chunk size from the client, but it was tested with 17920 samples (1.12 s at 16 kHz) as well.

Server Arguments (inference/inference.py)

  • --host (default: 0.0.0.0) — Bind address.
  • --port (default: 9000) — WebSocket port.
  • --device (default: cpu) — cpu or cuda.
  • --verbose — Print incremental transcriptions to the server console as they arrive.

ASR model:

  • --model-name (default: nvidia/parakeet-tdt-0.6b-v3) — ASR model to load.
  • --chunk-secs (default: 1.0) — Chunk size in seconds (buffer strategy).
  • --left-context-secs (default: 6.0) — Left context window in seconds (buffer strategy).
  • --right-context-secs (default: 1.0) — Right context window in seconds (buffer strategy).
  • --lookahead-size (default: 1040) — Lookahead in ms (cache strategy).
  • --encoder-step-length (default: 80) — Encoder step length in ms (cache strategy).
  • --use-transcribe-method — Use NeMo's .transcribe() instead of streaming; not suitable for live audio, useful for quick testing.

Diarisation:

  • --diarisation — Enable speaker diarisation (only supported with buffer strategy).
  • --diarisation-model-name (default: nvidia/diar_streaming_sortformer_4spk-v2.1) — Diarisation model to load.
  • --diarisation-threshold (default: 0.5) — Probability threshold for marking a speaker as active.
  • --max-diarisation-preds (default: 1000) — Maximum number of frame-level predictions to keep in memory.
  • --spkcache-update-period (default: 340) — How often (in encoder steps) the speaker cache is updated.
  • --spkcache-len (default: 340) — Length of the speaker cache in encoder steps.

Client Arguments (client/capture_and_send.py)

  • --host (default: localhost) — Server hostname or IP address.
  • --port (default: 9000) — Server port.
  • --samplerate (default: 16000) — Microphone sample rate in Hz.
  • --channels (default: 1) — Number of audio channels.
  • --blocksize (default: 8000) — Number of samples per chunk sent to the server. It should match the model's expected chunk size (for example, 15360 for Parakeet and 17920 for FastConformer).
  • --device — Input device ID. Omit to use the system default.
  • --list-devices — Print all available audio input devices and exit.
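The relationship between --blocksize, --samplerate, and chunk duration is simple arithmetic: samples = sample rate × seconds. The two block sizes mentioned above work out as:

```python
def blocksize_for(seconds, samplerate=16_000):
    """Number of samples in a chunk of the given duration.

    round() rather than int() avoids off-by-one results from
    floating-point representation (e.g. 16000 * 0.96).
    """
    return round(samplerate * seconds)

assert blocksize_for(0.96) == 15360   # Parakeet, buffer strategy
assert blocksize_for(1.12) == 17920   # FastConformer, cache strategy
```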

Troubleshooting

  • Port conflicts: Make sure port 9000 is free on your host machine. Configure a different port on both the server and client if needed.
  • Wrong microphone: Run python -m client.capture_and_send --list-devices to list available inputs and pass the correct ID with --device.
  • Slow / high latency: Try reducing --left-context-secs, and optionally --chunk-secs as well, especially when processing on CPU. These suggestions apply mainly to the buffer strategy.
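For the buffer strategy, a rough lower bound on latency can be estimated from the chunk and right-context sizes alone. This is an assumption about how buffered streaming behaves (a word spoken at the start of a chunk cannot be emitted until the chunk and its right context have been captured); the server prints its own theoretical-latency figure at startup:

```python
def theoretical_latency(chunk_secs, right_context_secs):
    """Rough algorithmic latency for buffered streaming, in seconds.

    Compute time is not included, so actual latency on CPU will be higher.
    """
    return chunk_secs + right_context_secs

# Defaults: --chunk-secs 1.0 and --right-context-secs 1.0
print(theoretical_latency(1.0, 1.0))  # 2.0
```

This also shows why reducing --chunk-secs helps latency, while --left-context-secs affects only compute cost per step.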

Known Issues

  • With the current implementation of the buffer strategy and the nvidia/parakeet-tdt-0.6b-v3 model, some words or larger speech segments may occasionally be missed, causing the Stream Manager to reset its state. This happens more frequently during long periods of uninterrupted speech. Buffer parameters such as chunk size and left/right context size may need additional tuning for better performance in this case.

  • Overall, ASR quality is not ideal, and there are occasional misrecognitions, especially with quiet speech or in noisy environments. Additional tuning of the encoding or decoding parameters, or more appropriate chunk sizes, may help.

  • Diarisation quality is not ideal either, especially when speakers have similar voices or there is a lot of cross-talk; even with a single speaker, parts of the speech are sometimes attributed to multiple speakers. It may be improved by adjusting the diarisation speaker cache length and update period parameters.

License

Licensed under the Apache License, Version 2.0. See LICENSE for details.
