Feature Request: Support for better offline transcription models #28

@MyButtermilk

Description

Feature Request: Add Local Transcription Support for NVIDIA Parakeet 0.6B-v3 and Canary 1B-v2 (ONNX Models)

Summary

Summarize currently relies on OpenAI Whisper (via whisper.cpp) for offline audio transcription. We propose extending Summarize to support two new open-source speech-to-text models from NVIDIA: Parakeet-TDT 0.6B-v3 and Canary 1B-v2. Both models are multilingual (covering ~25 languages) and are available in ONNX format optimized for CPU-only inference. By integrating these models alongside Whisper, users can transcribe audio locally using state-of-the-art alternatives that require no cloud API and run efficiently on CPU. This will increase Summarize's flexibility and possibly improve transcription speed/accuracy, especially for languages or scenarios where these NVIDIA models excel. Importantly, Whisper will remain the default/fallback transcriber unless a new option is explicitly set.

Model Resources

Each model has an ONNX export (and accompanying tokenizer vocabulary) on Hugging Face:

NVIDIA Parakeet-TDT 0.6B-v3 – Hugging Face repository: istupakov/parakeet-tdt-0.6b-v3-onnx (contains model.onnx and vocab.txt). This is a 600M-parameter FastConformer-TDT model for ASR (25 European languages).

NVIDIA Canary 1B-v2 – Hugging Face repository: istupakov/canary-1b-v2-onnx (contains model.onnx and vocab.txt). This is a 1B-parameter FastConformer+Transformer model for ASR and speech translation (same 25 languages).

Both repositories provide CPU-targeted ONNX files, including INT8-quantized versions for faster CPU inference. The vocab.txt files map wordpieces to IDs (needed by the inference engine). We would leverage these files for local transcription support.
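
As a rough illustration of how Summarize could fetch these files, here is a minimal sketch using the huggingface_hub package; it assumes the file names listed above and is not based on any existing Summarize code:

```python
# Hypothetical sketch: download and cache the Parakeet-TDT 0.6B-v3 ONNX export
# and its vocabulary from Hugging Face. File names follow the repository
# listing above; the Canary repository would be handled the same way.
from huggingface_hub import hf_hub_download

REPO_ID = "istupakov/parakeet-tdt-0.6b-v3-onnx"

model_path = hf_hub_download(repo_id=REPO_ID, filename="model.onnx")
vocab_path = hf_hub_download(repo_id=REPO_ID, filename="vocab.txt")
print("model:", model_path)
print("vocab:", vocab_path)
```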

Proposed Implementation

  1. ONNX Inference Integration: Incorporate ONNX Runtime (or a similar inference mechanism) into Summarize’s transcription pipeline to load and run these models. The existing audio handling and transcription pipeline used for Whisper can be reused as much as possible – e.g. keep the same audio preprocessing (ffmpeg decode, etc.) and swap out only the model-inference stage to call the ONNX model. For example, the onnx-asr library demonstrates how to load the Parakeet or Canary ONNX export and run .recognize() on a WAV file (see the inference sketch after this list). We could integrate similarly, or use the ONNX Runtime API directly in the Summarize codebase to run the model graph on an audio waveform input.

  2. CLI Flag for Model Selection: Add a new command-line flag (e.g. --transcriber) to allow users to choose the transcription model. For example: --transcriber parakeet or --transcriber canary would force Summarize to use the NVIDIA Parakeet or Canary model for transcription. If this flag is not provided (or is set to whisper), Summarize should continue using the current default Whisper transcription path. This ensures backward compatibility – by default nothing changes, and Whisper remains the fallback if the specified model isn't available.

  3. Model File Configuration: Determine how Summarize will locate the ONNX model and vocab files. Possible approaches:

Auto-download: When --transcriber parakeet is used and the model files are not present, Summarize could download them from the Hugging Face Hub (from the repositories listed above) and cache them. However, these files are large (several GB total), so auto-downloading should require user confirmation or be covered by pre-installation instructions.

User-specified Path: Alternatively, require the user to download the model files beforehand and provide a path (via an environment variable or config). For example, an environment variable similar to SUMMARIZE_WHISPER_CPP_MODEL_PATH could point to the directory containing model.onnx and vocab.txt for the chosen model. Documentation should be updated to explain how to obtain and specify these files.

  4. CPU-Only Inference: Ensure that the integration runs in CPU mode only. No GPU acceleration is needed or expected for this feature (which aligns with Summarize’s current design of using CPU-based Whisper). The ONNX models provided are optimized for CPU execution (and even offer INT8-quantized versions for efficiency). We might use ONNX Runtime with its CPU execution provider (no CUDA) to run the models. This keeps the setup simple and avoids any GPU dependency.

  5. Model Inference Pipeline: The selected model takes audio (16 kHz WAV) as input and outputs text. We should mirror how Whisper’s output is handled. Likely steps: load the ONNX model (which may internally consist of an encoder/decoder for ASR), load vocab.txt into a tokenizer or use a provided tokenizer (the NeMo models use a SentencePiece tokenizer – vocab.txt is a simple mapping), then perform transcription on the audio segments. If needed, handle long audio by chunking or streaming. (Parakeet is transducer-based and supports streaming; Canary is encoder-decoder and can transcribe full segments.) Initially, a straightforward full-audio inference is fine, similar to how whisper.cpp processes an entire file.

  6. Fallback and Compatibility: If the --transcriber flag is not used, or if the user-selected model fails to load, Summarize should gracefully fall back to the existing Whisper transcription flow (see the fallback sketch after this list). This keeps Summarize robust: for example, if a user runs --transcriber parakeet but hasn't downloaded the model, Summarize could emit a warning and revert to Whisper (or prompt the user to download the model). Whisper remains the default for all existing uses unless the user explicitly opts in to the new models.

  7. Testing: Add tests to verify that when --transcriber parakeet or --transcriber canary is used, an audio file is transcribed correctly using the new model. Compare a short sample’s transcript between Whisper and the new models to ensure the pipeline works end-to-end. Also test that omission of the flag still uses Whisper as before.
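
To make items 1, 4 and 5 concrete, here is a minimal sketch of the model-inference stage built on the onnx-asr package mentioned above. The identifier strings passed to load_model are assumptions (check the onnx-asr documentation for the names it actually registers for these models), and transcribe_with_onnx_model is a hypothetical helper, not an existing Summarize API:

```python
# Hedged sketch of the ONNX inference stage (items 1, 4 and 5 above).
# The model identifiers below are assumptions; onnx-asr may register these
# models under different names.
import onnx_asr

_ONNX_MODELS = {
    "parakeet": "nemo-parakeet-tdt-0.6b-v3",  # assumed identifier
    "canary": "nemo-canary-1b-v2",            # assumed identifier
}

def transcribe_with_onnx_model(wav_path: str, transcriber: str) -> str:
    """Transcribe a 16 kHz WAV file on CPU with the selected NVIDIA model."""
    model = onnx_asr.load_model(_ONNX_MODELS[transcriber])  # CPU-only inference
    return model.recognize(wav_path)  # .recognize() as shown in the onnx-asr examples
```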
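And a hedged sketch of the flag handling and fallback behaviour from items 2 and 6; all function names are illustrative placeholders, with transcribe_with_whisper standing in for Summarize's existing whisper.cpp path and transcribe_with_onnx_model coming from the sketch above:

```python
# Illustrative sketch of --transcriber selection with graceful fallback to
# the existing Whisper path (items 2 and 6 above). Names are placeholders.
import argparse
import logging

def transcribe_with_whisper(wav_path: str) -> str:
    """Stand-in for Summarize's current whisper.cpp transcription flow."""
    raise NotImplementedError

def transcribe(wav_path: str, transcriber: str = "whisper") -> str:
    """Dispatch to the requested engine, falling back to Whisper on failure."""
    if transcriber in ("parakeet", "canary"):
        try:
            return transcribe_with_onnx_model(wav_path, transcriber)  # see sketch above
        except Exception as exc:  # missing model files, ONNX Runtime errors, ...
            logging.warning("%s unavailable (%s); falling back to Whisper", transcriber, exc)
    return transcribe_with_whisper(wav_path)

parser = argparse.ArgumentParser()
parser.add_argument(
    "--transcriber",
    choices=["whisper", "parakeet", "canary"],
    default="whisper",  # Whisper stays the default unless explicitly overridden
)
```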

Benefits

Enhanced Offline Transcription: Users get additional local transcription engines beyond Whisper. NVIDIA’s Canary and Parakeet models are cutting-edge – e.g. Canary-1B-v2 reportedly transcribes English faster than Whisper-large-v3 while matching larger models in multilingual accuracy – so this could improve both transcription quality and speed for Summarize users without requiring any cloud services.

Multilingual Support: Whisper already supports many languages; these NVIDIA models focus on ~25 languages (primarily European). Having them as an option may yield better accuracy in those languages or faster inference on CPU due to their optimized architecture (FastConformer). Users working with European language content might benefit from trying these models for potentially more accurate transcripts or faster turnaround.

CPU-Friendly Implementation: This feature aligns with Summarize’s philosophy of local-first, CPU-only AI integration. Both Parakeet and Canary are available in CPU-optimized ONNX form, including quantized models for efficiency, so users with just a standard CPU can run them. No dedicated GPU or special hardware is required, which lowers the barrier to using advanced ASR.

Optional and Backwards-Compatible: The addition is opt-in via a flag. Existing workflows (which default to Whisper or YouTube transcripts) remain unaffected unless the user chooses to try the new models. This ensures stability for current users, while offering power users a new capability.

Conclusion

Implementing support for NVIDIA’s Parakeet 0.6B-v3 and Canary 1B-v2 models would extend Summarize's functionality by providing high-performance, offline transcription alternatives to Whisper. The models are readily available under a permissive license (CC BY-4.0) and can be integrated using ONNX Runtime on CPU. By adding a --transcriber option and reusing the existing transcription pipeline, we can make this enhancement with minimal disruption. This feature will future-proof Summarize’s transcription module, enabling it to leverage the latest open ASR models and giving users choice over which engine best suits their needs.

References:

  1. steipete/Summarize documentation – Whisper local transcription preference.

  2. Hugging Face – ONNX model export of Parakeet-TDT 0.6B-v3 and vocab file.

  3. Hugging Face – ONNX model export of Canary 1B-v2 and vocab file.

  4. NVIDIA Research – Canary-1B-v2 & Parakeet-TDT-0.6B-v3 paper (ASR models outperforming Whisper).
