Hello, development team!
At the moment, I’m experimenting with giga-rnnt-v2, focusing on parallel inference of the model.
What has been done so far:
- The model sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19.tar.bz2 was downloaded from https://github.com/k2-fsa/sherpa-onnx/releases/tag/asr-models
- ONNX inference was run in Python using sherpa-onnx on both CPU and GPU, transcribing a one-minute audio file. The test used a single thread pool with 8 threads (a sketch of this setup follows the list).
- ONNX inference was also run in Go using sherpa-onnx-go on CPU. It was tested with 1 to 20 threads, using goroutines to process 1 to 100 audio samples (each ~12 seconds long) in parallel.
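For reference, here is a minimal sketch of the Python setup described above. It assumes the sherpa-onnx Python API (`OfflineRecognizer.from_transducer`) and the usual file names inside the extracted tarball; the model directory path and `test.wav` are placeholders, so adjust them to your layout.

```python
import sherpa_onnx
import soundfile as sf

# Placeholder path to the extracted model directory.
model_dir = "sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19"

recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
    encoder=f"{model_dir}/encoder.onnx",
    decoder=f"{model_dir}/decoder.onnx",
    joiner=f"{model_dir}/joiner.onnx",
    tokens=f"{model_dir}/tokens.txt",
    model_type="nemo_transducer",
    num_threads=8,   # intra-op thread pool size passed to onnxruntime
    provider="cpu",  # "cuda" for the GPU run
)

# Read a mono 16 kHz recording as float32 samples.
samples, sample_rate = sf.read("test.wav", dtype="float32")

stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
recognizer.decode_stream(stream)
print(stream.result.text)
```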
Here are some questions that came up:
- In Go, changing num_threads in the ONNX config has no visible effect on CPU utilization: it stays at 100% whether 1 or 20 threads are configured. What could be the reason?
- In Python, inference on the one-minute recording takes 7 seconds on GPU and 10 seconds on CPU (num_threads=8, single thread pool). I would expect GPU inference to be significantly faster; if that expectation is wrong, please clarify.
- What are the standard ways to increase the model's throughput at the expense of latency? (A batching sketch follows this list.)
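To make the last question concrete, here is a hedged sketch of one approach I am considering: collecting several utterances and decoding them with a single `decode_streams()` call (assuming that method behaves as in the sherpa-onnx Python examples). Each utterance then waits for the whole batch, so per-request latency grows, but the encoder runs over a padded batch at once, which should raise overall throughput. The file names are placeholders.

```python
import sherpa_onnx
import soundfile as sf

model_dir = "sherpa-onnx-nemo-transducer-giga-am-v2-russian-2025-04-19"
recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
    encoder=f"{model_dir}/encoder.onnx",
    decoder=f"{model_dir}/decoder.onnx",
    joiner=f"{model_dir}/joiner.onnx",
    tokens=f"{model_dir}/tokens.txt",
    model_type="nemo_transducer",
    num_threads=8,
    provider="cpu",
)

# Placeholder file names for the ~12 s samples mentioned above.
wav_files = [f"sample_{i:02d}.wav" for i in range(8)]

batch = []
for path in wav_files:
    samples, sample_rate = sf.read(path, dtype="float32")
    stream = recognizer.create_stream()
    stream.accept_waveform(sample_rate, samples)
    batch.append(stream)

# Decode all streams in one call instead of one decode_stream() per file:
# higher per-utterance latency, but better overall throughput.
recognizer.decode_streams(batch)
for path, stream in zip(wav_files, batch):
    print(path, "->", stream.result.text)
```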