
Add onnx gpu support and benchmarking suite for NeuTTS-Air.#60

Open
NemesisGuy wants to merge 16 commits into neuphonic:main from NemesisGuy:add-onnx-gpu-support

Conversation

NemesisGuy commented Oct 21, 2025

Pull Request: ONNX GPU Support

Summary

  • Extend NeuTTSAir to auto-select CUDA/MPS/CPU for the backbone and ONNX codec, configuring CUDA, DirectML, or ROCm providers when present and falling back to CPU with clear warnings when unavailable (see the sketch after this list).
  • Document the new GPU workflow in README.md, examples/README.md, and the freshly added examples/onnx_example_gpu.py; ship requirements-gpu.txt for quick GPU setup.
  • Publish the new benchmarking suite (CLI plus artifacts) and introduce tests/test_device_selection.py to cover device routing and ONNX provider selection so regressions surface quickly.
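
For readers who want a feel for what that auto-selection involves, here is a minimal sketch under stated assumptions: the helper names `pick_torch_device` and `pick_onnx_providers` are placeholders for illustration, not the functions actually added in this branch.

```python
import warnings

import onnxruntime as ort
import torch


def pick_torch_device(requested: str = "auto") -> str:
    """Resolve the backbone device: prefer CUDA, then MPS, then CPU."""
    if requested != "auto":
        return requested
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


def pick_onnx_providers(device: str) -> list[str]:
    """Choose ONNX Runtime providers for the codec, always keeping CPU as a fallback."""
    available = ort.get_available_providers()
    preferred = {
        "cuda": ["CUDAExecutionProvider", "ROCMExecutionProvider", "DmlExecutionProvider"],
        "mps": ["CoreMLExecutionProvider"],
    }.get(device, [])
    providers = [p for p in preferred if p in available]
    if not providers and device != "cpu":
        warnings.warn(f"No GPU execution provider available for '{device}'; falling back to CPU.")
    return providers + ["CPUExecutionProvider"]
```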

Benchmarks

Windows 11 · NVIDIA GeForce GTX 1080 Ti

| Backbone Repo | Backbone Device | Codec Repo | Codec Device | Providers | Runs | Load (s) | Infer (s) | Total (s) | RTF | RAM (MB) | VRAM (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| neuphonic/neutts-air | cpu | neuphonic/neucodec-onnx-decoder | cpu | CPUExecutionProvider | 3 | 15.80 ± 22.34 | 44.23 ± 3.63 | 60.02 ± 23.28 | 12.31 ± 0.97 | 1,024 ± 1,437 | 0 |
| neuphonic/neutts-air | cpu | neuphonic/neucodec-onnx-decoder | cuda | CUDAExecutionProvider, CPUExecutionProvider | 3 | 19.57 ± 27.68 | 41.51 ± 2.25 | 61.08 ± 28.12 | 12.45 ± 0.04 | 317 ± 444 | 0 |
| neuphonic/neutts-air | cuda | neuphonic/neucodec-onnx-decoder | cpu | CPUExecutionProvider | 3 | 24.40 ± 34.51 | 7.89 ± 0.51 | 32.30 ± 34.00 | 2.87 ± 0.02 | 1.3 ± 0.6 | 2,890 ± 994 |
| neuphonic/neutts-air | cuda | neuphonic/neucodec-onnx-decoder | cuda | CUDAExecutionProvider, CPUExecutionProvider | 3 | 31.05 ± 43.91 | 7.92 ± 0.94 | 38.97 ± 43.50 | 2.48 ± 0.02 | 47 ± 64 | 2,891 ± 995 |
| neuphonic/neutts-air-q4-gguf | cpu | neuphonic/neucodec-onnx-decoder | cpu | CPUExecutionProvider | 3 | 1.61 ± 2.28 | 3.60 ± 0.20 | 5.20 ± 2.47 | 1.44 ± 0.07 | 394 ± 556 | 8.1 |
| neuphonic/neutts-air-q4-gguf | cpu | neuphonic/neucodec-onnx-decoder | cuda | CUDAExecutionProvider, CPUExecutionProvider | 3 | 1.28 ± 1.82 | 3.21 ± 0.02 | 4.50 ± 1.83 | 1.29 ± 0.01 | 151 ± 213 | 8.1 |
| neuphonic/neutts-air-q4-gguf | cuda | neuphonic/neucodec-onnx-decoder | cpu | CPUExecutionProvider | 3 | 1.26 ± 1.79 | 3.74 ± 0.59 | 5.01 ± 2.23 | 1.40 ± 0.03 | 393 ± 555 | 8.1 |
| neuphonic/neutts-air-q4-gguf | cuda | neuphonic/neucodec-onnx-decoder | cuda | CUDAExecutionProvider, CPUExecutionProvider | 3 | 1.27 ± 1.79 | 3.63 ± 0.50 | 4.90 ± 2.17 | 1.37 ± 0.06 | 152 ± 214 | 8.1 |
| neuphonic/neutts-air-q8-gguf | cpu | neuphonic/neucodec-onnx-decoder | cpu | CPUExecutionProvider | 3 | 1.41 ± 1.99 | 7.15 ± 1.66 | 8.55 ± 2.28 | 1.81 ± 0.08 | 273 ± 373 | 8.1 |
| neuphonic/neutts-air-q8-gguf | cpu | neuphonic/neucodec-onnx-decoder | cuda | CUDAExecutionProvider, CPUExecutionProvider | 3 | 1.37 ± 1.93 | 6.65 ± 1.40 | 8.02 ± 1.96 | 1.69 ± 0.03 | 30 ± 40 | 8.1 |
| neuphonic/neutts-air-q8-gguf | cuda | neuphonic/neucodec-onnx-decoder | cpu | CPUExecutionProvider | 3 | 1.33 ± 1.87 | 5.12 ± 0.74 | 6.44 ± 2.60 | 1.75 ± 0.02 | 268 ± 378 | 8.1 |
| neuphonic/neutts-air-q8-gguf | cuda | neuphonic/neucodec-onnx-decoder | cuda | CUDAExecutionProvider, CPUExecutionProvider | 3 | 1.29 ± 1.83 | 5.00 ± 0.90 | 6.30 ± 2.72 | 1.70 ± 0.04 | 40 ± 56 | 8.1 |

Testing

  • `pytest tests/test_device_selection.py` (a sketch of the kind of check this file covers follows the notes below)

- Added benchmarking utilities in `neuttsair/benchmark.py` and CLI for profiling ONNX providers.
- Updated `README.md` and `examples/README.md` with new benchmarking instructions and device selection options.
- Modified `basic_example.py` to support device arguments for backbone and codec.
- Updated `CHANGELOG.md` to reflect new features and changes.
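
As an illustration of what a device-routing test can look like, here is a self-contained pytest sketch; `resolve_device` is a stand-in helper defined inline for the example, not the actual API exercised by `tests/test_device_selection.py`.

```python
# Hypothetical sketch of a device-routing test; `resolve_device` is defined
# inline here and is not the helper the branch actually ships.
import torch


def resolve_device(requested: str = "auto") -> str:
    if requested != "auto":
        return requested
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


def test_explicit_device_is_respected():
    assert resolve_device("cpu") == "cpu"


def test_auto_falls_back_to_cpu(monkeypatch):
    # Pretend no accelerator is present and confirm the CPU fallback.
    monkeypatch.setattr(torch.cuda, "is_available", lambda: False)
    monkeypatch.setattr(torch.backends.mps, "is_available", lambda: False)
    assert resolve_device("auto") == "cpu"
```
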
@ThomasChan06

Will this work with Mac? I have an M1 Pro with a 10-core CPU, 16-core GPU, and 16 GB RAM. Right now I have it set up so it chunks and separates the text, then stitches the pieces back together, all in the terminal, but it takes SO long to generate. If I need something that's like an hour long, I do it overnight.

@NemesisGuy (Author)

Hey @ThomasChan06 👋
Thanks for checking it out! The auto device selection in this branch should detect MPS automatically on Apple Silicon (M1/M2/M3/M4). If `torch.backends.mps.is_available()` returns True, it'll use MPS; otherwise it'll fall back to CPU.
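
If you want to see what your machine reports before running anything heavy, this quick standalone snippet (plain PyTorch, nothing specific to this branch) shows whether MPS is visible and usable:

```python
import torch

# Report whether the Apple GPU backend is built and available.
print("MPS built:", torch.backends.mps.is_built())
print("MPS available:", torch.backends.mps.is_available())

if torch.backends.mps.is_available():
    # Allocate a tiny tensor on the GPU to confirm it actually works.
    x = torch.ones(3, device="mps")
    print("Test tensor lives on:", x.device)  # expected: mps:0
```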

I don’t have a Mac to test directly, so if you’re able to try running it on your setup, that’d be super helpful 🙏.

Once it’s running, you can also try the benchmark script — it’ll help you see whether MPS or CPU performs better for your specific hardware. M1/M2 chips sometimes behave differently depending on tensor precision (fp16 vs fp32) and task type, so this can help find the sweet spot for long jobs.

Would be awesome if you could share your results after running a short test!


mgc8 commented Nov 2, 2025

@NemesisGuy I tried testing this on an M3 Mac, but the benchmark files (both under `neuttsair/` and `examples/`) are missing:

```console
$ python -m examples.provider_benchmark \
       --input_text "Benchmarking NeuTTS Air" \
       --ref_codes samples/dave.pt \
       --ref_text samples/dave.txt \
       --runs 3 \
       --output benchmark_results.json
./.venv/bin/python: No module named examples.provider_benchmark
```

They are also missing from the PR's file list as well as from your fork: https://github.com/neuphonic/neutts-air/pull/60/files

Apart from that, I can confirm that the code correctly chooses the mps backend for the backbone, and it's significantly faster (e.g. the streaming example now runs almost in real time). There is no onnxruntime-gpu package for Apple Silicon (or at least not under that name; I found https://github.com/cansik/onnxruntime-silicon, but it's not on PyPI), though the code falls back to CPU correctly.
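
For anyone checking their own setup, the standard onnxruntime API will list which execution providers the installed build actually exposes:

```python
import onnxruntime as ort

# On Apple Silicon this typically prints something like
# ['CoreMLExecutionProvider', 'CPUExecutionProvider'],
# confirming there is no CUDA provider available to pick.
print(ort.get_available_providers())
```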


mgc8 commented Nov 3, 2025

Thanks for adding the missing files! Here is the benchmark on an Apple Silicon M3 Max (16" MBP), running the standard benchmark script as listed above:

System information:

OS: macOS-15.5-arm64-arm-64bit-Mach-O
CPU: arm (16 physical / 16 logical)
RAM: 128.0 GB
GPUs: none detected

Benchmark summary (mean ± standard deviation):

| Backbone Repo | Backbone Device | Codec Repo | Codec Device | Providers | Runs | Load (s) | Infer (s) | Total (s) | RTF | RAM (MB) | VRAM (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| neuphonic/neutts-air | auto | neuphonic/neucodec-onnx-decoder | auto | CoreMLExecutionProvider, CPUExecutionProvider | 3 | 7.891 ± 11.160 | 3.648 ± 0.877 | 11.539 ± 11.057 | 1.177 ± 0.030 | 801.958 ± 876.850 | n/a |
| neuphonic/neutts-air | auto | neuphonic/neucodec-onnx-decoder | coreml | CoreMLExecutionProvider, CPUExecutionProvider | 3 | 7.761 ± 10.976 | 3.304 ± 0.120 | 11.065 ± 10.973 | 1.145 ± 0.006 | 159.198 ± 173.210 | n/a |
| neuphonic/neutts-air | auto | neuphonic/neucodec-onnx-decoder | cpu | CPUExecutionProvider | 3 | 7.024 ± 9.933 | 3.355 ± 0.627 | 10.379 ± 9.720 | 1.039 ± 0.002 | 22.938 ± 20.870 | n/a |

Here is the JSON file as well:
benchmark_results.json

These are the results with `--backbone neuphonic/neutts-air-q[4|8]-gguf`, which are much faster:

| Backbone Repo | Backbone Device | Codec Repo | Codec Device | Providers | Runs | Load (s) | Infer (s) | Total (s) | RTF | RAM (MB) | VRAM (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| neuphonic/neutts-air-q4-gguf | auto | neuphonic/neucodec-onnx-decoder | auto | CoreMLExecutionProvider, CPUExecutionProvider | 3 | 1.888 ± 2.671 | 4.264 ± 0.015 | 6.152 ± 2.672 | 1.583 ± 0.002 | 976.849 ± 1143.486 | n/a |
| neuphonic/neutts-air-q4-gguf | auto | neuphonic/neucodec-onnx-decoder | coreml | CoreMLExecutionProvider, CPUExecutionProvider | 3 | 1.410 ± 1.994 | 4.288 ± 0.018 | 5.698 ± 2.012 | 1.592 ± 0.007 | 441.885 ± 562.543 | n/a |
| neuphonic/neutts-air-q4-gguf | auto | neuphonic/neucodec-onnx-decoder | cpu | CPUExecutionProvider | 3 | 0.713 ± 1.008 | 4.027 ± 0.029 | 4.740 ± 0.980 | 1.495 ± 0.013 | 380.547 ± 538.141 | n/a |
| neuphonic/neutts-air-q8-gguf | auto | neuphonic/neucodec-onnx-decoder | auto | CoreMLExecutionProvider, CPUExecutionProvider | 3 | 1.632 ± 2.309 | 5.568 ± 0.436 | 7.201 ± 2.183 | 1.638 ± 0.003 | 1078.354 ± 1222.825 | n/a |
| neuphonic/neutts-air-q8-gguf | auto | neuphonic/neucodec-onnx-decoder | coreml | CoreMLExecutionProvider, CPUExecutionProvider | 3 | 1.442 ± 2.039 | 5.600 ± 0.431 | 7.041 ± 1.910 | 1.647 ± 0.007 | 454.708 ± 623.731 | n/a |
| neuphonic/neutts-air-q8-gguf | auto | neuphonic/neucodec-onnx-decoder | cpu | CPUExecutionProvider | 3 | 0.778 ± 1.100 | 5.343 ± 0.416 | 6.121 ± 0.988 | 1.572 ± 0.012 | 387.135 ± 547.349 | n/a |

It seems that with this version, part of the auto-detection fails: the MPS backend is not used at all for the "backbone", which defaults to CPU due to "CUDA unavailable" despite MPS being perfectly fine. If I can run any other tests, please let me know.
