
Add onnx gpu support and benchmarking suite for NeuTTS-Air.#60

Open
NemesisGuy wants to merge 16 commits into neuphonic:main from NemesisGuy:add-onnx-gpu-support

Conversation

NemesisGuy commented Oct 21, 2025

Pull Request: ONNX GPU Support

Summary

  • Extend NeuTTSAir to auto-select CUDA/MPS/CPU for the backbone and ONNX codec, configuring CUDA, DirectML, or ROCm providers when present and falling back to CPU with clear warnings when unavailable (see the sketch after this list).
  • Document the new GPU workflow in README.md, examples/README.md, and the freshly added examples/onnx_example_gpu.py; ship requirements-gpu.txt for quick GPU setup.
  • Publish the new benchmarking suite (CLI plus artifacts) and introduce tests/test_device_selection.py to cover device routing and ONNX provider selection so regressions surface quickly.
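
For readers who want a feel for what that auto-selection involves, here is a minimal sketch under stated assumptions: the helper names `pick_torch_device` and `pick_onnx_providers` are placeholders for illustration, not the functions actually added in this branch.

```python
import warnings

import onnxruntime as ort
import torch


def pick_torch_device(requested: str = "auto") -> str:
    """Resolve the backbone device: prefer CUDA, then MPS, then CPU."""
    if requested != "auto":
        return requested
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


def pick_onnx_providers(device: str) -> list[str]:
    """Choose ONNX Runtime providers for the codec, always keeping CPU as a fallback."""
    available = ort.get_available_providers()
    preferred = {
        "cuda": ["CUDAExecutionProvider", "ROCMExecutionProvider", "DmlExecutionProvider"],
        "mps": ["CoreMLExecutionProvider"],
    }.get(device, [])
    providers = [p for p in preferred if p in available]
    if not providers and device != "cpu":
        warnings.warn(f"No GPU execution provider available for '{device}'; falling back to CPU.")
    return providers + ["CPUExecutionProvider"]
```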

Benchmarks

Windows 11 · NVIDIA GeForce GTX 1080 Ti

| Backbone Repo | Backbone Device | Codec Repo | Codec Device | Providers | Runs | Load (s) | Infer (s) | Total (s) | RTF | RAM (MB) | VRAM (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| neuphonic/neutts-air | cpu | neuphonic/neucodec-onnx-decoder | cpu | CPUExecutionProvider | 3 | 15.80 ± 22.34 | 44.23 ± 3.63 | 60.02 ± 23.28 | 12.31 ± 0.97 | 1,024 ± 1,437 | 0 |
| neuphonic/neutts-air | cpu | neuphonic/neucodec-onnx-decoder | cuda | CUDAExecutionProvider, CPUExecutionProvider | 3 | 19.57 ± 27.68 | 41.51 ± 2.25 | 61.08 ± 28.12 | 12.45 ± 0.04 | 317 ± 444 | 0 |
| neuphonic/neutts-air | cuda | neuphonic/neucodec-onnx-decoder | cpu | CPUExecutionProvider | 3 | 24.40 ± 34.51 | 7.89 ± 0.51 | 32.30 ± 34.00 | 2.87 ± 0.02 | 1.3 ± 0.6 | 2,890 ± 994 |
| neuphonic/neutts-air | cuda | neuphonic/neucodec-onnx-decoder | cuda | CUDAExecutionProvider, CPUExecutionProvider | 3 | 31.05 ± 43.91 | 7.92 ± 0.94 | 38.97 ± 43.50 | 2.48 ± 0.02 | 47 ± 64 | 2,891 ± 995 |
| neuphonic/neutts-air-q4-gguf | cpu | neuphonic/neucodec-onnx-decoder | cpu | CPUExecutionProvider | 3 | 1.61 ± 2.28 | 3.60 ± 0.20 | 5.20 ± 2.47 | 1.44 ± 0.07 | 394 ± 556 | 8.1 |
| neuphonic/neutts-air-q4-gguf | cpu | neuphonic/neucodec-onnx-decoder | cuda | CUDAExecutionProvider, CPUExecutionProvider | 3 | 1.28 ± 1.82 | 3.21 ± 0.02 | 4.50 ± 1.83 | 1.29 ± 0.01 | 151 ± 213 | 8.1 |
| neuphonic/neutts-air-q4-gguf | cuda | neuphonic/neucodec-onnx-decoder | cpu | CPUExecutionProvider | 3 | 1.26 ± 1.79 | 3.74 ± 0.59 | 5.01 ± 2.23 | 1.40 ± 0.03 | 393 ± 555 | 8.1 |
| neuphonic/neutts-air-q4-gguf | cuda | neuphonic/neucodec-onnx-decoder | cuda | CUDAExecutionProvider, CPUExecutionProvider | 3 | 1.27 ± 1.79 | 3.63 ± 0.50 | 4.90 ± 2.17 | 1.37 ± 0.06 | 152 ± 214 | 8.1 |
| neuphonic/neutts-air-q8-gguf | cpu | neuphonic/neucodec-onnx-decoder | cpu | CPUExecutionProvider | 3 | 1.41 ± 1.99 | 7.15 ± 1.66 | 8.55 ± 2.28 | 1.81 ± 0.08 | 273 ± 373 | 8.1 |
| neuphonic/neutts-air-q8-gguf | cpu | neuphonic/neucodec-onnx-decoder | cuda | CUDAExecutionProvider, CPUExecutionProvider | 3 | 1.37 ± 1.93 | 6.65 ± 1.40 | 8.02 ± 1.96 | 1.69 ± 0.03 | 30 ± 40 | 8.1 |
| neuphonic/neutts-air-q8-gguf | cuda | neuphonic/neucodec-onnx-decoder | cpu | CPUExecutionProvider | 3 | 1.33 ± 1.87 | 5.12 ± 0.74 | 6.44 ± 2.60 | 1.75 ± 0.02 | 268 ± 378 | 8.1 |
| neuphonic/neutts-air-q8-gguf | cuda | neuphonic/neucodec-onnx-decoder | cuda | CUDAExecutionProvider, CPUExecutionProvider | 3 | 1.29 ± 1.83 | 5.00 ± 0.90 | 6.30 ± 2.72 | 1.70 ± 0.04 | 40 ± 56 | 8.1 |

Testing

  • `pytest tests/test_device_selection.py` (a sketch of the kind of check this file covers follows the notes below)

- Added benchmarking utilities in `neuttsair/benchmark.py` and CLI for profiling ONNX providers.
- Updated `README.md` and `examples/README.md` with new benchmarking instructions and device selection options.
- Modified `basic_example.py` to support device arguments for backbone and codec.
- Updated `CHANGELOG.md` to reflect new features and changes.
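
As an illustration of what a device-routing test can look like, here is a self-contained pytest sketch; `resolve_device` is a stand-in helper defined inline for the example, not the actual API exercised by `tests/test_device_selection.py`.

```python
# Hypothetical sketch of a device-routing test; `resolve_device` is defined
# inline here and is not the helper the branch actually ships.
import torch


def resolve_device(requested: str = "auto") -> str:
    if requested != "auto":
        return requested
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


def test_explicit_device_is_respected():
    assert resolve_device("cpu") == "cpu"


def test_auto_falls_back_to_cpu(monkeypatch):
    # Pretend no accelerator is present and confirm the CPU fallback.
    monkeypatch.setattr(torch.cuda, "is_available", lambda: False)
    monkeypatch.setattr(torch.backends.mps, "is_available", lambda: False)
    assert resolve_device("auto") == "cpu"
```
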
@ThomasChan06

Will this work with Mac? I have an M1 Pro with a 10-core CPU, 16-core GPU, and 16 GB RAM. Right now I have it set up so it chunks and separates the text, then stitches the pieces back together, all in the terminal, but it takes SO long to generate. If I need something that's like an hour long, I do it overnight.

@NemesisGuy (Author)

Hey @ThomasChan06 👋
Thanks for checking it out! The auto device selection in this branch should detect MPS automatically on Apple Silicon (M1/M2/M3/M4). If `torch.backends.mps.is_available()` returns True, it'll use MPS; otherwise it'll fall back to CPU.
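
If you want to see what your machine reports before running anything heavy, this quick standalone snippet (plain PyTorch, nothing specific to this branch) shows whether MPS is visible and usable:

```python
import torch

# Report whether the Apple GPU backend is built and available.
print("MPS built:", torch.backends.mps.is_built())
print("MPS available:", torch.backends.mps.is_available())

if torch.backends.mps.is_available():
    # Allocate a tiny tensor on the GPU to confirm it actually works.
    x = torch.ones(3, device="mps")
    print("Test tensor lives on:", x.device)  # expected: mps:0
```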

I don’t have a Mac to test directly, so if you’re able to try running it on your setup, that’d be super helpful 🙏.

Once it’s running, you can also try the benchmark script — it’ll help you see whether MPS or CPU performs better for your specific hardware. M1/M2 chips sometimes behave differently depending on tensor precision (fp16 vs fp32) and task type, so this can help find the sweet spot for long jobs.

Would be awesome if you could share your results after running a short test!


mgc8 commented Nov 2, 2025

@NemesisGuy I tried testing this on an M3 Mac, but the benchmark files (both under `neuttsair/` and `examples/`) are missing:

```console
$ python -m examples.provider_benchmark \
       --input_text "Benchmarking NeuTTS Air" \
       --ref_codes samples/dave.pt \
       --ref_text samples/dave.txt \
       --runs 3 \
       --output benchmark_results.json
./.venv/bin/python: No module named examples.provider_benchmark
```

They are also missing from the PR's file list as well as from your fork: https://github.com/neuphonic/neutts-air/pull/60/files

Apart from that, I can confirm that the code correctly chooses the mps backend for the backbone, and it's significantly faster (e.g. the streaming example now runs almost in real time). There is no onnxruntime-gpu package for Apple Silicon (or at least not under that name; I found https://github.com/cansik/onnxruntime-silicon, but it's not on PyPI), though the code falls back to CPU correctly.
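
For anyone checking their own setup, the standard onnxruntime API will list which execution providers the installed build actually exposes:

```python
import onnxruntime as ort

# On Apple Silicon this typically prints something like
# ['CoreMLExecutionProvider', 'CPUExecutionProvider'],
# confirming there is no CUDA provider available to pick.
print(ort.get_available_providers())
```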


mgc8 commented Nov 3, 2025

Thanks for adding the missing files! Here is the benchmark on an Apple Silicon M3 Max (16" MBP), running the standard benchmark script as listed above:

System information:

OS: macOS-15.5-arm64-arm-64bit-Mach-O
CPU: arm (16 physical / 16 logical)
RAM: 128.0 GB
GPUs: none detected

Benchmark summary (mean ± standard deviation):

| Backbone Repo | Backbone Device | Codec Repo | Codec Device | Providers | Runs | Load (s) | Infer (s) | Total (s) | RTF | RAM (MB) | VRAM (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| neuphonic/neutts-air | auto | neuphonic/neucodec-onnx-decoder | auto | CoreMLExecutionProvider, CPUExecutionProvider | 3 | 7.891 ± 11.160 | 3.648 ± 0.877 | 11.539 ± 11.057 | 1.177 ± 0.030 | 801.958 ± 876.850 | n/a |
| neuphonic/neutts-air | auto | neuphonic/neucodec-onnx-decoder | coreml | CoreMLExecutionProvider, CPUExecutionProvider | 3 | 7.761 ± 10.976 | 3.304 ± 0.120 | 11.065 ± 10.973 | 1.145 ± 0.006 | 159.198 ± 173.210 | n/a |
| neuphonic/neutts-air | auto | neuphonic/neucodec-onnx-decoder | cpu | CPUExecutionProvider | 3 | 7.024 ± 9.933 | 3.355 ± 0.627 | 10.379 ± 9.720 | 1.039 ± 0.002 | 22.938 ± 20.870 | n/a |

Here is the JSON file as well:
benchmark_results.json

These are the results with `--backbone neuphonic/neutts-air-q[4|8]-gguf`, which are much faster:

| Backbone Repo | Backbone Device | Codec Repo | Codec Device | Providers | Runs | Load (s) | Infer (s) | Total (s) | RTF | RAM (MB) | VRAM (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| neuphonic/neutts-air-q4-gguf | auto | neuphonic/neucodec-onnx-decoder | auto | CoreMLExecutionProvider, CPUExecutionProvider | 3 | 1.888 ± 2.671 | 4.264 ± 0.015 | 6.152 ± 2.672 | 1.583 ± 0.002 | 976.849 ± 1143.486 | n/a |
| neuphonic/neutts-air-q4-gguf | auto | neuphonic/neucodec-onnx-decoder | coreml | CoreMLExecutionProvider, CPUExecutionProvider | 3 | 1.410 ± 1.994 | 4.288 ± 0.018 | 5.698 ± 2.012 | 1.592 ± 0.007 | 441.885 ± 562.543 | n/a |
| neuphonic/neutts-air-q4-gguf | auto | neuphonic/neucodec-onnx-decoder | cpu | CPUExecutionProvider | 3 | 0.713 ± 1.008 | 4.027 ± 0.029 | 4.740 ± 0.980 | 1.495 ± 0.013 | 380.547 ± 538.141 | n/a |
| neuphonic/neutts-air-q8-gguf | auto | neuphonic/neucodec-onnx-decoder | auto | CoreMLExecutionProvider, CPUExecutionProvider | 3 | 1.632 ± 2.309 | 5.568 ± 0.436 | 7.201 ± 2.183 | 1.638 ± 0.003 | 1078.354 ± 1222.825 | n/a |
| neuphonic/neutts-air-q8-gguf | auto | neuphonic/neucodec-onnx-decoder | coreml | CoreMLExecutionProvider, CPUExecutionProvider | 3 | 1.442 ± 2.039 | 5.600 ± 0.431 | 7.041 ± 1.910 | 1.647 ± 0.007 | 454.708 ± 623.731 | n/a |
| neuphonic/neutts-air-q8-gguf | auto | neuphonic/neucodec-onnx-decoder | cpu | CPUExecutionProvider | 3 | 0.778 ± 1.100 | 5.343 ± 0.416 | 6.121 ± 0.988 | 1.572 ± 0.012 | 387.135 ± 547.349 | n/a |

It seems that with this version, part of the auto-detection fails: the MPS backend is not used at all for the "backbone", which defaults to CPU due to "CUDA unavailable" despite MPS being perfectly fine. If I can run any other tests, please let me know.
