feat: CUDA/TRT/CoreML execution provider + RTF benchmarks #5
Merged: wavekat-eason merged 43 commits into main on Apr 7, 2026
Conversation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ORT's default kNextPowerOfTwo doubles the GPU memory arena on each extension, causing monotonic growth across synthesis calls. Switching to SameAsRequested limits allocation to actual peak usage.
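As a sketch of how this is configured, the arena extend strategy is a per-provider option. The option name and its two values (`kSameAsRequested`, `kNextPowerOfTwo`) follow ONNX Runtime's Python API; the helper function itself is illustrative, not code from this PR:

```python
# Build CUDA execution-provider options with an explicit arena extend strategy.
# kSameAsRequested grows the arena only to the requested size; kNextPowerOfTwo
# (ORT's default) at least doubles it on each extension.
def cuda_provider_options(strategy: str = "kSameAsRequested"):
    if strategy not in ("kSameAsRequested", "kNextPowerOfTwo"):
        raise ValueError(f"unknown arena_extend_strategy: {strategy}")
    return ("CUDAExecutionProvider", {"arena_extend_strategy": strategy})

# Usage (requires the onnxruntime-gpu package and a model file):
#   import onnxruntime as ort
#   sess = ort.InferenceSession("model.onnx", providers=[cuda_provider_options()])
```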
Minimum supported Linux is Ubuntu 24.04 (glibc 2.38+).
….04 setup docs
…lder
…o Azure steps: ORT bundles libonnxruntime_providers_cuda.so but not cuBLAS/cuDNN. Azure step 2 now installs cuda-libraries-12-6 and libcudnn9-cuda-12 from the NVIDIA CUDA apt repository.
SameAsRequested caused one cudaMalloc per unique KV-cache size. After a few synthesis iterations the growing KV cache produced 100+ different-sized CUDA allocations that fragmented the virtual address space, making later contiguous allocations (e.g. a 36 MB concat buffer) fail with OOM even though total free VRAM was sufficient. NextPowerOfTwo (ORT default) doubles the arena on extension, so the same decode loop needs only ~7 cudaMalloc calls total. All KV-cache allocations come from one contiguous block, eliminating fragmentation.
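The counting argument can be sketched with a toy model of arena growth. This is not ORT's exact allocator, just the arithmetic behind "one allocation per unique size" versus "doubling":

```python
def arena_extensions(request_sizes, doubling):
    """Count how many times a growing arena must be extended (one device
    allocation per extension) to satisfy a monotonically growing request stream."""
    capacity = 0
    extensions = 0
    for size in request_sizes:
        if size > capacity:
            # Doubling at least doubles capacity; same-as-requested grows to fit.
            capacity = max(size, 2 * capacity) if doubling else size
            extensions += 1
    return extensions

# A KV cache growing by one unit per decode step, for 128 steps:
steps = list(range(1, 129))
print(arena_extensions(steps, doubling=True))   # 8: logarithmic, a few large blocks
print(arena_extensions(steps, doubling=False))  # 128: one allocation per unique size
```

With doubling, the allocation count grows as log2 of the final size, so even long decode loops stay within a handful of contiguous blocks.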
Each iteration now uses a different sentence from the pool (short: 6 variants, medium: 5, long: 5), cycling if iterations exceed pool size. CSV chars column reports actual per-iteration length; summary table shows the average.
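The selection described above can be sketched as follows; the pool contents and names are placeholders, not the benchmark's actual sentences:

```python
# Six short variants, as in the PR; real benchmark text differs.
SHORT_POOL = [f"short sentence variant {i}" for i in range(6)]

def sentence_for_iteration(pool, iteration):
    # Cycle back to the start of the pool when iterations exceed the pool size.
    return pool[iteration % len(pool)]

# The CSV logs the actual per-iteration character count; the summary averages them.
chars = [len(sentence_for_iteration(SHORT_POOL, i)) for i in range(10)]
avg_chars = sum(chars) / len(chars)
```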
Summary

- `--provider` flag; execution-provider support in the `ort/` directory
- `bench_rtf` example for real-time-factor benchmarking with CSV result logging and an auto-updated README table
- CUDA provider guide (`06-cuda-provider.md`) and benchmarking guide (`07-benchmarking.md`)
- Azure `Standard_NC4as_T4_v3`

Test plan

- `make check` passes (clippy + fmt + tests, no features)
- `make test-qwen3` passes with CPU provider
- `synthesize` example runs with `--provider cuda` on a CUDA-enabled machine
- `bench_rtf` example logs CSV and updates the README table via `make update-readme`
- `docs/06-cuda-provider.md`, `docs/07-benchmarking.md`

🤖 Generated with Claude Code
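For reference, the real-time factor (RTF) metric the benchmarks report is conventionally defined as synthesis wall-clock time divided by the duration of the generated audio; a minimal sketch (function name is ours):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of the generated audio.
    RTF < 1 means synthesis runs faster than real time."""
    return synthesis_seconds / audio_seconds

print(real_time_factor(0.5, 2.0))  # 0.25: 2 s of audio produced in 0.5 s
```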