
feat: CUDA/TRT/CoreML execution provider + RTF benchmarks #5

Merged: wavekat-eason merged 43 commits into main from feat/cuda-provider on Apr 7, 2026


wavekat-eason (Contributor) commented Apr 7, 2026

Summary

  • Add CUDA, TensorRT, and CoreML execution provider support for Qwen3-TTS via --provider flag
  • Fix ORT external-data validation by hard-linking HuggingFace Hub symlinks into a sibling .ort/ directory
  • Add bench_rtf example for real-time-factor benchmarking with CSV result logging and auto-updated README table
  • Add docs: Azure T4 CUDA setup guide (06-cuda-provider.md) and benchmarking guide (07-benchmarking.md)
  • Add benchmark results: CPU int4 (~2.0× RTF) and CUDA T4 int4 on Standard_NC4as_T4_v3
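
The hard-linking fix in the second bullet can be sketched roughly as follows. The Hugging Face Hub cache stores snapshot files as symlinks into a blob store, and the idea is to give ORT real files next to each other instead. This is only a minimal sketch under assumptions: the helper name `link_resolved_files`, its signature, and the overwrite behaviour are hypothetical and not taken from the PR, and hard links only work when source and destination are on the same filesystem.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper (not the PR's actual code): for every entry in
/// `src_dir`, resolve symlinks and hard-link the real file into `dst_dir`
/// under the entry's original file name.
pub fn link_resolved_files(src_dir: &Path, dst_dir: &Path) -> io::Result<Vec<PathBuf>> {
    fs::create_dir_all(dst_dir)?;
    let mut linked = Vec::new();
    for entry in fs::read_dir(src_dir)? {
        let entry = entry?;
        // canonicalize follows the symlink chain to the real blob on disk.
        let target = fs::canonicalize(entry.path())?;
        if !target.is_file() {
            continue; // skip subdirectories (including a nested destination)
        }
        let dst = dst_dir.join(entry.file_name());
        // Replace a stale link from a previous run (assumed behaviour).
        if fs::symlink_metadata(&dst).is_ok() {
            fs::remove_file(&dst)?;
        }
        fs::hard_link(&target, &dst)?;
        linked.push(dst);
    }
    Ok(linked)
}
```

The resulting entries in the destination directory are regular files that share storage with the cached blobs, so no model data is copied.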

Test plan

  • make check passes (clippy + fmt + tests, no features)
  • make test-qwen3 passes with CPU provider
  • synthesize example runs with --provider cuda on a CUDA-enabled machine
  • bench_rtf example logs CSV and updates README table via make update-readme
  • Docs reviewed: docs/06-cuda-provider.md, docs/07-benchmarking.md
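
For readers unfamiliar with the metric, a real-time-factor measurement of the kind bench_rtf logs might look like the sketch below. The convention used here (synthesis time divided by audio duration, so lower is faster) is an assumption, as the inverse ratio is also in common use. The function names and the CSV column layout are hypothetical and are not the example's actual output format.

```rust
use std::time::Duration;

/// RTF under the assumed convention: wall-clock synthesis time divided by
/// the duration of the generated audio. Returns infinity for empty audio.
pub fn real_time_factor(synth_time: Duration, num_samples: usize, sample_rate: u32) -> f64 {
    let audio_secs = num_samples as f64 / sample_rate as f64;
    synth_time.as_secs_f64() / audio_secs
}

/// Hypothetical CSV row: provider, input length in chars, synthesis
/// seconds, RTF. The real bench_rtf columns may differ.
pub fn csv_row(
    provider: &str,
    chars: usize,
    synth_time: Duration,
    num_samples: usize,
    sample_rate: u32,
) -> String {
    let rtf = real_time_factor(synth_time, num_samples, sample_rate);
    format!("{},{},{:.3},{:.3}", provider, chars, synth_time.as_secs_f64(), rtf)
}
```

Under this convention, 1.5 s spent producing 2 s of 24 kHz audio gives an RTF of 0.75.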

🤖 Generated with Claude Code

wavekat-eason and others added 30 commits April 7, 2026 16:16
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ORT's default kNextPowerOfTwo doubles the GPU memory arena on each
extension, causing monotonic growth across synthesis calls. Switching
to SameAsRequested limits allocation to actual peak usage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Minimum supported Linux is Ubuntu 24.04 (glibc 2.38+).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
….04 setup docs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lder

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…o Azure steps

ORT bundles libonnxruntime_providers_cuda.so but not cuBLAS/cuDNN.
Azure step 2 now installs cuda-libraries-12-6 and libcudnn9-cuda-12
from the NVIDIA CUDA apt repository.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
wavekat-eason and others added 13 commits April 7, 2026 22:53
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SameAsRequested caused one cudaMalloc per unique KV-cache size. After a
few synthesis iterations the growing KV cache produced 100+ different-sized
CUDA allocations that fragmented the virtual address space, making later
contiguous allocations (e.g. a 36 MB concat buffer) fail with OOM even
though total free VRAM was sufficient.

NextPowerOfTwo (ORT default) doubles the arena on extension, so the same
decode loop needs only ~7 cudaMalloc calls total. All KV-cache allocations
come from one contiguous block, eliminating fragmentation.
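
The difference in cudaMalloc counts can be illustrated with a toy model. This is a deliberately simplified sketch, not ORT's arena implementation: it assumes only one buffer is live at a time, that a request is served from an existing region whenever one is large enough, and that each arena extension costs exactly one device allocation. The names `ExtendStrategy` and `count_device_allocs` are hypothetical.

```rust
/// Toy stand-ins for the two arena extension strategies discussed above.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum ExtendStrategy {
    NextPowerOfTwo,
    SameAsRequested,
}

/// Count how many device allocations a sequence of buffer requests triggers
/// in the simplified model. `initial_region` is the size of the first region
/// the doubling strategy would reserve.
pub fn count_device_allocs(
    strategy: ExtendStrategy,
    requests: &[u64],
    initial_region: u64,
) -> usize {
    let mut largest_region: u64 = 0; // biggest region reserved so far
    let mut next_region: u64 = initial_region.max(1); // doubling schedule
    let mut allocs = 0;
    for &need in requests {
        if need <= largest_region {
            continue; // an existing region can be reused
        }
        match strategy {
            // Reserve exactly what was asked for: every larger request
            // forces a fresh allocation.
            ExtendStrategy::SameAsRequested => largest_region = need,
            // Reserve the next region in the doubling schedule that fits,
            // leaving headroom for the following requests.
            ExtendStrategy::NextPowerOfTwo => {
                while next_region < need {
                    next_region *= 2;
                }
                largest_region = next_region;
                next_region *= 2;
            }
        }
        allocs += 1;
    }
    allocs
}
```

For 100 requests growing from 1 MiB to 100 MiB in 1 MiB steps with a 1 MiB initial region, this model counts 100 allocations for `SameAsRequested` and 8 for `NextPowerOfTwo`. The absolute numbers depend on the model's assumptions; only the linear versus logarithmic trend is the point.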

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Each iteration now uses a different sentence from the pool
(short: 6 variants, medium: 5, long: 5), cycling if iterations
exceed pool size. CSV chars column reports actual per-iteration
length; summary table shows the average.
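
The cycling behaviour described in this commit message amounts to indexing the pool modulo its length. A minimal sketch, with hypothetical function names and with any sentence pool supplied by the caller rather than taken from the benchmark:

```rust
/// Pick the sentence for a given iteration, wrapping around when the
/// iteration count exceeds the pool size. Panics if `pool` is empty.
pub fn pick_sentence<'a>(pool: &[&'a str], iteration: usize) -> &'a str {
    pool[iteration % pool.len()]
}

/// Average character count over the sentences actually used, which is what
/// a summary table would report; each CSV row would carry the per-iteration
/// length instead.
pub fn average_chars(pool: &[&str], iterations: usize) -> f64 {
    if iterations == 0 {
        return 0.0;
    }
    let total: usize = (0..iterations)
        .map(|i| pick_sentence(pool, i).chars().count())
        .sum();
    total as f64 / iterations as f64
}
```

Counting with `chars()` rather than `len()` keeps the reported length in characters instead of bytes, which matters for non-ASCII input.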

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
wavekat-eason merged commit c3750c4 into main Apr 7, 2026
wavekat-eason deleted the feat/cuda-provider branch April 7, 2026 19:57