feat: CUDA/TRT/CoreML execution provider + RTF benchmarks #5
Merged: wavekat-eason merged 43 commits into main on Apr 7, 2026
Conversation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ORT's default kNextPowerOfTwo doubles the GPU memory arena on each extension, causing monotonic growth across synthesis calls. Switching to SameAsRequested limits allocation to actual peak usage.
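As a sketch of how this is configured, the arena extend strategy is a per-provider option. The option name and its two values (`kSameAsRequested`, `kNextPowerOfTwo`) follow ONNX Runtime's Python API; the helper function itself is illustrative, not code from this PR:

```python
# Build CUDA execution-provider options with an explicit arena extend strategy.
# kSameAsRequested grows the arena only to the requested size; kNextPowerOfTwo
# (ORT's default) at least doubles it on each extension.
def cuda_provider_options(strategy: str = "kSameAsRequested"):
    if strategy not in ("kSameAsRequested", "kNextPowerOfTwo"):
        raise ValueError(f"unknown arena_extend_strategy: {strategy}")
    return ("CUDAExecutionProvider", {"arena_extend_strategy": strategy})

# Usage (requires the onnxruntime-gpu package and a model file):
#   import onnxruntime as ort
#   sess = ort.InferenceSession("model.onnx", providers=[cuda_provider_options()])
```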
Minimum supported Linux is Ubuntu 24.04 (glibc 2.38+).
….04 setup docs
…lder
…o Azure steps: ORT bundles libonnxruntime_providers_cuda.so but not cuBLAS/cuDNN. Azure step 2 now installs cuda-libraries-12-6 and libcudnn9-cuda-12 from the NVIDIA CUDA apt repository.
SameAsRequested caused one cudaMalloc per unique KV-cache size. After a few synthesis iterations the growing KV cache produced 100+ different-sized CUDA allocations that fragmented the virtual address space, making later contiguous allocations (e.g. a 36 MB concat buffer) fail with OOM even though total free VRAM was sufficient. NextPowerOfTwo (ORT default) doubles the arena on extension, so the same decode loop needs only ~7 cudaMalloc calls total. All KV-cache allocations come from one contiguous block, eliminating fragmentation.
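The counting argument can be sketched with a toy model of arena growth. This is not ORT's exact allocator, just the arithmetic behind "one allocation per unique size" versus "doubling":

```python
def arena_extensions(request_sizes, doubling):
    """Count how many times a growing arena must be extended (one device
    allocation per extension) to satisfy a monotonically growing request stream."""
    capacity = 0
    extensions = 0
    for size in request_sizes:
        if size > capacity:
            # Doubling at least doubles capacity; same-as-requested grows to fit.
            capacity = max(size, 2 * capacity) if doubling else size
            extensions += 1
    return extensions

# A KV cache growing by one unit per decode step, for 128 steps:
steps = list(range(1, 129))
print(arena_extensions(steps, doubling=True))   # 8: logarithmic, a few large blocks
print(arena_extensions(steps, doubling=False))  # 128: one allocation per unique size
```

With doubling, the allocation count grows as log2 of the final size, so even long decode loops stay within a handful of contiguous blocks.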
Each iteration now uses a different sentence from the pool (short: 6 variants, medium: 5, long: 5), cycling if iterations exceed pool size. CSV chars column reports actual per-iteration length; summary table shows the average.
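The selection described above can be sketched as follows; the pool contents and names are placeholders, not the benchmark's actual sentences:

```python
# Six short variants, as in the PR; real benchmark text differs.
SHORT_POOL = [f"short sentence variant {i}" for i in range(6)]

def sentence_for_iteration(pool, iteration):
    # Cycle back to the start of the pool when iterations exceed the pool size.
    return pool[iteration % len(pool)]

# The CSV logs the actual per-iteration character count; the summary averages them.
chars = [len(sentence_for_iteration(SHORT_POOL, i)) for i in range(10)]
avg_chars = sum(chars) / len(chars)
```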
Summary

- `--provider` flag; execution-provider support in the `ort/` directory
- `bench_rtf` example for real-time-factor benchmarking with CSV result logging and an auto-updated README table
- CUDA provider guide (`06-cuda-provider.md`) and benchmarking guide (`07-benchmarking.md`)
- Azure `Standard_NC4as_T4_v3`

Test plan

- `make check` passes (clippy + fmt + tests, no features)
- `make test-qwen3` passes with CPU provider
- `synthesize` example runs with `--provider cuda` on a CUDA-enabled machine
- `bench_rtf` example logs CSV and updates the README table via `make update-readme`
- `docs/06-cuda-provider.md`, `docs/07-benchmarking.md`

🤖 Generated with Claude Code
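For reference, the real-time factor (RTF) metric the benchmarks report is conventionally defined as synthesis wall-clock time divided by the duration of the generated audio; a minimal sketch (function name is ours):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of the generated audio.
    RTF < 1 means synthesis runs faster than real time."""
    return synthesis_seconds / audio_seconds

print(real_time_factor(0.5, 2.0))  # 0.25: 2 s of audio produced in 0.5 s
```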