[Feat] Create client for TTS and evaluation for audio requests #175
Conversation
📝 Walkthrough

Introduces comprehensive Text-to-Speech (TTS) client support to the benchmarking framework, including a new streaming TTS client supporting multiple providers, configuration management with provider-specific validation, performance evaluation with audio-metrics tracking, and artifact persistence.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User as Benchmark Harness
    participant TTS as TTSClient
    participant Provider as TTS Provider<br/>(API)
    participant Evaluator as AudioPerformanceEvaluator
    participant Storage as File System
    User->>TTS: send_request(text_request)
    activate TTS
    TTS->>TTS: _build_payload(text)
    TTS->>Provider: POST /synthesize (streaming)
    activate Provider
    Provider-->>TTS: audio chunk 1
    TTS->>TTS: collect chunk, measure ttfa
    Provider-->>TTS: audio chunk 2
    TTS->>TTS: collect chunk
    Provider-->>TTS: audio chunk N
    deactivate Provider
    TTS->>TTS: aggregate bytes, calculate metrics<br/>(duration, rtf, tokens)
    TTS-->>User: RequestResult(audio_channel, metrics)
    deactivate TTS
    User->>Evaluator: record_request_completed(response)
    activate Evaluator
    Evaluator->>Evaluator: extract AUDIO metrics<br/>from response
    Evaluator->>Evaluator: update CDF sketches<br/>(ttfa, latency, duration, rtf)
    Evaluator->>Storage: cache audio buffer<br/>(if save_audio_files)
    deactivate Evaluator
    User->>Evaluator: record_session_completed()
    activate Evaluator
    Evaluator->>Evaluator: update session metrics
    deactivate Evaluator
    User->>Evaluator: finalize()
    activate Evaluator
    Evaluator->>Storage: save JSONL metrics
    Evaluator->>Storage: save CSV per metric
    Evaluator->>Storage: generate CDF plots
    Evaluator->>Storage: save WAV files<br/>(if enabled)
    Evaluator-->>User: EvaluationResult
    deactivate Evaluator
```
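The client-side flow in the diagram can be sketched as a small metrics helper. This is a hedged sketch, not the PR's implementation: the function names (`compute_audio_metrics`, `consume_stream`) and the raw 16-bit mono PCM assumption for computing audio duration are illustrative; the real client would decode whatever container the provider streams.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class AudioMetrics:
    ttfa_s: float      # time to first audio chunk
    latency_s: float   # total synthesis latency (last chunk received)
    duration_s: float  # decoded audio duration
    rtf: float         # real-time factor = latency / audio duration

def compute_audio_metrics(start: float,
                          chunk_arrival_times: List[float],
                          num_bytes: int,
                          sample_rate: int = 24_000,
                          bytes_per_sample: int = 2) -> AudioMetrics:
    """Aggregate streaming metrics, assuming raw 16-bit mono PCM audio."""
    ttfa = chunk_arrival_times[0] - start
    latency = chunk_arrival_times[-1] - start
    duration = num_bytes / (sample_rate * bytes_per_sample)
    rtf = latency / duration if duration > 0 else float("inf")
    return AudioMetrics(ttfa, latency, duration, rtf)

def consume_stream(chunks: Iterable[bytes],
                   clock: Callable[[], float] = time.monotonic) -> AudioMetrics:
    """Collect audio chunks from a provider stream, timestamping each arrival."""
    start = clock()
    arrivals, buf = [], bytearray()
    for chunk in chunks:  # chunks: bytes yielded by the streaming response
        arrivals.append(clock())
        buf.extend(chunk)
    return compute_audio_metrics(start, arrivals, len(buf))
```

An RTF below 1.0 means synthesis is faster than real time; the evaluator in the diagram would feed these per-request values into its CDF sketches.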
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
```python
        self.additional_sampling_params
    )

def build_tokenizer_provider(self):
```
We are trying to stay away from including logic in configuration classes. Can you explain why this PR creates these methods inside the various config classes?
Sure, will move it out of the `__post_init__`.
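The refactor discussed above could look roughly like the sketch below: the config stays a pure data holder, and the tokenizer-provider construction moves into a free function. All names here (`ClientConfig`, `build_tokenizer_provider`'s fallback behavior) are hypothetical, chosen only to illustrate the separation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ClientConfig:
    # Pure data: no construction logic in __post_init__.
    model: str
    tokenizer_name: Optional[str] = None

def build_tokenizer_provider(config: ClientConfig) -> str:
    """Factory kept outside the config class (hypothetical helper).

    Falls back to the model name when no explicit tokenizer is set.
    """
    return config.tokenizer_name or config.model
```

Keeping the factory outside the dataclass means the config remains trivially serializable and comparable, and the construction logic can be tested and swapped independently.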
```python
@frozen_dataclass
class TTSClientConfig(BaseClientConfig):
```
Would it be possible to not include modality-specific clients? Is there any fundamental reason for why we can't have a multimodal client that handles audio + other modalities, just like the skeleton for the OpenAI chat suggests?
An important design decision was that, within the limits of what's possible, requests can contain and request arbitrary payload channels, which breaks if we only support modality-specific clients. That would make it a lot harder to evaluate true multimodal systems.
I think we can have a single MultimodalClient that supports Text/Audio input channels and Text/Audio output channels.
The only issue I see is that there is no strong OpenAI-style standard for text-to-speech; it is split between ElevenLabs and Deepgram, each with its own way of doing things, so there might be a lot of conditionals in the MultimodalClient.
I see. How about following the steps of the OpenAI Router client for that? It's per-request, and yes, there will be conditionals, but at least we don't break the Veeksha UX.
FILL IN THE PR DESCRIPTION HERE
FIX #xxxx (link existing issues this PR will resolve)
BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE
PR Checklist
Thank you for your contribution to Veeksha! Before submitting the pull request, please ensure the PR meets the following criteria. This helps Veeksha maintain the code quality and improve the efficiency of the review process.
PR Title and Classification
Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:
- `[Bugfix]` for bug fixes.
- `[Feat]` for new features.
- `[Core]` for changes in the core benchmarking logic.
- `[CI/Build]` for build or continuous integration improvements.
- `[Docs]` for documentation fixes and improvements.
- `[Tests]` for changes in the test suite.
- `[Misc]` for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.
Code Quality
The PR needs to meet the following code quality standards:

- Use `make format` to format your code.
- Add documentation to `docs/source/` if the PR modifies the user-facing behaviors of Veeksha. It helps users understand and utilize the new features or changes.

Notes for Large Changes
Please keep the changes as concise as possible. For major architectural changes (>500 LOC), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with `rfc-required` and might not go through the PR.

Thank You
Finally, thank you for taking the time to read these guidelines and for your interest in contributing to Veeksha. Your contributions make Veeksha a great tool for everyone!