feat: Add Vertex AI Realtime provider implementing IRealtimeClient/IRealtimeClientSession #15553
tarekgh wants to merge 3 commits into googleapis:main
Conversation
Code Review
This pull request introduces support for Vertex AI live models by implementing the IRealtimeClient and IRealtimeClientSession interfaces. Key additions include a new WebSocket-based transport layer, internal types for the Gemini Live API, and source-generated JSON context to ensure Native AOT compatibility. Furthermore, the PredictionServiceChatClient has been optimized to handle tool arguments and results through direct conversion between objects and Protobuf Struct/Value types, avoiding inefficient JSON string round-tripping and adding a nesting depth limit. Feedback focuses on further optimizing the transport layer by using pooled buffers and direct stream deserialization, as well as improving the normalization of tool payloads for unknown types.
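The direct object ↔ Protobuf `Struct`/`Value` conversion with a nesting depth limit could look roughly like the following sketch. This is not the PR's actual code: the helper name `ToProtoValue`, the `MaxDepth` constant, and the fallback behavior for unknown types are illustrative assumptions; only the `Google.Protobuf.WellKnownTypes` APIs are real.

```csharp
using System;
using System.Collections.Generic;
using Google.Protobuf.WellKnownTypes;

internal static class ProtoValueSketch
{
    private const int MaxDepth = 32; // illustrative cap, mirroring the PR's depth limit

    // Recursively convert a CLR object to a Protobuf Value without going through
    // a JSON string (i.e. no JsonSerializer.Serialize + Struct.Parser.ParseJson).
    public static Value ToProtoValue(object? value, int depth = 0)
    {
        if (depth > MaxDepth)
        {
            throw new InvalidOperationException($"Nesting depth exceeds {MaxDepth}.");
        }

        return value switch
        {
            null => Value.ForNull(),
            bool b => Value.ForBool(b),
            string s => Value.ForString(s),
            int i => Value.ForNumber(i),
            double d => Value.ForNumber(d),
            IDictionary<string, object?> map => ToStructValue(map, depth),
            IEnumerable<object?> list => ToListValue(list, depth),
            // Illustrative fallback; the PR normalizes unknown types differently.
            _ => Value.ForString(value.ToString() ?? string.Empty),
        };
    }

    private static Value ToStructValue(IDictionary<string, object?> map, int depth)
    {
        var s = new Struct();
        foreach (var (key, item) in map)
        {
            s.Fields[key] = ToProtoValue(item, depth + 1);
        }
        return Value.ForStruct(s);
    }

    private static Value ToListValue(IEnumerable<object?> list, int depth)
    {
        var values = new List<Value>();
        foreach (var item in list)
        {
            values.Add(ToProtoValue(item, depth + 1));
        }
        return Value.ForList(values.ToArray());
    }
}
```

The key point is that every value is mapped structurally, so no reflection-based serializer runs on the tool-call hot path.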
I'll get to this sometime Tuesday/Wednesday. Thanks!
Thanks so much for your help on this! As noted in the description, there are a few additional changes in the SDK itself to support AOT compatibility. I'd also really appreciate any help in getting visibility or support for this PR as well: googleapis/dotnet-genai#256.
amanda-tarafa left a comment
I've done a quick review and left a few comments.
For now, we won't be supporting a handwritten client, even if that's required for AoT. We want to keep this library lightweight, as a simple wrapper of the generated Google.Cloud.AiPlatform.V1.
Making the generated code AoT compatible is on this year's roadmap, and at that point Google.Cloud.VertexAI.Extensions will benefit from it.
Inline comment on:

```
namespace Google.GenAI;

internal sealed class Client : IAsyncDisposable
```

Can't we use the generated client here? See the other extensions.
Thanks for the review, @amanda-tarafa! I'll address the inline comments (history.md, version bump, namespace); those are straightforward.

Regarding the architectural concern about the handwritten client, I want to explain why the Realtime provider can't follow the same pattern as the other extensions. Please correct me if I got anything wrong here; I may be missing context about the generated SDK's roadmap or capabilities.

The other extensions wrap a generated client. The Realtime provider can't do this: the Vertex AI Live API uses a completely different transport and service. It's a WebSocket-based bidirectional streaming endpoint on a service called LlmBidiService, and that service has no generated client. The generated bidi-streaming gRPC methods can't be used instead either, because the server only accepts WebSocket connections for the Live API. This is the same approach used in the dotnet-genai sister PR, which also implements a WebSocket-based transport for the same endpoint.

In summary: the handwritten WebSocket transport isn't a choice to bypass the generated client; it's a necessity because no generated client exists for this API surface. Happy to discuss the best path forward.
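To make the "handwritten WebSocket transport" concrete, here is a minimal sketch of the kind of transport being discussed. The endpoint path and the use of a raw bearer header are illustrative assumptions, not the PR's exact values; only the `ClientWebSocket` APIs are real.

```csharp
using System;
using System.Net.WebSockets;
using System.Threading;
using System.Threading.Tasks;

// Minimal sketch of a handwritten WebSocket transport for a bidi endpoint.
internal sealed class LiveWebSocketSketch : IAsyncDisposable
{
    private readonly ClientWebSocket _socket = new();

    public async Task ConnectAsync(string host, string accessToken, CancellationToken ct)
    {
        // The real provider resolves this token from ADC / service-account credentials.
        _socket.Options.SetRequestHeader("Authorization", $"Bearer {accessToken}");

        // Hypothetical bidi endpoint path; the actual URI is defined by the Live API service.
        var uri = new Uri($"wss://{host}/ws/google.cloud.aiplatform.v1.LlmBidiService/BidiGenerateContent");
        await _socket.ConnectAsync(uri, ct).ConfigureAwait(false);
    }

    public async ValueTask DisposeAsync()
    {
        if (_socket.State == WebSocketState.Open)
        {
            // Graceful close with a timeout, as the PR describes for its transport.
            using var closeCts = new CancellationTokenSource(TimeSpan.FromSeconds(5));
            try
            {
                await _socket.CloseAsync(WebSocketCloseStatus.NormalClosure, "done", closeCts.Token)
                    .ConfigureAwait(false);
            }
            catch (OperationCanceledException)
            {
                // Close handshake timed out; fall through to Dispose.
            }
        }
        _socket.Dispose();
    }
}
```

There is no gRPC channel involved at all, which is why the generated `Google.Cloud.AIPlatform.V1` client can't be layered underneath this.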
@tarekgh I ran the checks and they are failing because of some dependencies we had to update. You'll need to rebase on main.
…ealtimeClientSession

Add a Vertex AI Live API provider implementing the Microsoft.Extensions.AI Realtime abstractions (IRealtimeClient / IRealtimeClientSession), enabling real-time audio, text, image, and function-calling conversations with Vertex AI models through the standardized MEAI interface.

New files:
- PredictionServiceRealtimeClient.cs
- PredictionServiceRealtimeSession.cs
- InternalLiveTransport.cs / InternalLiveTypes.cs / InternalLiveJsonContext.cs
- BuildIRealtimeClientTest.cs / AotJsonContextTest.cs

Also includes AOT improvements to PredictionServiceChatClient.cs, avoiding JSON round-tripping for tool arguments/results.
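A source-generated `JsonSerializerContext` of the kind this commit adds might look like the following sketch. The message shape and all names here (`LiveClientMessageSketch`, `LiveJsonContextSketch`) are illustrative stand-ins, not the PR's internal Live API types; the source-generation attributes themselves are standard `System.Text.Json`.

```csharp
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

// Illustrative message shape; the real Live API types are internal to the PR.
internal sealed class LiveClientMessageSketch
{
    [JsonPropertyName("setup")]
    public Dictionary<string, object?>? Setup { get; set; }

    [JsonPropertyName("text")]
    public string? Text { get; set; }
}

// Source-generated context: serialization metadata is emitted at compile time,
// so no reflection is needed at runtime (Native AOT friendly). Property names
// come only from [JsonPropertyName], matching the PR's approach.
[JsonSourceGenerationOptions(DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull)]
[JsonSerializable(typeof(LiveClientMessageSketch))]
[JsonSerializable(typeof(Dictionary<string, object?>))]
internal partial class LiveJsonContextSketch : JsonSerializerContext
{
}

internal static class Demo
{
    public static string Serialize(LiveClientMessageSketch message) =>
        JsonSerializer.Serialize(message, LiveJsonContextSketch.Default.LiveClientMessageSketch);
}
```

Registering `Dictionary<string, object?>` explicitly mirrors the commit's note about function-call arguments needing a `[JsonSerializable]` entry.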
…ream deserialization

- NormalizeToolPayload: Use JsonSerializer.SerializeToElement with AIJsonUtilities.DefaultOptions for unknown POCO types instead of ToString(), consistent with PredictionServiceChatClient.
- ReceiveAsync: Use the cached _receiveBuffer field instead of allocating 4 KB per call, reducing GC pressure in high-frequency audio streaming.
- ReceiveAsync: Deserialize directly from the MemoryStream instead of ToArray() + UTF8.GetString(), avoiding intermediate copies.
- Revert version from beta08 to beta07 (release pipeline handles versioning)
- Remove manual beta08 release notes from history.md
- Move internal Live API types from Google.GenAI/Google.GenAI.Types namespaces to Google.Cloud.VertexAI.Extensions.Live to follow project conventions
- Move LiveJsonContext to Google.Cloud.VertexAI.Extensions namespace
- Rename internal Type class to SchemaType to avoid System.Type conflict
- Update tests to reflect new namespace paths
tarekgh force-pushed from 1e5d37f to 8fd5545
@amanda-tarafa I have rebased and should be ready now. Thanks!
Summary
Adds a Vertex AI Live API provider implementing the Microsoft.Extensions.AI Realtime abstractions (`IRealtimeClient` / `IRealtimeClientSession`), enabling real-time audio, text, image, and function-calling conversations with Vertex AI models through the standardized MEAI interface. This follows the same pattern as the Gemini Realtime provider PR in the dotnet-genai repository, and is consistent with the existing `IChatClient` implementation (`PredictionServiceChatClient`) in this package.

AOT Compatibility
This PR also includes cross-cutting AOT (Ahead-of-Time) compilation improvements that span the entire SDK, not just the realtime provider:
Realtime provider AOT support:
- `InternalLiveJsonContext` — a source-generated `JsonSerializerContext` with `[JsonSerializable]` entries for all Live API types (including `Dictionary<string, object?>` for function call arguments)
- Serialization goes through `LiveJsonContext.Default.LiveClientMessage` / `LiveJsonContext.Default.LiveServerMessage`
- Property names come from `[JsonPropertyName]` attributes — no reflection-based naming
- `AotJsonContextTest.cs` verifying source-gen coverage, nested type auto-discovery, round-trip correctness, and `DefaultIgnoreCondition.WhenWritingNull` behavior

Chat client AOT improvement (`PredictionServiceChatClient.cs`):
`Struct` values were previously converted via `Struct.Parser.ParseJson(JsonSerializer.Serialize(value))` — a serialize-then-parse pattern that relies on reflection-based `JsonSerializer.Serialize`. The chat client now uses direct `Struct` ↔ dictionary conversion, avoiding `System.Text.Json` entirely on the tool-call hot path. This makes the existing `IChatClient` more AOT-friendly as a side effect.

Usage Example
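A minimal sketch of the builder entry point described above. This is hedged: the endpoint value is illustrative, and session creation/options are omitted because their exact shapes aren't shown in this excerpt; only `BuildIRealtimeClient()` on `PredictionServiceClientBuilder` is named by the PR.

```csharp
using Google.Cloud.AIPlatform.V1;
using Microsoft.Extensions.AI;

// Build an IRealtimeClient from the standard PredictionServiceClientBuilder,
// mirroring the existing BuildIChatClient pattern. Credentials resolve via
// ADC by default, as with the other extensions.
IRealtimeClient client = new PredictionServiceClientBuilder
{
    Endpoint = "us-central1-aiplatform.googleapis.com", // illustrative regional endpoint
}.BuildIRealtimeClient();

// Sessions are then created through the IRealtimeClient abstraction and used
// for real-time audio/text exchange (see IRealtimeClientSession below).
```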
What's Included
New Files
- `PredictionServiceRealtimeClient.cs` — `IRealtimeClient` implementation that wraps a `PredictionServiceClientBuilder`, resolves credentials (ADC, service account JSON, scoped OAuth), builds the WebSocket connection, and creates realtime sessions via the Vertex AI Live API.
- `PredictionServiceRealtimeSession.cs` — `IRealtimeClientSession` implementation that manages the WebSocket connection, audio buffering with ActivityStart/ActivityEnd framing, message mapping, image sending via `clientContent`, and function call orchestration.
- `InternalLiveTransport.cs` — Internal WebSocket transport (`Client`, `Live`, `AsyncSession`) handling connection lifecycle, credential headers, binary frame send/receive, and graceful disposal with close timeouts.
- `InternalLiveTypes.cs` — Internal JSON-serializable types for the Vertex AI Live API protocol (client messages, server messages, blobs, function calls, schemas, etc.).
- `InternalLiveJsonContext.cs` — Source-generated `JsonSerializerContext` for AOT-safe serialization of all Live API types.
- `BuildIRealtimeClientTest.cs` — 42 unit tests covering client construction, session config mapping, audio commit flow, message mapping, function call handling, disposal, and edge cases.
- `AotJsonContextTest.cs` — 4 unit tests verifying AOT source-gen coverage, nested type discovery, and serialization round-trips.

Modified Files
- `VertexAIExtensions.cs` — Added `BuildIRealtimeClient()` / `BuildIRealtimeClientAsync()` extension methods on `PredictionServiceClientBuilder`.
- `PredictionServiceChatClient.cs` — Eliminated JSON round-tripping for tool arguments/results to improve AOT compatibility (direct `Struct` ↔ dictionary conversion).
- `BuildIChatClientTest.cs` — Updated tests reflecting the chat client tool-handling changes.
- `Google.Cloud.VertexAI.Extensions.csproj` — Version bumped to `1.0.0-beta08`; added `Microsoft.Bcl.AsyncInterfaces` dependency for `netstandard2.0` / `net462`.
- `docs/history.md` — Release notes for `1.0.0-beta08`.

Features
- Vertex AI lacks `audioStreamEnd`; automatic activity detection is always disabled in favor of explicit ActivityStart/ActivityEnd framing that reliably triggers model responses
- Images sent via `clientContent` with `inlineData` for proper multimodal conversation context
- Works with the `FunctionInvokingRealtimeSession` middleware; tool responses batched into a single `SendToolResponseAsync` call
- A `SemaphoreSlim` serializes all WebSocket sends, safe for concurrent middleware + caller usage
- `CreateSessionAsync` waits for the server's `SetupComplete` before returning
- Credentials: ADC, service account JSON (`JsonCredentials`), and explicit `GoogleCredential` — all automatically scoped with the `cloud-platform` OAuth scope
- `ConvertJsonSchemaToGoogleSchema` enforces MaxDepth=32 to prevent stack overflow from deeply nested schemas
- Usage reported via `ResponseDone` messages, correctly handled even when `TurnComplete` and `UsageMetadata` arrive in the same server message

Key Design Decisions
Always manual activity detection — Vertex AI does not support `audioStreamEnd` (confirmed by dotnet-genai's `LiveConverters.cs`). When automatic activity detection is enabled server-side, the server ignores manual ActivityStart/ActivityEnd signals, leaving no way to trigger a response. The provider always forces `AutomaticActivityDetection.Disabled = true` and uses explicit ActivityStart → audio → ActivityEnd framing regardless of the user's `VoiceActivityDetection.Enabled` setting. The `AllowInterruption` option is still respected via `activityHandling`.

Images via `clientContent` — Static images are sent via `clientContent` with `inlineData` parts (proper conversation content) rather than `realtimeInput.video` (designed for streaming video frames). This ensures the model properly processes images as part of the conversation context.

Tool response batching — The MEAI `FunctionInvokingRealtimeSession` middleware sends a separate `CreateConversationItem` per function result. Gemini expects all results in one `SendToolResponseAsync` call. The provider buffers results and flushes them as a single batch when `CreateResponse` arrives.

TurnComplete suppression after tool responses — After `SendToolResponseAsync`, the Gemini model automatically continues generating. The provider tracks this via `_lastSendWasToolResponse` and skips redundant triggers.

SetupComplete handshake — The `CreateSessionAsync` method drains the server's `SetupComplete` acknowledgment before returning, ensuring the session is fully ready before the caller sends audio or text.

Audio buffer cap — Audio appends are capped at 10 MB to prevent unbounded memory growth. Frames exceeding 32 KB are automatically split.

Consistent with IChatClient — The realtime provider follows the same patterns as `PredictionServiceChatClient`: same namespace, same builder extension methods (`BuildIRealtimeClient` / `BuildIRealtimeClientAsync`), same credential resolution, same `GetService` pattern exposing the underlying `Client` via `IServiceProvider`.

Test Coverage
151 unit tests (42 new + 109 existing), covering:
- `BuildLiveConnectConfig` option combinations (modalities, voice, tools, transcription, VAD, max tokens)
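The 32 KB frame splitting described under Key Design Decisions can be sketched like this. The constant and method name are illustrative, not the PR's actual code; the technique (slicing a large audio buffer into per-frame sends) is what the description states.

```csharp
using System;
using System.Collections.Generic;

internal static class AudioFramingSketch
{
    private const int MaxFrameBytes = 32 * 1024; // illustrative 32 KB cap per WebSocket frame

    // Split a large audio buffer into frames no larger than MaxFrameBytes,
    // so each WebSocket send stays within the per-frame budget. ReadOnlyMemory
    // slices avoid copying the underlying audio bytes.
    public static IEnumerable<ReadOnlyMemory<byte>> SplitFrames(ReadOnlyMemory<byte> audio)
    {
        for (int offset = 0; offset < audio.Length; offset += MaxFrameBytes)
        {
            int length = Math.Min(MaxFrameBytes, audio.Length - offset);
            yield return audio.Slice(offset, length);
        }
    }
}
```

A separate running total against the 10 MB append cap would sit in the session's buffering logic, outside this helper.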