When using the Gemini Live API (google-genai .NET SDK v1.5.0) with AutomaticActivityDetection for server-side VAD, the model fails to detect end-of-speech and takes 10-40 seconds to respond. The likely root cause is that our audio source filters out silent audio frames before forwarding them to Gemini, so the server-side VAD never receives the silence needed to satisfy SilenceDurationMs.
We attempted two workarounds: sending zero-filled PCM silence buffers, and switching to explicit VAD (Disabled = true) with ActivityStart/ActivityEnd signals. Neither is viable, because the server immediately closes the WebSocket connection whenever explicit VAD mode is used.
Environment:
• SDK: Google.GenAI NuGet package v1.5.0
• Runtime: .NET 8.0
• Model: gemini-2.5-flash-native-audio-latest or gemini-2.5-flash-native-audio-preview-12-2025
• Audio format: PCM 24kHz 16-bit mono (audio/pcm;rate=24000)
Description:
We are building a real-time voice agent using the Gemini Live API. Audio is received from Azure Communication Services (ACS) via WebSocket. ACS provides an IsSilent flag on each audio frame and filters silent frames (returns null instead of audio bytes).
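For context, this is a sketch of our audio pump, assuming hypothetical helper types (AcsFrame, ParseAcsFrame, LiveSession) that stand in for our own glue code; only SendRealtimeInputAsync and the MIME type come from the SDK, and its exact signature here is an assumption:

```csharp
// Sketch only: AcsFrame / ParseAcsFrame / LiveSession are our own
// abstractions, not part of the ACS or Google.GenAI SDKs.
async Task PumpAudioAsync(WebSocket acsSocket, LiveSession session, CancellationToken ct)
{
    var buffer = new byte[4096];
    while (!ct.IsCancellationRequested)
    {
        var result = await acsSocket.ReceiveAsync(new ArraySegment<byte>(buffer), ct);
        AcsFrame? frame = ParseAcsFrame(buffer, result.Count); // hypothetical parser
        if (frame == null || frame.IsSilent)
            continue; // ACS filters silence, so nothing is forwarded during pauses

        // Forward only audible audio to Gemini (signature assumed)
        await session.SendRealtimeInputAsync(frame.AudioBytes, "audio/pcm;rate=24000");
    }
}
```

The `continue` on silent frames is exactly the behavior that, we believe, starves the server-side VAD of the silence it needs.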
Problem 1: Server-side VAD never triggers end-of-speech
With AutomaticActivityDetection configured:
RealtimeInputConfig = new RealtimeInputConfig
{
    AutomaticActivityDetection = new AutomaticActivityDetection
    {
        PrefixPaddingMs = 200,
        SilenceDurationMs = 500,
        EndOfSpeechSensitivity = EndSensitivity.EndSensitivityHigh,
        StartOfSpeechSensitivity = StartSensitivity.StartSensitivityHigh
    },
    TurnCoverage = TurnCoverage.TurnIncludesOnlyActivity,
    ActivityHandling = ActivityHandling.StartOfActivityInterrupts
}
We only send non-silent audio frames via SendRealtimeInputAsync. Since ACS filters silence, no audio is sent during pauses. The server-side VAD apparently requires continuous audio (including silence) to detect the speech-to-silence transition and trigger end-of-speech based on SilenceDurationMs. Without receiving silence frames, the VAD never fires, and the model falls back to a long internal timeout (~10-40 seconds) before responding.
Workaround attempted — sending zero-filled silence buffers:
We tried sending 960-byte zero-filled PCM buffers (20ms at 24kHz/16-bit) when ACS reports silence. This partially worked but:
• With TurnIncludesAllInput, the model context grew rapidly (50 frames/sec of silence), causing increasing response latency
• With TurnIncludesOnlyActivity, response time improved but was still inconsistent
• Interruption (user speaking during model output) was unreliable; the model was slow to detect the start of speech after a run of zero-filled frames
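The silence-buffer workaround looked like the following sketch (AcsFrame and LiveSession are hypothetical stand-ins for our own types; the SendRealtimeInputAsync signature is assumed):

```csharp
// 20 ms of PCM silence at 24 kHz, 16-bit mono:
// 24000 samples/s * 0.020 s * 2 bytes/sample = 960 bytes.
private static readonly byte[] SilenceFrame = new byte[960];

// When ACS reports IsSilent, forward a zero-filled frame instead of dropping
// the frame, so the server-side VAD can observe the speech-to-silence
// transition and count down SilenceDurationMs.
async Task ForwardFrameAsync(LiveSession session, AcsFrame frame, CancellationToken ct)
{
    byte[] payload = frame.IsSilent ? SilenceFrame : frame.AudioBytes;
    await session.SendRealtimeInputAsync(payload, "audio/pcm;rate=24000");
}
```

This restores the transition the VAD needs, but at the cost of streaming ~50 frames/sec of silence for the lifetime of the call, with the context-growth and interruption problems listed above.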
Problem 2: Explicit VAD (Disabled = true) causes immediate session close
We then tried switching to explicit VAD to send ActivityStart/ActivityEnd signals based on ACS's IsSilent flag:
RealtimeInputConfig = new RealtimeInputConfig
{
    AutomaticActivityDetection = new AutomaticActivityDetection
    {
        Disabled = true
    },
    TurnCoverage = TurnCoverage.TurnIncludesOnlyActivity,
    ActivityHandling = ActivityHandling.StartOfActivityInterrupts
}
With this config, the session completes setup successfully (SetupComplete received), but the server immediately closes the WebSocket after we send SendClientContentAsync with the initial greeting. ReceiveAsync(ArraySegment, CancellationToken) returns null right after setup.
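The explicit-VAD path we intended (but never got to exercise, since the session closes first) translates ACS IsSilent edges into activity signals. This is a sketch: ActivityStart/ActivityEnd come from the Live API message shape, but the exact SDK types and the SendRealtimeInputAsync overloads shown here are assumptions:

```csharp
// Sketch: emit ActivityStart on the silent->audible edge and ActivityEnd on
// the audible->silent edge, sending audio only while the user is speaking.
// AcsFrame / LiveSession are hypothetical; overloads are assumed.
bool speaking = false;

async Task OnFrameAsync(LiveSession session, AcsFrame frame)
{
    if (!frame.IsSilent && !speaking)
    {
        speaking = true;
        await session.SendRealtimeInputAsync(new ActivityStart());
    }

    if (!frame.IsSilent)
        await session.SendRealtimeInputAsync(frame.AudioBytes, "audio/pcm;rate=24000");

    if (frame.IsSilent && speaking)
    {
        speaking = false;
        await session.SendRealtimeInputAsync(new ActivityEnd());
    }
}
```

In practice the server never let us reach this loop: the socket closes immediately after setup whenever Disabled = true is set.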
We verified:
• Without Disabled = true (using automatic VAD), the session stays alive and the model responds
• With Disabled = true, the session always closes immediately after setup
This makes explicit VAD unusable.