Skip to content

feat(engine): implement dynamic VAD thresholds for playback echo supp…#52

Merged
jjleng merged 1 commit intomainfrom
dynamic-vad
Apr 16, 2026
Merged

feat(engine): implement dynamic VAD thresholds for playback echo supp…#52
jjleng merged 1 commit intomainfrom
dynamic-vad

Conversation

@jjleng
Copy link
Copy Markdown
Contributor

@jjleng jjleng commented Apr 16, 2026

…ression

@jjleng
Copy link
Copy Markdown
Contributor Author

jjleng commented Apr 16, 2026

@claude

@claude
Copy link
Copy Markdown

claude bot commented Apr 16, 2026

Claude finished @jjleng's task in 1m 37s —— View job


Code Review

  • Gather context
  • Read changed files
  • Analyze and post review

Overall this is a clean, well-motivated change. The design decisions are sound and the code is clear. A few observations below, ordered by importance.

What's well done

  • Named constants over magic numbersVAD_THRESHOLD_IDLE/PLAYBACK/PLAYBACK_RAW with doc-comments explaining why each value is chosen is much better than the previous hardcoded 0.7. Future tuning will be easy.
  • Packet-level threshold decision — setting the threshold once before process_frames() rather than once per frame is correct and efficient.
  • Dedup guard in VadStage::set_threshold — the if self.inner.threshold() != threshold check avoids unnecessary writes on every packet. Good micro-optimization.
  • Path-specific thresholds — distinguishing denoised (0.85) vs raw (0.90) paths is architecturally correct; the doc-comments make the reasoning explicit.

Observations

1. Mid-utterance threshold elevation

[reactor/audio.rs:36-40] If the bot starts playing audio while the user is already speaking, the threshold jumps from 0.70 → 0.85 mid-utterance. Frames where the VAD probability falls in the [0.70, 0.85) range will now register as silence, incrementing silence_frame_count. After 6 frames (~192ms) the VAD would emit a premature SpeechEnd, silently dropping the barge-in.

Whether this is acceptable depends on how often a bot starts playback while the user is mid-speech. If that scenario occurs (e.g., re-engagement prompts) it could cause silent barge-in failures that are hard to debug. A conservative guard:

// Only elevate threshold between utterances — mid-speech, keep the gate open.
if !self.vad.inner.is_speaking() {
    self.vad.set_threshold(if is_playing { ... } else { ... });
}

…though that would need a public is_speaking() on VadStage. Alternatively, documenting the known behaviour is sufficient if this scenario is considered unlikely. Fix this →

2. Denoiser-off edge case

[reactor/audio.rs:37] When denoise_enabled == false, audio reaches VAD unfiltered — the same situation as the native multimodal path — yet the Reactor path still uses VAD_THRESHOLD_PLAYBACK = 0.85 instead of VAD_THRESHOLD_PLAYBACK_RAW = 0.90. If echo suppression in raw audio was the motivation for the higher raw threshold, this path should arguably use VAD_THRESHOLD_PLAYBACK_RAW when the denoiser is disabled. Worth considering, even if it's an uncommon config.

3. set_min_volume() is added but not wired up

Both SileroVad and VadStage gain set_min_volume() accessors, but nothing in this PR calls them during playback switching. Raising min_volume during playback would provide a second layer of echo gating (amplitude-level, before the neural network). If this is deferred intentionally, a small comment like // reserved for future playback gating would prevent confusion.

4. No tests for threshold switching

SileroVad::process_with_prob was clearly designed for unit testing without the ONNX model. A simple test asserting that VadStage::set_threshold changes detection behavior (e.g., prob=0.80 → SpeechStart at idle threshold, but silent at playback threshold) would meaningfully increase confidence here. Low effort given the existing test scaffolding.


Minor nits

  • [reactor/audio.rs:27-31] The new is_playing comment block is inserted mid-way through the existing SAFETY NOTE comment, breaking its flow. Suggest moving the is_playing block and its comment after the safety note ends (after the closing paren on line 26 in the original).
  • [reactor/proc/vad.rs:22] float != float is correct here since values come from named constants, but worth noting if this comparison is ever extended to computed thresholds.

@jjleng
Copy link
Copy Markdown
Contributor Author

jjleng commented Apr 16, 2026

#51

@jjleng jjleng linked an issue Apr 16, 2026 that may be closed by this pull request
@jjleng
Copy link
Copy Markdown
Contributor Author

jjleng commented Apr 16, 2026

@claude

@claude
Copy link
Copy Markdown

claude bot commented Apr 16, 2026

Claude finished @jjleng's task in 1m 50s —— View job


Unable to access issue #51


I've reviewed all five changed files in this PR, but I'm unable to access issue #51 — both gh issue view and web fetch are blocked by the current tool permissions.

To unblock me, you can either:

  1. Grant gh CLI read permissions — add gh issue view * to the project's .claude/settings.json allowlist (or run /less-permission-prompts to auto-generate an allowlist), or

  2. Paste the issue body here — copy the issue title + description into a reply so I can act on it directly.

Once I can read the issue I'll know exactly what to implement or review.

@jjleng jjleng merged commit 039fde0 into main Apr 16, 2026
1 check passed
@jjleng jjleng deleted the dynamic-vad branch April 16, 2026 20:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Dynamic VAD Thresholding to Prevent False Barge-ins During Playback

1 participant