🐛 VAD Aggressively Clips Speech - Missing First/Last Words #260

@rosscado

Problem

The Voice Activity Detection (VAD) system is clipping speech too aggressively, resulting in significant accuracy degradation during dictation sessions.

Symptoms

  1. Leading edge clipping: First words of utterances are frequently cut off
  2. Aggressive end detection: Segments end too early, so trailing words are cut off and complete thoughts are not captured
  3. Compounding effect: In a single conversational turn with 10 segments, if 8 segments have their first word clipped, the transcription is missing 8 words total
  4. Critical impact: This severely degrades transcription accuracy

Root Cause

Current VAD tuning parameters are too aggressive for natural speech patterns. The Silero VAD v5 configuration needs adjustment to:

  • Better capture the start of speech (leading edge)
  • Allow more time before declaring end-of-speech
  • Provide more tolerance for brief pauses within continuous speech

Current VAD Parameters

From src/vad/VADConfigs.ts:

High Sensitivity (most commonly used for dictation):

positiveSpeechThreshold: 0.35,
negativeSpeechThreshold: 0.2,
redemptionFrames: 12,
minSpeechFrames: 2,
preSpeechPadFrames: 3,

Balanced (default):

positiveSpeechThreshold: 0.4,
negativeSpeechThreshold: 0.25,
redemptionFrames: 10,
minSpeechFrames: 3,
preSpeechPadFrames: 2,
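
To make the interaction between these five parameters concrete, here is a minimal sketch of a frame-based segmenter. It is a simplified model written for this issue, not the actual Silero or VAD library implementation, and the names `VadParams`, `Segment`, and `segmentFrames` are hypothetical. It assumes the VAD emits one speech probability per frame.

```typescript
// Simplified model of frame-based VAD segmentation (illustration only,
// NOT the library's real algorithm). All names here are hypothetical.
interface VadParams {
  positiveSpeechThreshold: number; // probability at or above this starts/extends speech
  negativeSpeechThreshold: number; // probability below this counts toward end-of-speech
  redemptionFrames: number;        // consecutive low frames needed to end a segment
  minSpeechFrames: number;         // segments with fewer speech frames are discarded
  preSpeechPadFrames: number;      // frames prepended before the detected start
}

interface Segment {
  startFrame: number; // inclusive, after pre-speech padding
  endFrame: number;   // inclusive
}

function segmentFrames(probs: number[], p: VadParams): Segment[] {
  const segments: Segment[] = [];
  let speaking = false;
  let start = 0;
  let speechFrames = 0; // frames at or above the positive threshold
  let lowFrames = 0;    // consecutive frames below the negative threshold

  for (let i = 0; i < probs.length; i++) {
    const prob = probs[i];
    if (!speaking) {
      if (prob >= p.positiveSpeechThreshold) {
        speaking = true;
        // Padding is the only way frames before the trigger are captured.
        start = Math.max(0, i - p.preSpeechPadFrames);
        speechFrames = 1;
        lowFrames = 0;
      }
      continue;
    }
    if (prob >= p.positiveSpeechThreshold) {
      speechFrames++;
      lowFrames = 0; // speech resumed, so the pause is forgiven
    } else if (prob < p.negativeSpeechThreshold) {
      lowFrames++;
      if (lowFrames >= p.redemptionFrames) {
        if (speechFrames >= p.minSpeechFrames) {
          segments.push({ startFrame: start, endFrame: i });
        }
        speaking = false;
      }
    }
    // Probabilities between the two thresholds leave both counters unchanged.
  }
  if (speaking && speechFrames >= p.minSpeechFrames) {
    segments.push({ startFrame: start, endFrame: probs.length - 1 });
  }
  return segments;
}

// A soft onset (0.3 at index 3) sits below the positive threshold,
// so it is only kept if the padding reaches back far enough.
const onset = [0.1, 0.1, 0.1, 0.3, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1];
const base = {
  positiveSpeechThreshold: 0.4,
  negativeSpeechThreshold: 0.25,
  redemptionFrames: 3,
  minSpeechFrames: 2,
};
const padded = segmentFrames(onset, { ...base, preSpeechPadFrames: 2 });
const unpadded = segmentFrames(onset, { ...base, preSpeechPadFrames: 0 });
console.log(padded, unpadded); // padded starts at frame 2, unpadded at frame 4
```

In this model, a quiet first syllable is lost whenever `preSpeechPadFrames` is smaller than the number of sub-threshold onset frames, and a pause longer than `redemptionFrames` splits one utterance into two segments.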

Recommended Investigation

  1. Increase preSpeechPadFrames: Currently 2-3 frames. Try 5-8 frames to capture more leading speech
  2. Increase redemptionFrames: Currently 10-12 frames. Try 15-20 frames for longer tail capture and tolerance for brief pauses
  3. Adjust negativeSpeechThreshold: Currently 0.2-0.25. Try 0.15-0.2 to avoid premature segment ends
  4. Test with real audio: Use actual conversation recordings to validate parameter changes
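
To reason about these ranges in wall-clock terms and enumerate combinations to test, a sketch like the following may help. The frame size (512 samples at 16 kHz, i.e. 32 ms per frame) is an assumption and should be checked against the frame size the project actually configures. `framesToMs`, `Candidate`, and `candidateGrid` are hypothetical names, and the grid only uses the endpoints of the ranges suggested above.

```typescript
// ASSUMPTION: 16 kHz audio with 512-sample frames. Verify against the
// project's real VAD configuration before relying on these numbers.
const SAMPLE_RATE_HZ = 16000;
const FRAME_SAMPLES = 512;

// Convert a frame count into milliseconds of audio.
function framesToMs(
  frames: number,
  frameSamples: number = FRAME_SAMPLES,
  sampleRateHz: number = SAMPLE_RATE_HZ,
): number {
  return (frames * frameSamples * 1000) / sampleRateHz;
}

// Hypothetical shape for one parameter combination under test.
interface Candidate {
  preSpeechPadFrames: number;
  redemptionFrames: number;
  negativeSpeechThreshold: number;
}

// Cartesian product over the endpoints of the suggested ranges.
function candidateGrid(): Candidate[] {
  const pads = [5, 8];
  const redemptions = [15, 20];
  const negatives = [0.15, 0.2];
  const grid: Candidate[] = [];
  for (const preSpeechPadFrames of pads) {
    for (const redemptionFrames of redemptions) {
      for (const negativeSpeechThreshold of negatives) {
        grid.push({ preSpeechPadFrames, redemptionFrames, negativeSpeechThreshold });
      }
    }
  }
  return grid;
}

for (const c of candidateGrid()) {
  console.log(
    `pad=${framesToMs(c.preSpeechPadFrames)}ms`,
    `redemption=${framesToMs(c.redemptionFrames)}ms`,
    `negative=${c.negativeSpeechThreshold}`,
  );
}
```

Under that frame-size assumption, the current 2-3 frames of pre-speech padding cover only 64-96 ms of audio, while 5-8 frames would cover 160-256 ms.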

Testing Approach

  1. Enable keepSegments config to persist captured audio segments
  2. Record test dictation sessions with known content
  3. Compare captured WAV files against expected speech
  4. Measure:
    • Number of segments per utterance
    • Percentage of segments with leading/trailing clipping
    • Word accuracy before/after parameter tuning
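
The measurements above could be computed with a small script along these lines. This is a sketch, not existing project code: `wordErrorRate`, `SegmentReview`, and `clippingStats` are hypothetical names, the tokenizer assumes English text, and the clipping flags are assumed to come from manually reviewing each saved WAV segment.

```typescript
// Lowercase, strip punctuation, and split into words (English-only sketch).
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9'\s]/g, " ")
    .split(/\s+/)
    .filter((w) => w.length > 0);
}

// Word error rate: word-level edit distance divided by reference length.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;

  // Levenshtein distance over words, keeping only the previous row.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // word missing from the hypothesis
        curr[j - 1] + 1,    // extra word in the hypothesis
        prev[j - 1] + cost, // match or substitution
      );
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// Hypothetical result of manually reviewing one saved segment.
interface SegmentReview {
  leadingClipped: boolean;
  trailingClipped: boolean;
}

// Percentage of segments with leading or trailing clipping.
function clippingStats(reviews: SegmentReview[]): { leadingPct: number; trailingPct: number } {
  if (reviews.length === 0) return { leadingPct: 0, trailingPct: 0 };
  const leading = reviews.filter((r) => r.leadingClipped).length;
  const trailing = reviews.filter((r) => r.trailingClipped).length;
  return {
    leadingPct: (100 * leading) / reviews.length,
    trailingPct: (100 * trailing) / reviews.length,
  };
}

// One clipped leading word out of four reference words.
console.log(wordErrorRate("the quick brown fox", "quick brown fox")); // 0.25
```

Running the same recordings through each parameter set and comparing these numbers before and after gives a like-for-like view of whether a change reduces clipping without fragmenting utterances further.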

Related Code

Priority

Critical - This issue directly impacts core transcription accuracy and user experience.


Discovered during testing of dual-phase transcription feature (#256)

Labels: audio (Related to audio input or output quality), bug (Something isn't working)