Open
Labels
- audio: Related to audio input or output quality
- bug: Something isn't working
Description
Problem
The Voice Activity Detection (VAD) system is clipping speech too aggressively, resulting in significant accuracy degradation during dictation sessions.
Symptoms
- Leading edge clipping: First words of utterances are frequently cut off
- Aggressive end detection: Segments end very quickly, not capturing complete thoughts
- Compounding effect: In a single conversational turn with 10 segments, if 8 segments have their first word clipped, the transcription is missing 8 words total
- Critical impact: This severely degrades transcription accuracy
Root Cause
Current VAD tuning parameters are too aggressive for natural speech patterns. The Silero VAD v5 configuration needs adjustment to:
- Better capture the start of speech (leading edge)
- Allow more time before declaring end-of-speech
- Provide more tolerance for brief pauses within continuous speech
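To make the interaction between these parameters concrete, here is a minimal sketch of frame-based segmentation. It is a simplified model written for this issue, not the code in src/vad or in the VAD library: `segmentFrames`, `Segment`, and the synthetic probability trace are hypothetical, and only the five parameter names and the High Sensitivity values come from the presets listed in this issue.

```typescript
// Simplified model of frame-based VAD segmentation (an illustration only).
interface VADParams {
  positiveSpeechThreshold: number; // probability at or above this starts/continues speech
  negativeSpeechThreshold: number; // probability below this counts toward end-of-speech
  redemptionFrames: number;        // low frames tolerated before the segment is closed
  minSpeechFrames: number;         // segments with fewer speech frames are discarded
  preSpeechPadFrames: number;      // frames prepended before the detected start
}

interface Segment { start: number; end: number } // frame indices, end exclusive

function segmentFrames(probs: number[], p: VADParams): Segment[] {
  const segments: Segment[] = [];
  let speaking = false;
  let start = 0;
  let speechFrames = 0;
  let lowFrames = 0;

  for (let i = 0; i < probs.length; i++) {
    const prob = probs[i];
    if (!speaking) {
      if (prob >= p.positiveSpeechThreshold) {
        speaking = true;
        start = Math.max(0, i - p.preSpeechPadFrames); // reach back for the leading edge
        speechFrames = 1;
        lowFrames = 0;
      }
      continue;
    }
    if (prob >= p.positiveSpeechThreshold) {
      speechFrames++;
      lowFrames = 0; // speech resumed, so the pause is forgiven
    } else if (prob < p.negativeSpeechThreshold) {
      lowFrames++;
      if (lowFrames >= p.redemptionFrames) {
        if (speechFrames >= p.minSpeechFrames) segments.push({ start, end: i + 1 });
        speaking = false;
      }
    }
  }
  if (speaking && speechFrames >= p.minSpeechFrames) {
    segments.push({ start, end: probs.length });
  }
  return segments;
}

// Synthetic trace: soft onset (6 frames), speech (4), pause (16), speech (3), silence (25).
const repeat = (value: number, n: number): number[] => new Array<number>(n).fill(value);
const trace = [...repeat(0.3, 6), ...repeat(0.9, 4), ...repeat(0.1, 16), ...repeat(0.9, 3), ...repeat(0.1, 25)];

// High Sensitivity values as listed in this issue.
const highSensitivity: VADParams = {
  positiveSpeechThreshold: 0.35, negativeSpeechThreshold: 0.2,
  redemptionFrames: 12, minSpeechFrames: 2, preSpeechPadFrames: 3,
};
// Hypothetical candidate with more padding and a longer redemption window.
const candidate: VADParams = { ...highSensitivity, redemptionFrames: 18, preSpeechPadFrames: 6 };

const currentSegments = segmentFrames(trace, highSensitivity);
const candidateSegments = segmentFrames(trace, candidate);
console.log(currentSegments, candidateSegments);
```

On this trace the current values drop the first three soft-onset frames and split the utterance at the 16-frame pause, while the candidate keeps the onset and yields a single segment. Real audio will behave less cleanly, so this only illustrates the direction of each parameter.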
Current VAD Parameters
From src/vad/VADConfigs.ts:
High Sensitivity (most commonly used for dictation):
positiveSpeechThreshold: 0.35,
negativeSpeechThreshold: 0.2,
redemptionFrames: 12,
minSpeechFrames: 2,
preSpeechPadFrames: 3,
Balanced (default):
positiveSpeechThreshold: 0.4,
negativeSpeechThreshold: 0.25,
redemptionFrames: 10,
minSpeechFrames: 3,
preSpeechPadFrames: 2,
Recommended Investigation
- Increase preSpeechPadFrames: Currently 2-3 frames. Try 5-8 frames to capture more leading speech
- Increase redemptionFrames: Currently 10-12 frames. Try 15-20 frames for longer tail capture and tolerance for brief pauses
- Adjust negativeSpeechThreshold: Currently 0.2-0.25. Try 0.15-0.2 to avoid premature segment ends
- Test with real audio: Use actual conversation recordings to validate parameter changes
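Frame counts are easier to reason about as durations. The sketch below converts the current and proposed values to milliseconds. It assumes 16 kHz audio and 512-sample frames (32 ms per frame), which is the usual framing for Silero VAD v5 but has not been verified against this project's configuration; `framesToMs` and both constants are hypothetical names.

```typescript
// Assumptions (not confirmed against src/vad): 16 kHz input, 512 samples per frame.
const SAMPLE_RATE_HZ = 16000;
const FRAME_SAMPLES = 512;

// Duration covered by a given number of VAD frames, in milliseconds.
function framesToMs(frames: number): number {
  return (frames * FRAME_SAMPLES * 1000) / SAMPLE_RATE_HZ;
}

const durations = {
  preSpeechPadCurrent: [framesToMs(2), framesToMs(3)],    // 64 ms, 96 ms
  preSpeechPadProposed: [framesToMs(5), framesToMs(8)],   // 160 ms, 256 ms
  redemptionCurrent: [framesToMs(10), framesToMs(12)],    // 320 ms, 384 ms
  redemptionProposed: [framesToMs(15), framesToMs(20)],   // 480 ms, 640 ms
};
console.log(durations);
```

Under those assumptions the current pre-speech padding covers under 100 ms, which is shorter than many word onsets, and the current redemption window closes a segment after roughly a third of a second of silence.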
Testing Approach
- Enable keepSegments config to persist captured audio segments
- Record test dictation sessions with known content
- Compare captured WAV files against expected speech
- Measure:
  - Number of segments per utterance
  - Percentage of segments with leading/trailing clipping
  - Word accuracy before/after parameter tuning
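The last two measurements could be computed along these lines. This is a sketch under stated assumptions: `SegmentLabel` is a hypothetical manual annotation made while listening to the saved WAV files, and word accuracy is approximated by word error rate (word-level edit distance divided by reference length), which is a common choice but not something this issue prescribes.

```typescript
// Hypothetical hand-made annotation for one captured segment.
interface SegmentLabel {
  leadingClipped: boolean;
  trailingClipped: boolean;
}

// Lowercase, strip punctuation, split on whitespace.
function tokenize(text: string): string[] {
  return text.toLowerCase().replace(/[^a-z0-9'\s]/g, "").split(/\s+/).filter(Boolean);
}

// Word error rate: word-level Levenshtein distance divided by reference word count.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;

  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // word missing from the transcript
        curr[j - 1] + 1,    // extra word in the transcript
        prev[j - 1] + cost  // match or substitution
      );
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// Fraction of segments with clipping at either edge.
function clippingRate(labels: SegmentLabel[]): number {
  if (labels.length === 0) return 0;
  const clipped = labels.filter((l) => l.leadingClipped || l.trailingClipped).length;
  return clipped / labels.length;
}

// Example: one clipped leading word out of five reference words.
const wer = wordErrorRate("Please send the report today.", "send the report today");

// Example mirroring the symptom described above: 8 of 10 segments lose their first word.
const labels: SegmentLabel[] = Array.from({ length: 10 }, (_, i) => ({
  leadingClipped: i < 8,
  trailingClipped: false,
}));
const rate = clippingRate(labels);
console.log({ wer, rate });
```

Running the same recordings through both the current and the candidate parameters and comparing these two numbers gives a before/after view of the tuning.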
Related Code
- src/vad/VADConfigs.ts - Parameter presets
- src/offscreen/vad_handler.ts - VAD initialization
- src/state-machines/AudioInputMachine.ts - Audio segment capture
Priority
Critical - This issue directly impacts core transcription accuracy and user experience.
Discovered during testing of dual-phase transcription feature (#256)