feat: accuracy cross-validation + fix reflect padding #11
Merged
wavekat-eason merged 5 commits into main on Mar 29, 2026
- Record three fixture WAV clips (silence, finished, mid-utterance)
- Add scripts/gen_reference.py to generate Python-side probabilities
- Commit tests/fixtures/reference.json as the baseline
- Add tests/accuracy.rs with per-clip regression tests and a multi-backend accuracy_report table (make accuracy)
- Add .github/workflows/turn-accuracy.yml to auto-update README table
- Add Accuracy section to README with benchmark-table placeholders

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix STFT center-padding: was using zeros, now uses reflect mode to match WhisperFeatureExtractor (np.pad mode="reflect"). This drops mel max diff from 0.087 → 0.00001 (FFT noise floor) and probability diffs from [0.0017, 0.0138] → [0.0000, 0.0051]
- Save per-clip mel tensors as .npy in gen_reference.py
- Add mel_report unit test (make mel) comparing [80×800] tensors element-wise with max/mean diff and outlier fraction
- Fix i16→f32 normalization: 32767 → 32768 to match soundfile
- Add ndarray-npy dev dependency for .npy loading in tests
- Add make mel target

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
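The two numeric fixes in this commit can be illustrated in isolation. The following is a minimal Rust sketch, not the crate's actual code: `reflect_pad` and `pcm16_to_f32` are hypothetical names, and the pad width is left as a parameter. It assumes the `np.pad(mode="reflect")` convention, which mirrors the signal around its first and last samples without repeating them.

```rust
/// Pad `signal` by `pad` samples on each side, mirroring around the edge
/// samples without repeating them (the `np.pad(mode="reflect")` convention).
/// Sketch only: requires `pad < signal.len()`.
fn reflect_pad(signal: &[f32], pad: usize) -> Vec<f32> {
    assert!(pad < signal.len(), "reflect padding needs pad < signal length");
    let n = signal.len();
    let mut out = Vec::with_capacity(n + 2 * pad);
    // Left side: signal[pad], signal[pad - 1], ..., signal[1].
    out.extend(signal[1..=pad].iter().rev());
    // The original samples, unchanged.
    out.extend_from_slice(signal);
    // Right side: signal[n - 2], signal[n - 3], ..., signal[n - 1 - pad].
    out.extend(signal[n - 1 - pad..n - 1].iter().rev());
    out
}

/// Convert 16-bit PCM samples to f32 by dividing by 32768 (2^15), so that
/// i16::MIN maps exactly to -1.0. Dividing by 32767 instead gives slightly
/// different values for every non-zero sample.
fn pcm16_to_f32(samples: &[i16]) -> Vec<f32> {
    samples.iter().map(|&s| f32::from(s) / 32768.0).collect()
}
```

For example, `reflect_pad(&[1.0, 2.0, 3.0, 4.0, 5.0], 2)` yields `[3.0, 2.0, 1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0]`, whereas zero padding would put `0.0` in the four outer positions. That difference affects only the frames that overlap the clip boundaries.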
Mel tensors are 768 KB of opaque binary that can't be diffed. make mel is a dev diagnostic tool; the .npy files are regenerated locally by scripts/gen_reference.py alongside reference.json.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Gate fixtures_dir, RefEntry, load_reference under #[cfg(any(feature = "pipecat"))] to avoid dead-code warnings in no-feature builds
- Move reference_prob and load_wav_f32 into pipecat mod (only used there)
- Suppress unused_mut on the rows accumulator in accuracy_report
- Run cargo fmt

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
What's in this PR
Cross-validation infrastructure
- `scripts/gen_reference.py`: generates `reference.json` and `*.mel.npy` fixtures from the Python pipeline
- `tests/fixtures/`: committed WAV clips, `reference.json`, and mel tensors
- `tests/accuracy.rs`: regression tests asserting Rust probability is within ±0.02 of Python
- `make accuracy`: prints a per-backend markdown table of probability diffs
- `make mel`: compares `[80×800]` mel tensors element-wise, reporting max/mean diff and outlier locations
- `.github/workflows/turn-accuracy.yml`: manual workflow that runs the report and opens a PR to update the README table

Bug fix: reflect padding
STFT center-padding was using zeros; `WhisperFeatureExtractor` uses `np.pad(mode="reflect")`. The fix drops mel max diff from 0.087 → 0.00001 (noise floor only) and probability diffs from [0.0017, 0.0138] → [0.0000, 0.0051].

Test plan
- `cargo test --features pipecat`: all tests pass
- `make accuracy`: all clips PASS within ±0.02
- `make mel`: max diff ≤ 0.00002 for all clips

🤖 Generated with Claude Code
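The checks in the test plan reduce to two comparisons: an absolute tolerance on the probability, and element-wise statistics over the mel tensors. Below is a rough Rust sketch of both. It is not the code in `tests/accuracy.rs`: `DiffStats`, `diff_stats`, `within_tolerance`, and the `outlier_threshold` parameter are hypothetical names introduced for illustration, and the tensors are assumed to be flattened into slices.

```rust
/// Summary of an element-wise comparison between two equally sized tensors.
#[derive(Debug, PartialEq)]
struct DiffStats {
    max_abs: f32,
    mean_abs: f32,
    /// Fraction of elements whose absolute diff exceeds the threshold.
    outlier_fraction: f32,
}

/// Compare a Rust-side tensor against the Python reference, element by element.
/// `outlier_threshold` is an assumed parameter, not a value taken from the PR.
fn diff_stats(rust: &[f32], reference: &[f32], outlier_threshold: f32) -> DiffStats {
    assert_eq!(rust.len(), reference.len(), "tensors must have the same length");
    assert!(!rust.is_empty(), "tensors must not be empty");
    let mut max_abs = 0.0f32;
    let mut sum = 0.0f64; // accumulate in f64 to limit rounding error
    let mut outliers = 0usize;
    for (a, b) in rust.iter().zip(reference) {
        let d = (a - b).abs();
        max_abs = max_abs.max(d);
        sum += f64::from(d);
        if d > outlier_threshold {
            outliers += 1;
        }
    }
    let n = rust.len() as f64;
    DiffStats {
        max_abs,
        mean_abs: (sum / n) as f32,
        outlier_fraction: (outliers as f64 / n) as f32,
    }
}

/// True when the Rust probability is within `tol` of the Python reference.
fn within_tolerance(rust_prob: f32, python_prob: f32, tol: f32) -> bool {
    (rust_prob - python_prob).abs() <= tol
}
```

A per-clip regression test would then assert something like `within_tolerance(rust_prob, reference_prob, 0.02)`, and a mel report would print the fields of `DiffStats` for each clip.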