feat: accuracy cross-validation + fix reflect padding #11

Merged: wavekat-eason merged 5 commits into main from feat/accuracy-cross-validation on Mar 29, 2026

Conversation

@wavekat-eason
Contributor

Summary

  • Add end-to-end cross-validation of the Rust pipeline against the Python (Pipecat) reference
  • Fix a silent preprocessing bug: STFT was using zero-padding instead of reflect-padding

What's in this PR

Cross-validation infrastructure

  • scripts/gen_reference.py — generates reference.json and *.mel.npy fixtures from the Python pipeline
  • tests/fixtures/ — committed WAV clips, reference.json, and mel tensors
  • tests/accuracy.rs — regression tests asserting Rust probability is within ±0.02 of Python
  • make accuracy — prints a per-backend markdown table of probability diffs
  • make mel — compares [80×800] mel tensors element-wise, reporting max/mean diff and outlier locations
  • .github/workflows/turn-accuracy.yml — manual workflow that runs the report and opens a PR to update the README table
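As a rough illustration of the element-wise comparison that make mel reports (max/mean diff and outlier locations), here is a minimal sketch. The names `DiffStats` and `diff_stats`, the flat row-major layout, and the outlier threshold are assumptions made for this example; they are not taken from tests/accuracy.rs.

```rust
/// Summary of an element-wise comparison between two equally shaped tensors.
/// (Hypothetical type for illustration only.)
#[derive(Debug, PartialEq)]
struct DiffStats {
    max_abs: f32,
    mean_abs: f32,
    /// (row, col) positions whose absolute difference exceeds the threshold.
    outliers: Vec<(usize, usize)>,
}

/// Compare two row-major tensors with `cols` columns each.
/// For an [80×800] mel tensor, `cols` would be 800.
fn diff_stats(rust: &[f32], reference: &[f32], cols: usize, threshold: f32) -> DiffStats {
    assert_eq!(rust.len(), reference.len(), "tensors must have the same size");
    assert!(cols > 0 && rust.len() % cols == 0, "length must be a multiple of cols");

    let mut max_abs = 0.0f32;
    let mut sum = 0.0f64; // accumulate in f64 to limit rounding error
    let mut outliers = Vec::new();
    for (i, (a, b)) in rust.iter().zip(reference).enumerate() {
        let d = (a - b).abs();
        max_abs = max_abs.max(d);
        sum += f64::from(d);
        if d > threshold {
            outliers.push((i / cols, i % cols));
        }
    }
    let mean_abs = if rust.is_empty() {
        0.0
    } else {
        (sum / rust.len() as f64) as f32
    };
    DiffStats { max_abs, mean_abs, outliers }
}
```

Reporting outlier positions as (row, col) pairs makes it easier to see whether mismatches cluster at the frame edges, which is where a padding difference would be expected to show up.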

Bug fix: reflect padding
STFT center-padding was using zeros, whereas WhisperFeatureExtractor uses np.pad(mode="reflect"). The fix drops the mel max diff from 0.087 to 0.00001 (FFT noise floor only) and the probability diffs from 0.0017/0.0138 to 0.0000/0.0051.
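To make the difference between the two padding modes concrete, the sketch below pads a short signal both ways. The function names `reflect_pad` and `zero_pad` are hypothetical and this is not the crate's actual STFT code; it only mirrors the edge convention of np.pad(mode="reflect"), where the edge sample itself is not repeated, and it assumes the pad is shorter than the signal.

```rust
/// Pad `signal` with `pad` samples on each side by mirroring around the
/// edge samples without repeating them (the convention of NumPy's
/// `np.pad(mode="reflect")`). This sketch requires `pad < signal.len()`.
fn reflect_pad(signal: &[f32], pad: usize) -> Vec<f32> {
    assert!(pad < signal.len(), "reflect padding needs pad < signal length");
    let n = signal.len();
    let mut out = Vec::with_capacity(n + 2 * pad);
    // Left side: signal[pad], signal[pad - 1], ..., signal[1]
    out.extend((1..=pad).rev().map(|i| signal[i]));
    out.extend_from_slice(signal);
    // Right side: signal[n - 2], signal[n - 3], ..., signal[n - 1 - pad]
    out.extend((1..=pad).map(|i| signal[n - 1 - i]));
    out
}

/// Zero padding, i.e. the behavior described as the bug above.
fn zero_pad(signal: &[f32], pad: usize) -> Vec<f32> {
    let mut out = vec![0.0; pad];
    out.extend_from_slice(signal);
    out.extend(std::iter::repeat(0.0).take(pad));
    out
}
```

For the input [1, 2, 3, 4, 5] with a pad of 2, `reflect_pad` yields [3, 2, 1, 2, 3, 4, 5, 4, 3] while `zero_pad` yields [0, 0, 1, 2, 3, 4, 5, 0, 0]. Only the first and last STFT frames see the padded region, which is consistent with the bug being silent: most mel frames were unaffected.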

Test plan

  • cargo test --features pipecat — all tests pass
  • make accuracy — all clips PASS within ±0.02
  • make mel — max diff ≤ 0.00002 for all clips
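The ±0.02 check behind the PASS verdicts can be sketched as follows. `within_tolerance`, `report_row`, and the column layout of the markdown row are assumptions for illustration; the actual table produced by make accuracy may look different.

```rust
/// Tolerance used in this PR's regression tests (±0.02).
const TOLERANCE: f32 = 0.02;

/// True when the Rust probability is within `tol` of the Python reference.
fn within_tolerance(rust_prob: f32, python_prob: f32, tol: f32) -> bool {
    (rust_prob - python_prob).abs() <= tol
}

/// Format one markdown table row for a clip (hypothetical column layout:
/// clip, Python probability, Rust probability, absolute diff, verdict).
fn report_row(clip: &str, python_prob: f32, rust_prob: f32) -> String {
    let diff = (rust_prob - python_prob).abs();
    let status = if within_tolerance(rust_prob, python_prob, TOLERANCE) {
        "PASS"
    } else {
        "FAIL"
    };
    format!(
        "| {} | {:.4} | {:.4} | {:.4} | {} |",
        clip, python_prob, rust_prob, diff, status
    )
}
```

A call such as `report_row("silence", 0.01, 0.0151)` (probabilities made up for this example) produces a row ending in PASS, since the difference is well inside the tolerance.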

🤖 Generated with Claude Code

wavekat-eason and others added 5 commits March 29, 2026 13:43
- Record three fixture WAV clips (silence, finished, mid-utterance)
- Add scripts/gen_reference.py to generate Python-side probabilities
- Commit tests/fixtures/reference.json as the baseline
- Add tests/accuracy.rs with per-clip regression tests and a
  multi-backend accuracy_report table (make accuracy)
- Add .github/workflows/turn-accuracy.yml to auto-update README table
- Add Accuracy section to README with benchmark-table placeholders

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix STFT center-padding: was using zeros, now uses reflect mode
  to match WhisperFeatureExtractor (np.pad mode="reflect")
  This drops mel max diff from 0.087 → 0.00001 (FFT noise floor)
  and probability diffs from [0.0017, 0.0138] → [0.0000, 0.0051]
- Save per-clip mel tensors as .npy in gen_reference.py
- Add mel_report unit test (make mel) comparing [80×800] tensors
  element-wise with max/mean diff and outlier fraction
- Fix i16→f32 normalization: 32767 → 32768 to match soundfile
- Add ndarray-npy dev dependency for .npy loading in tests
- Add make mel target

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
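The i16→f32 normalization change in the commit above can be illustrated with a minimal sketch. The function name `i16_to_f32` is hypothetical, and the claim that dividing by 32768 matches soundfile comes from the commit message rather than from this example.

```rust
/// Convert 16-bit PCM samples to f32 by dividing by 32768 (2^15).
/// This maps i16::MIN to exactly -1.0 and i16::MAX to just below 1.0.
fn i16_to_f32(samples: &[i16]) -> Vec<f32> {
    samples.iter().map(|&s| f32::from(s) / 32768.0).collect()
}
```

Dividing by 32767 instead would map i16::MAX to exactly 1.0 but push i16::MIN slightly below -1.0, and every sample would differ from the reference by a factor of 32768/32767, which is small but enough to show up in an element-wise comparison.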
mel tensors are 768 KB of opaque binary that can't be diffed.
make mel is a dev diagnostic tool — .npy files are regenerated
locally by scripts/gen_reference.py alongside reference.json.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Gate fixtures_dir, RefEntry, load_reference under #[cfg(any(feature = "pipecat"))]
  to avoid dead-code warnings in no-feature builds
- Move reference_prob and load_wav_f32 into pipecat mod (only used there)
- Suppress unused_mut on the rows accumulator in accuracy_report
- Run cargo fmt

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@wavekat-eason wavekat-eason merged commit cf2e498 into main Mar 29, 2026
5 checks passed
@wavekat-eason wavekat-eason deleted the feat/accuracy-cross-validation branch March 29, 2026 02:47