Feat/openvoice pipeline stabilization #2
Open
NonMundaneDev wants to merge 14 commits into main from
Conversation
- controlvc.py: default device to None (auto-detect CUDA/CPU) instead of hardcoded "cuda"
- model_embedding_vae.py: parameterize input_dim (default 256) so the VAE works with non-ControlVC backends like Vec2Wav2 (1024-dim)
- anonymizer.py: restore vae_input_dim/vae_latent_dim params; pull clip_threshold from the wrapper if available
- examples/controlvc_extract_commonvoice.py: replace hardcoded /home/jnear paths with argparse CLI args
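The None-default device pattern is a common PyTorch idiom. A minimal sketch — the function name is hypothetical, not the actual controlvc.py API, and the `cuda_available` parameter stands in for `torch.cuda.is_available()` to keep the sketch dependency-free:

```python
def resolve_device(device=None, cuda_available=False):
    """Resolve a device string, defaulting to auto-detection.

    In the real code the availability flag would come from
    torch.cuda.is_available(); it is a plain parameter here only
    so the sketch runs without torch installed.
    """
    if device is not None:
        return device  # caller explicitly chose a device; respect it
    return "cuda" if cuda_available else "cpu"
```

An explicit `device="cpu"` still wins on a CUDA machine, which matters for debugging and CI.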
Rewrites the VAE with a GELU/LayerNorm architecture and label-aware training so the first K latent dims can be forced to match style features (e.g. happy, whisper, sad) during training and overridden at inference via control_features.

Core changes:
- model_embedding_vae.py: new encoder/decoder with the reparameterization trick, control_features support, clip thresholds
- utils.py: train_autoencoder gains a labels dict and configurable lr for label-aware MSE loss on the leading latent dims
- anonymizer.py: switch to the vae_config dict pattern, pass control_features
- wrapper.py/controlvc.py/openvoice.py: add get_vae_config() to the base class and both backends, auto-detect CPU/CUDA device
- pyproject.toml: add runtime deps (torchaudio, librosa, soundfile, etc.)

New example scripts for the Expresso dataset workflow:
- controlvc_extract_expresso.py: extract embeddings + one-hot style labels
- controlvc_train_vae_expresso.py: train the controllable VAE on extracted data
- controlvc_infer_controllable.py: CLI for style-controlled anonymization
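The changes above can be sketched as follows. This is a hedged illustration of the described design, not the actual model_embedding_vae.py: class name, layer sizes, and the single-hidden-layer encoder/decoder are all assumptions.

```python
import torch
import torch.nn as nn

class ControllableVAE(nn.Module):
    """Sketch of a label-aware VAE whose first K latent dims carry style."""

    def __init__(self, input_dim=256, latent_dim=16, hidden_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.LayerNorm(hidden_dim), nn.GELU(),
        )
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.LayerNorm(hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def forward(self, x, control_features=None):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z differentiably from N(mu, sigma^2).
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        if control_features is not None:
            # Override the first K latent dims with the requested style features.
            k = control_features.shape[-1]
            z = torch.cat([control_features, z[..., k:]], dim=-1)
        return self.decoder(z), mu, logvar
```

During training, the label-aware MSE term would pull those leading latent dims toward the one-hot style labels, so that the same dims accept control_features at inference.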
These directories contain generated artifacts (extracted embeddings and audio output files) that should not be tracked in version control.
Adapts the ControlVC extraction pipeline for OpenVoice, which embeds F0/prosody in its speaker embedding. This should enable the controllable VAE to produce audible style differences (unlike ControlVC's D_VECTOR, which doesn't encode style).
…t loading

- extract_embedding() now calls extract_se() directly instead of get_se(), which ran VAD-based splitting that failed on short Expresso utterances (<10s). This fixes the 75% skip rate (was 15/20 failures, now 0/8712 failures).
- The extraction script uses pandas for reliable parquet loading, bypassing HuggingFace datasets library issues with an incomplete cache.
- Export OpenVoiceWrapper from __init__.py.
OpenVoice's converter expects [1, 256, 1] shaped embeddings, but the VAE outputs [1, 256]. Auto-unsqueeze in inference() to bridge the gap.
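The shape bridge can be as small as one conditional. A sketch — the helper name is hypothetical; the commit does this inside inference():

```python
import torch

def to_converter_shape(se):
    """Bridge VAE output [1, 256] to the converter's expected [1, 256, 1]."""
    if se.dim() == 2:
        se = se.unsqueeze(-1)  # append a trailing singleton dimension
    return se
```

Already-3D embeddings pass through untouched, so the bridge is safe to apply unconditionally.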
91 speakers × 6 emotions (anger, disgust, fear, happy, neutral, sad). By default extracts one sample per speaker per emotion (546 samples) to maximize speaker diversity for VAE training.
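As a sanity check on the counts, and to illustrate the one-hot label layout (label ordering assumed to match the list above):

```python
EMOTIONS = ["anger", "disgust", "fear", "happy", "neutral", "sad"]

def one_hot_label(emotion):
    """One-hot style label over the six CREMA-D emotions."""
    vec = [0.0] * len(EMOTIONS)
    vec[EMOTIONS.index(emotion)] = 1.0
    return vec

# One sample per (speaker, emotion) pair maximizes speaker diversity:
n_samples = 91 * len(EMOTIONS)  # 546
```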
Combined model (v6) achieves perceptually distinct output for all 9 styles: anger, confused, disgust, enunciated, fear, happy, neutral, sad, whisper. Key insight: speaker diversity (91 CREMA-D speakers) and style richness (Expresso whisper/confused/enunciated) together produce the best results. Happy now shows 2x the F0 variation over baseline, the first time it has worked.
FINDINGS.md: 6 key findings from controllable DP-VC experiments
- Finding 1: ControlVC embeddings don't encode style; OpenVoice's do
- Finding 2: Speaker diversity (91+) is critical for learning generalizable style
- Finding 3: Style control survives DP noise (graceful degradation)
- Finding 4: Acoustic signatures match emotional expectations (9 styles)
- Finding 5: Embedding separability alone doesn't predict VAE success
- Finding 6: Style generalizes across speakers via brightness, not F0

WORKLOG.md: full roadmap (Phases 1-4), with Phase 1 complete and Phase 1.5 (CommonVoice pre-training per Joe's suggestion) planned.

source_speakers/: 4 CREMA-D clips used for diverse speaker evaluation.
…ntation

- openvoice_train_vae_combined.py: trains the controllable VAE on combined CREMA-D + Expresso embeddings (the missing step between dataset building and inference)
- openvoice_infer_controllable.py: CLI for style-controlled anonymization with --style, --all-styles, --noise-level, --style-strength
- examples/README.md: full end-to-end reproduction guide (steps 0-5), style reference table, troubleshooting, file inventory
Three new eval scripts backing the EmoVoice-style evaluation approach:
- eval_emotion.py: emotion2vec_plus_large Recall Rate + emo_sim
- eval_wer.py: Whisper-based drift-from-baseline WER (plus a fixed-reference mode)
- eval_mos.py: torchaudio SQUIM_SUBJECTIVE predicted MOS (substitutes for UTMOS)

FINDINGS.md adds per-finding methodology preambles (what the metric is, how it works, design choices, project relevance) and a Takeaway column on every results table across Findings 2, 3, 6, 7, 8, 9. Interpretations translate the numbers into plain-language implications for a reader without prior context. WORKLOG.md and examples/README.md are updated for the new scripts.
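eval_wer.py relies on Whisper transcripts for the actual scoring, but the drift-from-baseline metric itself reduces to word-level edit distance. A stdlib-only sketch (not the script's implementation):

```python
def word_error_rate(reference, hypothesis):
    """WER: word-level Levenshtein distance normalized by reference length.

    For drift-WER, `reference` is the transcript of the baseline
    (un-styled) output rather than a ground-truth transcript; in the
    fixed-reference mode it would be the known source text instead.
    """
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))  # DP row for the empty reference prefix
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # min over deletion, insertion, and substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)] / max(len(ref), 1)
```

Drift-WER near zero means styling left the words intact; it climbs as style control starts corrupting intelligibility.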
Commits the three full-corpus evaluation CSVs (emotion2vec Recall/emo_sim, Whisper drift-WER, SQUIM predicted MOS) as the primary-source artifacts behind Findings 7-9. Small enough to track (≤38KB each, 258 rows). results/README.md documents each CSV's schema and the steps to regenerate from scratch.
- pyproject.toml: new 'eval' optional-dependency extra (funasr, openai-whisper, jiwer), so 'pip install -e ".[openvoice,eval]"' provisions the full evaluation pipeline. Previously the eval deps had to be pip-installed inline from examples/README.md instructions.
- README.md: point first-time readers at the controllable pipeline (examples/README.md, FINDINGS.md, WORKLOG.md, results/). Install instructions now show the three extras. Keeps the basic DP anonymization example as the library's original entry point.
- .gitignore: add processed/ (stale tmp dirs from past runs).
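The extra would look something like this in pyproject.toml — a sketch; any version pins in the actual file are unknown:

```toml
[project.optional-dependencies]
eval = ["funasr", "openai-whisper", "jiwer"]
```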
No description provided.