Feat/openvoice pipeline stabilization #2
Open
NonMundaneDev wants to merge 14 commits into main from
Conversation
- controlvc.py: default device to None (auto-detect CUDA/CPU) instead of hardcoded "cuda"
- model_embedding_vae.py: parameterize input_dim (default 256) so the VAE works with non-ControlVC backends like Vec2Wav2 (1024-dim)
- anonymizer.py: restore vae_input_dim/vae_latent_dim params; pull clip_threshold from the wrapper if available
- examples/controlvc_extract_commonvoice.py: replace hardcoded /home/jnear paths with argparse CLI args
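The None-default device pattern is a common PyTorch idiom. A minimal sketch — the function name is hypothetical, not the actual controlvc.py API, and the `cuda_available` parameter stands in for `torch.cuda.is_available()` to keep the sketch dependency-free:

```python
def resolve_device(device=None, cuda_available=False):
    """Resolve a device string, defaulting to auto-detection.

    In the real code the availability flag would come from
    torch.cuda.is_available(); it is a plain parameter here only
    so the sketch runs without torch installed.
    """
    if device is not None:
        return device  # caller explicitly chose a device; respect it
    return "cuda" if cuda_available else "cpu"
```

An explicit `device="cpu"` still wins on a CUDA machine, which matters for debugging and CI.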
Rewrites the VAE with a GELU/LayerNorm architecture and label-aware training so the first K latent dims can be forced to match style features (e.g. happy, whisper, sad) during training and overridden at inference via control_features.

Core changes:
- model_embedding_vae.py: new encoder/decoder with the reparameterization trick, control_features support, clip thresholds
- utils.py: train_autoencoder gains a labels dict and configurable lr for label-aware MSE loss on the leading latent dims
- anonymizer.py: switch to the vae_config dict pattern, pass control_features
- wrapper.py/controlvc.py/openvoice.py: add get_vae_config() to the base class and both backends, auto-detect CPU/CUDA device
- pyproject.toml: add runtime deps (torchaudio, librosa, soundfile, etc.)

New example scripts for the Expresso dataset workflow:
- controlvc_extract_expresso.py: extract embeddings + one-hot style labels
- controlvc_train_vae_expresso.py: train the controllable VAE on extracted data
- controlvc_infer_controllable.py: CLI for style-controlled anonymization
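The changes above can be sketched as follows. This is a hedged illustration of the described design, not the actual model_embedding_vae.py: class name, layer sizes, and the single-hidden-layer encoder/decoder are all assumptions.

```python
import torch
import torch.nn as nn

class ControllableVAE(nn.Module):
    """Sketch of a label-aware VAE whose first K latent dims carry style."""

    def __init__(self, input_dim=256, latent_dim=16, hidden_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.LayerNorm(hidden_dim), nn.GELU(),
        )
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.LayerNorm(hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def forward(self, x, control_features=None):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z differentiably from N(mu, sigma^2).
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        if control_features is not None:
            # Override the first K latent dims with the requested style features.
            k = control_features.shape[-1]
            z = torch.cat([control_features, z[..., k:]], dim=-1)
        return self.decoder(z), mu, logvar
```

During training, the label-aware MSE term would pull those leading latent dims toward the one-hot style labels, so that the same dims accept control_features at inference.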
These directories contain generated artifacts (extracted embeddings and audio output files) that should not be tracked in version control.
Adapts the ControlVC extraction pipeline for OpenVoice, which embeds F0/prosody in its speaker embedding. This should enable the controllable VAE to produce audible style differences (unlike ControlVC's D_VECTOR, which doesn't encode style).
…t loading

- extract_embedding() now calls extract_se() directly instead of get_se(), which ran VAD-based splitting that failed on short Expresso utterances (<10s). This fixes the 75% skip rate (was 15/20 failures, now 0/8712 failures).
- The extraction script uses pandas for reliable parquet loading, bypassing HuggingFace datasets library issues with an incomplete cache.
- Export OpenVoiceWrapper from __init__.py.
OpenVoice's converter expects [1, 256, 1] shaped embeddings, but the VAE outputs [1, 256]. Auto-unsqueeze in inference() to bridge the gap.
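The shape bridge can be as small as one conditional. A sketch — the helper name is hypothetical; the commit does this inside inference():

```python
import torch

def to_converter_shape(se):
    """Bridge VAE output [1, 256] to the converter's expected [1, 256, 1]."""
    if se.dim() == 2:
        se = se.unsqueeze(-1)  # append a trailing singleton dimension
    return se
```

Already-3D embeddings pass through untouched, so the bridge is safe to apply unconditionally.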
91 speakers × 6 emotions (anger, disgust, fear, happy, neutral, sad). By default extracts one sample per speaker per emotion (546 samples) to maximize speaker diversity for VAE training.
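As a sanity check on the counts, and to illustrate the one-hot label layout (label ordering assumed to match the list above):

```python
EMOTIONS = ["anger", "disgust", "fear", "happy", "neutral", "sad"]

def one_hot_label(emotion):
    """One-hot style label over the six CREMA-D emotions."""
    vec = [0.0] * len(EMOTIONS)
    vec[EMOTIONS.index(emotion)] = 1.0
    return vec

# One sample per (speaker, emotion) pair maximizes speaker diversity:
n_samples = 91 * len(EMOTIONS)  # 546
```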
Combined model (v6) achieves perceptually distinct output for all 9 styles: anger, confused, disgust, enunciated, fear, happy, neutral, sad, whisper. Key insight: speaker diversity (91 CREMA-D speakers) and style richness (Expresso whisper/confused/enunciated) together produce the best results. Happy now shows 2x the F0 variation over baseline, the first time it has worked.
FINDINGS.md: 6 key findings from controllable DP-VC experiments
- Finding 1: ControlVC embeddings don't encode style; OpenVoice's do
- Finding 2: Speaker diversity (91+) is critical for learning generalizable style
- Finding 3: Style control survives DP noise (graceful degradation)
- Finding 4: Acoustic signatures match emotional expectations (9 styles)
- Finding 5: Embedding separability alone doesn't predict VAE success
- Finding 6: Style generalizes across speakers via brightness, not F0

WORKLOG.md: full roadmap (Phases 1-4), with Phase 1 complete and Phase 1.5 (CommonVoice pre-training per Joe's suggestion) planned.

source_speakers/: 4 CREMA-D clips used for diverse speaker evaluation.
…ntation

- openvoice_train_vae_combined.py: trains the controllable VAE on combined CREMA-D + Expresso embeddings (the missing step between dataset building and inference)
- openvoice_infer_controllable.py: CLI for style-controlled anonymization with --style, --all-styles, --noise-level, --style-strength
- examples/README.md: full end-to-end reproduction guide (steps 0-5), style reference table, troubleshooting, file inventory
Three new eval scripts backing the EmoVoice-style evaluation approach:
- eval_emotion.py: emotion2vec_plus_large Recall Rate + emo_sim
- eval_wer.py: Whisper-based drift-from-baseline WER (plus a fixed-reference mode)
- eval_mos.py: torchaudio SQUIM_SUBJECTIVE predicted MOS (substitutes for UTMOS)

FINDINGS.md adds per-finding methodology preambles (what the metric is, how it works, design choices, project relevance) and a Takeaway column on every results table across Findings 2, 3, 6, 7, 8, 9. Interpretations translate the numbers into plain-language implications for a reader without prior context. WORKLOG.md and examples/README.md are updated for the new scripts.
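eval_wer.py relies on Whisper transcripts for the actual scoring, but the drift-from-baseline metric itself reduces to word-level edit distance. A stdlib-only sketch (not the script's implementation):

```python
def word_error_rate(reference, hypothesis):
    """WER: word-level Levenshtein distance normalized by reference length.

    For drift-WER, `reference` is the transcript of the baseline
    (un-styled) output rather than a ground-truth transcript; in the
    fixed-reference mode it would be the known source text instead.
    """
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))  # DP row for the empty reference prefix
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # min over deletion, insertion, and substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)] / max(len(ref), 1)
```

Drift-WER near zero means styling left the words intact; it climbs as style control starts corrupting intelligibility.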
Commits the three full-corpus evaluation CSVs (emotion2vec Recall/emo_sim, Whisper drift-WER, SQUIM predicted MOS) as the primary-source artifacts behind Findings 7-9. Small enough to track (≤38KB each, 258 rows). results/README.md documents each CSV's schema and the steps to regenerate from scratch.
- pyproject.toml: new 'eval' optional-dependency extra (funasr, openai-whisper, jiwer), so 'pip install -e ".[openvoice,eval]"' provisions the full evaluation pipeline. Previously the eval deps had to be pip-installed inline from examples/README.md instructions.
- README.md: point first-time readers at the controllable pipeline (examples/README.md, FINDINGS.md, WORKLOG.md, results/). Install instructions now show the three extras. Keeps the basic DP anonymization example as the library's original entry point.
- .gitignore: add processed/ (stale tmp dirs from past runs).
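The extra would look something like this in pyproject.toml — a sketch; any version pins in the actual file are unknown:

```toml
[project.optional-dependencies]
eval = ["funasr", "openai-whisper", "jiwer"]
```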
No description provided.