Feat/openvoice pipeline stabilization #2

Open

NonMundaneDev wants to merge 14 commits into main from feat/openvoice-pipeline-stabilization
Conversation

@NonMundaneDev
Collaborator

No description provided.

- controlvc.py: default device to None (auto-detect CUDA/CPU) instead of hardcoded "cuda"; see the sketch after this list
- model_embedding_vae.py: parameterize input_dim (default 256) so VAE works with non-ControlVC backends like Vec2Wav2 (1024-dim)
- anonymizer.py: restore vae_input_dim/vae_latent_dim params; pull clip_threshold from wrapper if available
- examples/controlvc_extract_commonvoice.py: replace hardcoded /home/jnear paths with argparse CLI args
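A minimal sketch of the device and dimension changes, assuming a PyTorch module; apart from the input_dim and device parameters named above, class and layer names are illustrative:

```python
import torch

def resolve_device(device=None):
    # Auto-detect CUDA when no explicit device is given (was hardcoded "cuda").
    return device or ("cuda" if torch.cuda.is_available() else "cpu")

class EmbeddingVAE(torch.nn.Module):  # name illustrative; see model_embedding_vae.py
    def __init__(self, input_dim=256, latent_dim=16, device=None):
        super().__init__()
        # input_dim is now a parameter: 256 for ControlVC d-vectors,
        # 1024 for Vec2Wav2-style embeddings.
        self.device = resolve_device(device)
        self.encoder = torch.nn.Linear(input_dim, latent_dim).to(self.device)
```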
Rewrites the VAE with a GELU/LayerNorm architecture and label-aware training
so the first K latent dims can be forced to match style features (e.g. happy,
whisper, sad) during training and overridden at inference via control_features.

Core changes:
- model_embedding_vae.py: new encoder/decoder with reparameterization trick,
  control_features support, clip thresholds
- utils.py: train_autoencoder gains a labels dict and configurable lr for
  label-aware MSE loss on leading latent dims (sketched after this list)
- anonymizer.py: switch to vae_config dict pattern, pass control_features
- wrapper.py/controlvc.py/openvoice.py: add get_vae_config() to base and
  both backends, auto-detect CPU/CUDA device
- pyproject.toml: add runtime deps (torchaudio, librosa, soundfile, etc.)
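A sketch of the label-aware objective described above, assuming a standard Gaussian VAE; function and weight names are illustrative rather than the exact utils.train_autoencoder code:

```python
import torch
import torch.nn.functional as F

def label_aware_loss(recon, x, mu, logvar, style_labels=None,
                     beta=1.0, style_weight=1.0):
    # Standard VAE terms: reconstruction + KL divergence.
    recon_loss = F.mse_loss(recon, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    loss = recon_loss + beta * kl
    if style_labels is not None:
        # Pin the first K latent dims to the one-hot style features, so
        # they can be overridden at inference via control_features.
        k = style_labels.shape[1]
        loss = loss + style_weight * F.mse_loss(mu[:, :k], style_labels)
    return loss
```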

New example scripts for Expresso dataset workflow:
- controlvc_extract_expresso.py: extract embeddings + one-hot style labels (see the sketch after this list)
- controlvc_train_vae_expresso.py: train controllable VAE on extracted data
- controlvc_infer_controllable.py: CLI for style-controlled anonymization
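The style labels in the extraction script are plain one-hot vectors; a minimal sketch, where the style list and helper name are assumptions:

```python
import numpy as np

STYLES = ["happy", "neutral", "sad", "whisper"]  # illustrative subset

def one_hot_style(style: str) -> np.ndarray:
    # One-hot label stored alongside each extracted embedding.
    vec = np.zeros(len(STYLES), dtype=np.float32)
    vec[STYLES.index(style)] = 1.0
    return vec
```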
These directories contain generated artifacts (extracted embeddings
and audio output files) that should not be tracked in version control.

Adapts the ControlVC extraction pipeline for OpenVoice, which embeds
F0/prosody in its speaker embedding. This should enable the controllable
VAE to produce audible style differences (unlike ControlVC's D_VECTOR
which doesn't encode style).

…t loading

- extract_embedding() now calls extract_se() directly instead of get_se(),
  which ran VAD-based splitting that failed on short Expresso utterances
  (<10s); see the sketch after this list.
  This fixes the 75% skip rate (was 15/20 failures, now 0/8712 failures).
- Extraction script uses pandas for reliable parquet loading, bypassing
  HuggingFace datasets library issues with incomplete cache.
- Export OpenVoiceWrapper from __init__.py.
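A sketch of the extract_se() fix, assuming OpenVoice's ToneColorConverter API (extract_se() takes a list of reference wav paths); the surrounding wrapper code is illustrative:

```python
def extract_embedding(converter, wav_path):
    # Before: se_extractor.get_se(...) ran VAD-based splitting that failed
    # on short (<10 s) Expresso utterances, skipping most of the data.
    # After: call the converter's extract_se() directly on the raw file.
    return converter.extract_se([wav_path])
```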
OpenVoice's converter expects [1, 256, 1] shaped embeddings, but the VAE
outputs [1, 256]. Auto-unsqueeze in inference() to bridge the gap.
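A minimal sketch of that bridge; the helper name is an assumption:

```python
import torch

def to_converter_shape(se: torch.Tensor) -> torch.Tensor:
    # The VAE emits [1, 256]; OpenVoice's converter expects [1, 256, 1].
    return se.unsqueeze(-1) if se.dim() == 2 else se
```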
91 speakers × 6 emotions (anger, disgust, fear, happy, neutral, sad).
By default extracts one sample per speaker per emotion (546 samples)
to maximize speaker diversity for VAE training.
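A sketch of that sampling step, assuming a pandas metadata frame; the file and column names are assumptions:

```python
import pandas as pd

meta = pd.read_csv("crema_d_metadata.csv")  # hypothetical clip metadata

# One clip per (speaker, emotion) pair: 91 speakers x 6 emotions = 546
# samples, trading per-pair depth for maximum speaker diversity.
subset = meta.groupby(["speaker", "emotion"]).sample(n=1, random_state=0)
```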
Combined model (v6) achieves perceptually distinct output for all 9 styles:
anger, confused, disgust, enunciated, fear, happy, neutral, sad, whisper.

Key insight: speaker diversity (91 CREMA-D speakers) + style richness
(Expresso whisper/confused/enunciated) together produce the best results.
Happy now shows 2x the F0 variation of the baseline, the first time this style has worked.
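One way to measure such an F0-variation ratio with librosa (among the new runtime deps); this is a plausible proxy, not necessarily the project's exact metric, and the file names are assumptions:

```python
import librosa
import numpy as np

def f0_std(path: str) -> float:
    # Rough F0-variation proxy: standard deviation of YIN pitch estimates.
    y, sr = librosa.load(path, sr=16000)
    f0 = librosa.yin(y, fmin=librosa.note_to_hz("C2"),
                     fmax=librosa.note_to_hz("C6"), sr=sr)
    return float(np.std(f0))

# ratio = f0_std("happy_v6.wav") / f0_std("baseline.wav")  # ~2x for happy
```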
FINDINGS.md: 6 key findings from controllable DP-VC experiments
- Finding 1: ControlVC embeddings don't encode style, OpenVoice does
- Finding 2: Speaker diversity (91+) critical for learning generalizable style
- Finding 3: Style control survives DP noise (graceful degradation)
- Finding 4: Acoustic signatures match emotional expectations (9 styles)
- Finding 5: Embedding separability alone doesn't predict VAE success
- Finding 6: Style generalizes across speakers via brightness, not F0

WORKLOG.md: Full roadmap (Phases 1-4) with Phase 1 complete,
Phase 1.5 (CommonVoice pre-training per Joe's suggestion) planned.

source_speakers/: 4 CREMA-D clips used for diverse speaker evaluation.

…ntation

- openvoice_train_vae_combined.py: trains controllable VAE on combined
  CREMA-D + Expresso embeddings (the missing step between dataset
  building and inference)
- openvoice_infer_controllable.py: CLI for style-controlled anonymization
  with --style, --all-styles, --noise-level, --style-strength (usage example
  after this list)
- examples/README.md: full end-to-end reproduction guide (steps 0-5),
  style reference table, troubleshooting, file inventory
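A usage example for the inference CLI; the four style/noise flags come from the commit text, while the input/output flag names are assumptions:

```bash
python examples/openvoice_infer_controllable.py \
    --input source.wav --output anon_happy.wav \
    --style happy --noise-level 0.1 --style-strength 1.0
```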
Three new eval scripts backing the EmoVoice-style evaluation approach:
- eval_emotion.py: emotion2vec_plus_large Recall Rate + emo_sim
- eval_wer.py: Whisper-based drift-from-baseline WER, plus a fixed-reference mode (sketched after this list)
- eval_mos.py: torchaudio SQUIM_SUBJECTIVE predicted MOS (substitutes for UTMOS)
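A sketch of the drift-WER idea using openai-whisper and jiwer (both in the new eval extra); model size and file names are assumptions, and the actual eval_wer.py may differ:

```python
import whisper            # openai-whisper
from jiwer import wer

model = whisper.load_model("base")

def transcribe(path: str) -> str:
    return model.transcribe(path)["text"].strip().lower()

# Drift-from-baseline WER: the unstyled baseline output serves as the
# reference, so the metric isolates intelligibility lost to style control
# rather than absolute ASR error against ground-truth text.
drift = wer(transcribe("baseline.wav"), transcribe("styled_happy.wav"))
```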

FINDINGS.md adds per-finding methodology preambles (what the metric is, how
it works, design choices, project relevance) and a Takeaway column on every
results table across Findings 2, 3, 6, 7, 8, 9. Interpretations translate the
numbers into plain-language implications for readers without project context.

WORKLOG.md and examples/README.md updated for the new scripts.

Commits the three full-corpus evaluation CSVs (emotion2vec Recall/emo_sim,
Whisper drift-WER, SQUIM predicted MOS) as the primary-source artifacts
behind Findings 7-9. Small enough to track (≤38KB each, 258 rows).

results/README.md documents each CSV's schema and the steps to regenerate
from scratch.

- pyproject.toml: new 'eval' optional-dependency extra (funasr,
  openai-whisper, jiwer) so 'pip install -e ".[openvoice,eval]"'
  provisions the full evaluation pipeline. Previously, eval deps had to be
  installed by hand following examples/README.md instructions.
- README.md: point first-time readers at the controllable pipeline
  (examples/README.md, FINDINGS.md, WORKLOG.md, results/). Install
  instructions now show the three extras. Keeps the basic DP anonymization
  example as the library's original entry point.
- .gitignore: add processed/ (stale tmp dirs from past runs).