diff --git a/.gitignore b/.gitignore index b398e39..1b5e88a 100644 --- a/.gitignore +++ b/.gitignore @@ -24,4 +24,9 @@ __pycache__/ # Generated audio test outputs test_output*.wav -test_cli_output*.wav \ No newline at end of file +test_cli_output*.wav + +# Generated data and outputs +embeddings/ +output/ +processed/ \ No newline at end of file diff --git a/FINDINGS.md b/FINDINGS.md new file mode 100644 index 0000000..bdcfd0b --- /dev/null +++ b/FINDINGS.md @@ -0,0 +1,349 @@ +# Key Findings — Controllable DP Voice Conversion + +**Last updated:** 2026-04-17 (methodology preambles + per-row Takeaway columns added to Findings 2, 3, 6, 7, 8, 9) +**Authors:** Stephen Oladele, Joe Near + +--- + +## Thesis + +Controllable differentially private voice conversion: anonymize a speaker's identity via calibrated noise on speaker embeddings, while explicitly controlling perceptual style attributes (emotion, whisper, etc.) in the output. No prior work combines DP guarantees with explicit style controllability. + +--- + +## Finding 1: Speaker Embeddings Encode Style Only in Some VC Systems + +**ControlVC's D_VECTOR does not encode style.** +- Separability ratio: 0.88 (need >>1) +- Between-style centroid distance (0.030) < within-style variance (0.034) +- Even at 5x amplification of style latent dims, zero audible difference +- Root cause: D_VECTOR encodes speaker identity; F0/prosody are handled separately + +**OpenVoice's speaker embedding does encode style.** +- OpenVoice embeds F0/prosody profile into the speaker embedding +- Whisper achieves separability ratio 1.30 (clearly separable) +- Confirmed by Joe: "OpenVoice embeds the F0 profile in the speaker embedding" + +**Implication for the field:** Researchers must verify that their chosen VC system's speaker embedding actually carries the features they want to control. This is not guaranteed and varies across systems. + +--- + +## Finding 2: Speaker Diversity is Critical for Learning Generalizable Style + +**3 speakers (Expresso only):** VAE memorizes speaker-specific tonal patterns instead of learning style. +- Only whisper (ratio 1.30) and sad (ratio 0.39) produce audible differences +- Happy/laughing/confused indistinguishable from baseline +- Within-style variance dominated by speaker identity, not emotion + +**91 speakers (CREMA-D):** VAE forced to learn cross-speaker emotion patterns. +- All 6 emotions acoustically distinct +- Happy works for the first time (highest F0 variance) + +**94 speakers (CREMA-D + Expresso combined):** Best of both worlds. +- All 9 styles perceptually distinct and acoustically confirmed +- Happy shows 2x F0 variation over baseline + +**Progression:** + +| Model | Speakers | Working styles | Happy? | Takeaway | +|-------|----------|---------------|--------|----------| +| ControlVC + Expresso | 3 | 0 of 11 | No | Embedding doesn't carry style at all — no amount of VAE training can recover what isn't there. Dead end for ControlVC. | +| OpenVoice + Expresso | 3 | 2 of 11 (whisper, sad) | No | Right embedding, wrong training set — too few speakers, so the VAE memorizes speaker identity instead of learning generalizable style. Only the most extreme styles (whisper, sad) break through. | +| OpenVoice + CREMA-D | 91 | 6 of 6 | Yes | Speaker diversity is the unlock — with 91 speakers the VAE is forced to disentangle style from identity. Happy works for the first time. | +| OpenVoice + Combined | 94 | 9 of 9 | Yes | Combines CREMA-D's 91-speaker backbone with Expresso's unique styles (confused, enunciated, whisper). 
All 9 styles controllable. | + +--- + +## Finding 3: Style Control Survives Differential Privacy Noise + +Tested at noise levels 0, 0.1, 0.3, 0.5, and 1.0 with the combined VAE (v6, 9 styles, 94 speakers). + +**Metric — spectral centroid (brightness).** Spectral centroid is the "center of mass" of the short-time power spectrum, reported in Hz. A higher centroid means more energy sits in the high frequencies, which listeners perceive as a brighter or sharper voice (e.g., whisper, anger); a lower centroid is dimmer/darker (e.g., neutral, sad). We use deltas from the uncontrolled baseline, so units are Hz-shift relative to what OpenVoice would have produced with no style control. **Why brightness and not F0:** Finding 6 showed that brightness is the one acoustic correlate that moves consistently across source speakers; F0 does not. Brightness is therefore the metric we trust for measuring whether style control is working. + +**Noise scale.** Gaussian DP noise with standard deviation equal to `noise_level` is added to the VAE-decoded speaker embedding. `noise=0` is no privacy; `noise=1.0` is heavy privacy (formal ε TBD — see Open Questions). + +**Brightness (spectral centroid) deltas from baseline persist across noise levels:** + +| Style | noise=0 | noise=0.1 | noise=0.3 | noise=0.5 | noise=1.0 | Takeaway | +|-------|---------|-----------|-----------|-----------|-----------|----------| +| whisper | +945 | +631 | +280 | +393 | +266 | Brightest style at every noise level — whisper's high-frequency breathy signature is the most privacy-robust of any style we tested. | +| anger | +369 | +177 | -136 | -93 | -113 | Direction *flips* under noise — the bright-edge signature of anger is fragile past noise=0.3. Anger is the least noise-robust controllable style. | +| neutral | -533 | -678 | -759 | -527 | -442 | Most stable downward shift — neutral's "dim, subdued" signature stays intact across every noise level. Safe choice under strong privacy. | +| sad | -272 | -414 | -705 | -668 | -633 | Consistent downward shift like neutral — confirms sad is acoustically closer to the low-energy end than to anger/whisper. | + +**Style differences diminish but do not disappear under DP noise.** This is the expected privacy-utility tradeoff — higher noise provides stronger privacy at the cost of some style fidelity. + +**Unexpected finding:** At high noise levels (0.3+), the uncontrolled baseline becomes unintelligible (F0 collapses to 0), but style-controlled outputs retain speech-like qualities. **Style control partially rescues speech intelligibility from noise destruction.** This means controllable DP-VC produces *better* outputs than uncontrolled DP-VC at the same privacy level. 
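For concreteness, the two operations behind the table above can be sketched as follows. This is a minimal sketch, not the project's actual implementation: `add_dp_noise` and `brightness_delta` are hypothetical names, and `numpy`/`librosa` are assumed to be available. It shows (1) Gaussian noise with standard deviation `noise_level` added to the VAE-decoded speaker embedding, and (2) brightness measured as the mean spectral-centroid shift from the uncontrolled baseline.

```python
import numpy as np
import librosa


def add_dp_noise(embedding: np.ndarray, noise_level: float, rng=None) -> np.ndarray:
    """Perturb a VAE-decoded speaker embedding with isotropic Gaussian noise (std = noise_level)."""
    rng = np.random.default_rng() if rng is None else rng
    return embedding + rng.normal(0.0, noise_level, size=embedding.shape)


def brightness_delta(styled_wav: str, baseline_wav: str, sr: int = 16000) -> float:
    """Mean spectral-centroid shift (Hz) of a styled output relative to the uncontrolled baseline."""
    def mean_centroid(path: str) -> float:
        y, _ = librosa.load(path, sr=sr)
        return float(librosa.feature.spectral_centroid(y=y, sr=sr).mean())
    return mean_centroid(styled_wav) - mean_centroid(baseline_wav)
```

A positive delta corresponds to a brighter output (e.g., whisper), a negative delta to a dimmer one (e.g., neutral), matching the sign convention used in the table.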
+ +--- + +## Finding 4: Acoustic Signatures Match Emotional Expectations + +Combined model (v6) acoustic analysis, all deltas from uncontrolled baseline: + +| Style | dF0 Mean | dF0 Std | dF0 Range | dBrightness | Acoustic signature | +|-------|----------|---------|-----------|-------------|-------------------| +| anger | +13.2 | -3.9 | +4.4 | +369 | Sharp, bright, edgy | +| confused | +2.5 | -3.8 | -29.5 | +111 | Hesitant, narrow range | +| disgust | -25.6 | +6.1 | +24.3 | -156 | Low, withdrawn | +| enunciated | -10.5 | -3.9 | +5.2 | +274 | Crisp, bright articulation | +| fear | +17.4 | +4.8 | +80.8 | -155 | Tense, highly variable | +| happy | +26.8 | +33.2 | +240.7 | +84 | Most animated — 2x F0 variation | +| neutral | -46.0 | +6.7 | +38.8 | -533 | Flat, subdued, dimmest | +| sad | -0.3 | +11.2 | +44.3 | -272 | Darker, more varied | +| whisper | -128.6 | -30.8 | -143.6 | +945 | F0 near zero, breathy/airy | + +These patterns align with established prosodic correlates of emotion in the speech science literature. + +--- + +## Finding 5: Embedding Separability Alone Does Not Predict VAE Success + +CREMA-D raw embedding separability ratio (0.54) was *worse* than Expresso (0.62), yet the CREMA-D VAE produced better perceptual results. This is because: + +1. Linear centroid analysis misses nonlinear structure that the VAE encoder learns +2. Speaker diversity matters more than raw separability — with 91 speakers, the VAE has enough variation to disentangle style from identity +3. The VAE's label loss during training is a better predictor of downstream controllability than raw embedding separability + +--- + +## Finding 6: Style Generalizes Across Speakers via Brightness, Not F0 + +Tested on 5 source speakers (1 known male + 4 CREMA-D speakers with 5-8s audio) at noise=0, control_strength=5.0. + +**Spectral brightness (centroid) is direction-consistent across speakers for 7/9 styles:** + +| Style | trump | spk1007 | spk1023 | spk1045 | spk1076 | Consistent? | Takeaway | +|-------|-------|---------|---------|---------|---------|-------------|----------| +| anger | +795 | collapsed | +364 | +561 | +398 | YES | Brighter in every non-collapsed speaker — the one collapse is a speaker-specific edge case, not a control failure. | +| confused | +536 | +84 | -180 | -79 | +71 | NO | "Confused" doesn't have a single brightness signature — speakers express hesitation via different acoustic means (pause structure, pitch contour, etc.), not consistent spectral shift. | +| enunciated | +759 | -412 | -731 | -616 | -658 | NO | Sign flip between trump (+759) and all CREMA-D speakers (negative) — enunciation is likely domain-sensitive; it behaves differently on scripted CREMA-D voices than on the conversational Trump sample. | +| fear | +259 | +580 | +375 | +558 | +106 | YES | Consistently brighter — tense, elevated formants hold across all tested speakers. | +| happy | +550 | +342 | +297 | +191 | +290 | YES | Reliable — happy voices are measurably brighter in every speaker we tested. Matches Finding 4's 2× F0 variation. | +| neutral | -325 | collapsed | collapsed | -43 | -184 | YES | Direction consistent where defined, but two collapses suggest "neutral" can push already-neutral source voices into degenerate territory. | +| sad | +117 | +358 | +21 | +243 | +57 | YES | Surprisingly upward — CREMA-D's "sad" has a tense, pleading quality rather than a mellow one, so it lands as slightly *brighter*, not darker. 
| +| whisper | +973 | +956 | +350 | +720 | +758 | YES | Strongest and most consistent cross-speaker signal of any style — whisper is the one style that generalizes nearly perfectly. | + +**F0 direction is NOT consistent** — only 1/9 styles (happy) showed the same F0 shift direction across all speakers. This is expected: OpenVoice encodes timbre/tonal color in the speaker embedding, not F0 directly. F0 changes are a secondary effect that depends on how each source voice interacts with the modified embedding. + +**Collapses:** 4/45 outputs (9%) had F0=0 (unintelligible). This occurred when the combination of source speaker embedding + style control pushed the reconstruction into a degenerate region. Affects anger and neutral for spk1007, and disgust/neutral for spk1023. + +**At noise=0.1:** Brightness consistency maintained (7/9), but collapses increased to 5/45 (11%). The privacy noise makes some speakers more vulnerable to collapse. + +**Control strength tradeoff:** +- strength=5.0: 7/9 consistent, 4 collapses +- strength=3.0: 3/9 consistent, 2 collapses +- strength=2.0: 4/9 consistent, 2 collapses + +Higher control strength gives better cross-speaker consistency but increases collapse risk. This suggests that the latent dimensions need to be pushed far enough to dominate the source speaker's baseline characteristics, but this can exceed the decoder's valid input range for some speakers. + +**Joe's feedback (April 16 call):** F0 inconsistency is expected and not a problem — different people express emotion through pitch differently. F0 is not a knob we should control; the knobs are the style labels themselves (anger, happy, etc.). Brightness and F0 are measurement metrics, not user-facing controls. Joe also noted that collapses are expected when the VAE goes outside its training distribution and don't require a perfect fix. + +**Implication:** The system works across diverse speakers for the majority of styles, with spectral brightness as the reliable cross-speaker acoustic correlate. Papers should report brightness as the primary measurement metric rather than F0. + +--- + +## Finding 7: Emotion Classification Reveals a Training Gap, Not a Control Gap + +### Methodology + +**What emotion2vec_plus_large is.** A self-supervised universal speech-emotion encoder (Ma et al., ACL 2024; released as `iic/emotion2vec_plus_large` on HuggingFace, loaded here through `funasr`). It takes a raw waveform and produces (1) a fixed-dimensional emotion embedding and (2) softmax probabilities over 9 Chinese-English bilingual labels (`angry`, `disgusted`, `fearful`, `happy`, `neutral`, `sad`, `surprised`, `other`, ``). It is the current state-of-the-art universal emotion encoder, and it is what the EmoVoice paper (arxiv 2504.12867) uses to benchmark emotional TTS — so using it keeps us numerically comparable to the EmoVoice evaluation pipeline. + +**How we score each file.** For every generated `.wav` we run emotion2vec, take the argmax label, strip the bilingual prefix (e.g. `生气/angry` → `angry`), and compare it to our target style. Three of our nine styles (`confused`, `enunciated`, `whisper`) have no emotion2vec counterpart — we report emo_sim only for those, with no Recall Rate. + +**Design choice — `_plus_large` over the base model.** The `_plus` variant is fine-tuned on additional labeled emotion corpora (IEMOCAP-style), which gives higher absolute accuracy on our CREMA-D/Expresso-style inputs than the base SSL model. This is the same variant EmoVoice reports, so the numbers are directly comparable. 
+ +**Primary metric — Recall Rate.** Does the argmax predicted emotion equal the target style? This is the EmoVoice paper's primary controllability metric. Random-chance baseline over 9 classes ≈ 11%. + +**Secondary metric — emo_sim.** Cosine similarity between the emotion2vec embedding of a generated file and the embedding of the *same speaker's uncontrolled baseline*. Measures how far the style control pushes the file through emotion-embedding space, regardless of whether the push lands on the target label. Useful for (a) styles emotion2vec can't label directly, and (b) sanity-checking that "zero recall" doesn't mean "identical to baseline." + +**How this relates to the project.** Recall Rate is the headline number Joe asked us to produce before anything else (April 16 call). Getting it above chance proves the VAE controls emotion; getting it to >50% is the bar for a paper-quality result. emo_sim is our fallback signal when the classifier can't help us. + +### Results — initial 5-speaker run at control_strength=5.0, noise=0.0: + +| Style | Recall Rate | emotion2vec prediction | Takeaway | +|-------|------------|----------------------|----------| +| anger | 2/5 (40%) | mixed: `angry`, `` | Partial success — 2 samples land on target; the rest are off-distribution enough that emotion2vec refuses to label them. | +| disgust | 0/5 (0%) | mostly `disgusted` → `sad`/`` | Classifier *does* see disgust-like features but not strongly enough to clear the argmax threshold. Signal is there, calibration is off. | +| fear | 0/5 (0%) | `worried`/`happy`/other | emotion2vec's "fearful" class is narrower than our "fear" — training-data mismatch, not a control failure. | +| happy | 0/5 (0%) | `disgusted`/`` | The most striking miss. Our "happy" is acoustically real (Finding 4: 2× F0 variation) but emotion2vec labels it as disgust — the latent direction may point at sarcasm/mockery, not joy. | +| neutral | 3/5 (60%) | mostly correct | Neutral is the easiest category for any classifier and the smallest perturbation — expected ceiling. | +| sad | 1/5 (20%) | `neutral`/`disgusted` | Our sad comes out flat enough to read as neutral — our acoustic "sad" doesn't match emotion2vec's prototype. | +| **Overall** | **6/30 (20%)** | | Roughly 2× random chance, well short of the >50% target the paper needs. | + +**Full-corpus run (258 files, 27 speaker-variant configs, strength sweep included):** + +Reproduced via `python examples/eval_emotion.py --input output/diverse_speakers/ --out output/eval_emotion_full.csv`. + +| Style | Recall Rate | Mean emo_sim | emo_sim range | Takeaway | +|-------|-------------|--------------|---------------|----------| +| anger | 8/27 (30%) | 0.937 | 0.851–0.998 | Best recall of any emotion — anger has a recognizable acoustic signature (sharp, bright) that survives the CREMA-D → wild-audio transfer. | +| disgust | 3/27 (11%) | 0.970 | 0.939–0.994 | Recall ≈ random chance — acoustic signature exists but is too subtle for emotion2vec to pick out reliably. | +| fear | 0/27 (0%) | 0.960 | 0.893–0.998 | Zero recall despite high emo_sim: we *are* perturbing the embedding, but not in a direction emotion2vec recognizes as fear. Training-distribution mismatch. | +| happy | 0/27 (0%) | 0.961 | 0.849–0.999 | Same pattern — happy is acoustically real (Finding 4) but classifier-invisible. The strongest evidence that recall is a training-coverage gap, not an architecture gap. 
| +| neutral | 18/27 (67%) | 0.976 | 0.889–0.998 | Strongest category — validates that the pipeline works end-to-end when the target's acoustic signature is broad enough for emotion2vec. | +| sad | 7/27 (26%) | 0.966 | 0.898–0.998 | Above chance, below ceiling — similar training-gap story as disgust. | +| **Overall** | **36/162 (22.2%)** | — | — | Consistent with the 20% 5-speaker run — not a sample-size problem. Motivates CommonVoice pre-training (Phase 1.5). | +| confused | n/a (no e2v class) | 0.943 | 0.888–0.990 | Embedding is clearly shifted from baseline — confused is a real style even without a recall number. | +| enunciated | n/a (no e2v class) | 0.959 | 0.925–0.995 | Least embedding-disruptive of the non-emotional styles — closest to baseline in emotion-space. | +| whisper | n/a (no e2v class) | **0.875** | **0.616–0.994** | Most embedding-disruptive style of any kind. Validates emo_sim as a useful probe for non-labeled controls: it reveals that whisper genuinely perturbs the emotional signature, even though no classifier class exists for it. | + +Per-style emo_sim ranges and the 20% vs 22.2% difference are both consistent with a training-coverage issue rather than a measurement artifact. + +**Control strength test on a single speaker (Trump), noise=0.0:** + +| Style | s=5.0 | s=10.0 | s=20.0 | Takeaway | +|-------|-------|--------|--------|----------| +| anger | `` | `` | `` | Strength doesn't help — the Trump-speaker "anger" region is simply outside emotion2vec's label space at every intensity. | +| disgust | `neutral` | `neutral` | `disgusted` ✓ | Strength *does* help — disgust reaches target at s=20. Latent direction is correct but weakly encoded; pushing harder works. | +| fear | `disgusted` | `disgusted` | `disgusted` | Latent direction is *wrong* (not weak) — pushing harder just makes it "more disgusted." A strength dial cannot fix a direction error. | +| happy | `disgusted` | `disgusted` | `disgusted` | Same story as fear — the axis we labeled "happy" points into emotion2vec's disgust region. Evidence the VAE learned *something*, but mislabeled. | +| neutral | `sad` | `sad` | `sad` | Odd — controlled "neutral" reads as sad. Likely the label encoding pulls toward low-energy regions during training. | +| sad | `neutral` | `sad` ✓ | `sad` ✓ | Recoverable with strength — sad is weakly encoded but directionally correct. | + +**Interpretation of the strength sweep:** the three outcomes — "strength fixes it" (disgust, sad), "strength doesn't help" (anger), "strength makes it more wrong" (fear, happy) — tell us exactly where the training gap lives. Anger/fear/happy need *better training data*, not a bigger strength dial. This is the direct motivation for Phase 1.5 (CommonVoice pre-training). + +**Interpretation:** Increasing control strength does help some styles (sad, disgust reach their target at higher values), but doesn't fix anger, happy, or fear. This is not a strength problem — the latent dimensions for those styles don't map cleanly to the emotion2vec categories the classifier was trained on. The VAE learned acoustic features for emotion, but the specific acoustic signature it produces for e.g. "angry" may differ from what emotion2vec considers "angry." + +**Root cause hypothesis:** The model was trained on CREMA-D (scripted emotional speech, 91 speakers) and Expresso (expressive storytelling speech, 3 speakers). Both datasets have different recording conditions from natural conversational speech. 
The emotion representations the VAE learns may be dataset-specific rather than universal. Training on a larger and more diverse base (e.g., CommonVoice for acoustic pre-training) should improve cross-domain generalisation. + +**emo_sim scores (0.87–0.97) are consistently high** — every output is emotionally coherent and close in embedding space to baseline speech. This is expected: we're controlling a few latent dimensions, not rewriting the full embedding. + +**Whisper is the most embedding-disruptive style (new observation, April 17).** Across all 15 whisper outputs, emo_sim mean = 0.875 and min = 0.616 — substantially lower than every other style. This validates emo_sim as a secondary signal for styles that emotion2vec cannot classify directly (whisper, confused, enunciated): even when Recall Rate is undefined, emo_sim reveals whether the style is actually perturbing the emotional signature. Whisper perturbs it the most, which matches its acoustic profile (near-zero F0, breathy/airy texture). + +**Implication:** The evaluation pipeline is working — emotion2vec is sensitive enough to detect the failures. The next step is improving training coverage, not changing the architecture. CommonVoice pre-training (Phase 1.5) is directly motivated by this finding. + +--- + +## Finding 8: Style Control Preserves Intelligibility (WER Sanity Check) + +### Methodology + +**What WER is.** Word Error Rate is the standard ASR evaluation metric: (substitutions + deletions + insertions) / reference words, after normalization. Lower is better; 0 means the hypothesis and reference transcripts are identical. We compute WER with `jiwer`, which handles the alignment plus a normalization chain (lowercase, strip punctuation, collapse whitespace) so that `"Hello, world."` and `"hello world"` score as identical. + +**What we use as the ASR.** OpenAI Whisper `base` model — 74M params, multilingual. We chose `base` over `large` because this is a **sanity check**, not a transcription benchmark: we only need it to be sensitive enough to detect intelligibility loss caused by our control knob. `base` is fast (~real-time on CPU) and its error modes are well-understood. + +**Design choice — drift-from-baseline, not absolute WER.** For each `_