diff --git a/.gitignore b/.gitignore index b398e39..1b5e88a 100644 --- a/.gitignore +++ b/.gitignore @@ -24,4 +24,9 @@ __pycache__/ # Generated audio test outputs test_output*.wav -test_cli_output*.wav \ No newline at end of file +test_cli_output*.wav + +# Generated data and outputs +embeddings/ +output/ +processed/ \ No newline at end of file diff --git a/FINDINGS.md b/FINDINGS.md new file mode 100644 index 0000000..bdcfd0b --- /dev/null +++ b/FINDINGS.md @@ -0,0 +1,349 @@ +# Key Findings — Controllable DP Voice Conversion + +**Last updated:** 2026-04-17 (methodology preambles + per-row Takeaway columns added to Findings 2, 3, 6, 7, 8, 9) +**Authors:** Stephen Oladele, Joe Near + +--- + +## Thesis + +Controllable differentially private voice conversion: anonymize a speaker's identity via calibrated noise on speaker embeddings, while explicitly controlling perceptual style attributes (emotion, whisper, etc.) in the output. No prior work combines DP guarantees with explicit style controllability. + +--- + +## Finding 1: Speaker Embeddings Encode Style Only in Some VC Systems + +**ControlVC's D_VECTOR does not encode style.** +- Separability ratio: 0.88 (need >>1) +- Between-style centroid distance (0.030) < within-style variance (0.034) +- Even at 5x amplification of style latent dims, zero audible difference +- Root cause: D_VECTOR encodes speaker identity; F0/prosody are handled separately + +**OpenVoice's speaker embedding does encode style.** +- OpenVoice embeds F0/prosody profile into the speaker embedding +- Whisper achieves separability ratio 1.30 (clearly separable) +- Confirmed by Joe: "OpenVoice embeds the F0 profile in the speaker embedding" + +**Implication for the field:** Researchers must verify that their chosen VC system's speaker embedding actually carries the features they want to control. This is not guaranteed and varies across systems. + +--- + +## Finding 2: Speaker Diversity is Critical for Learning Generalizable Style + +**3 speakers (Expresso only):** VAE memorizes speaker-specific tonal patterns instead of learning style. +- Only whisper (ratio 1.30) and sad (ratio 0.39) produce audible differences +- Happy/laughing/confused indistinguishable from baseline +- Within-style variance dominated by speaker identity, not emotion + +**91 speakers (CREMA-D):** VAE forced to learn cross-speaker emotion patterns. +- All 6 emotions acoustically distinct +- Happy works for the first time (highest F0 variance) + +**94 speakers (CREMA-D + Expresso combined):** Best of both worlds. +- All 9 styles perceptually distinct and acoustically confirmed +- Happy shows 2x F0 variation over baseline + +**Progression:** + +| Model | Speakers | Working styles | Happy? | Takeaway | +|-------|----------|---------------|--------|----------| +| ControlVC + Expresso | 3 | 0 of 11 | No | Embedding doesn't carry style at all — no amount of VAE training can recover what isn't there. Dead end for ControlVC. | +| OpenVoice + Expresso | 3 | 2 of 11 (whisper, sad) | No | Right embedding, wrong training set — too few speakers, so the VAE memorizes speaker identity instead of learning generalizable style. Only the most extreme styles (whisper, sad) break through. | +| OpenVoice + CREMA-D | 91 | 6 of 6 | Yes | Speaker diversity is the unlock — with 91 speakers the VAE is forced to disentangle style from identity. Happy works for the first time. | +| OpenVoice + Combined | 94 | 9 of 9 | Yes | Combines CREMA-D's 91-speaker backbone with Expresso's unique styles (confused, enunciated, whisper). 
All 9 styles controllable. | + +--- + +## Finding 3: Style Control Survives Differential Privacy Noise + +Tested at noise levels 0, 0.1, 0.3, 0.5, and 1.0 with the combined VAE (v6, 9 styles, 94 speakers). + +**Metric — spectral centroid (brightness).** Spectral centroid is the "center of mass" of the short-time power spectrum, reported in Hz. A higher centroid means more energy sits in the high frequencies, which listeners perceive as a brighter or sharper voice (e.g., whisper, anger); a lower centroid is dimmer/darker (e.g., neutral, sad). We use deltas from the uncontrolled baseline, so units are Hz-shift relative to what OpenVoice would have produced with no style control. **Why brightness and not F0:** Finding 6 showed that brightness is the one acoustic correlate that moves consistently across source speakers; F0 does not. Brightness is therefore the metric we trust for measuring whether style control is working. + +**Noise scale.** Gaussian DP noise with standard deviation equal to `noise_level` is added to the VAE-decoded speaker embedding. `noise=0` is no privacy; `noise=1.0` is heavy privacy (formal ε TBD — see Open Questions). + +**Brightness (spectral centroid) deltas from baseline persist across noise levels:** + +| Style | noise=0 | noise=0.1 | noise=0.3 | noise=0.5 | noise=1.0 | Takeaway | +|-------|---------|-----------|-----------|-----------|-----------|----------| +| whisper | +945 | +631 | +280 | +393 | +266 | Brightest style at every noise level — whisper's high-frequency breathy signature is the most privacy-robust of any style we tested. | +| anger | +369 | +177 | -136 | -93 | -113 | Direction *flips* under noise — the bright-edge signature of anger is fragile past noise=0.3. Anger is the least noise-robust controllable style. | +| neutral | -533 | -678 | -759 | -527 | -442 | Most stable downward shift — neutral's "dim, subdued" signature stays intact across every noise level. Safe choice under strong privacy. | +| sad | -272 | -414 | -705 | -668 | -633 | Consistent downward shift like neutral — confirms sad is acoustically closer to the low-energy end than to anger/whisper. | + +**Style differences diminish but do not disappear under DP noise.** This is the expected privacy-utility tradeoff — higher noise provides stronger privacy at the cost of some style fidelity. + +**Unexpected finding:** At high noise levels (0.3+), the uncontrolled baseline becomes unintelligible (F0 collapses to 0), but style-controlled outputs retain speech-like qualities. **Style control partially rescues speech intelligibility from noise destruction.** This means controllable DP-VC produces *better* outputs than uncontrolled DP-VC at the same privacy level. 
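For concreteness, the two operations behind the table above can be sketched as follows. This is a minimal sketch, not the project's actual implementation: `add_dp_noise` and `brightness_delta` are hypothetical names, and `numpy`/`librosa` are assumed to be available. It shows (1) Gaussian noise with standard deviation `noise_level` added to the VAE-decoded speaker embedding, and (2) brightness measured as the mean spectral-centroid shift from the uncontrolled baseline.

```python
import numpy as np
import librosa


def add_dp_noise(embedding: np.ndarray, noise_level: float, rng=None) -> np.ndarray:
    """Perturb a VAE-decoded speaker embedding with isotropic Gaussian noise (std = noise_level)."""
    rng = np.random.default_rng() if rng is None else rng
    return embedding + rng.normal(0.0, noise_level, size=embedding.shape)


def brightness_delta(styled_wav: str, baseline_wav: str, sr: int = 16000) -> float:
    """Mean spectral-centroid shift (Hz) of a styled output relative to the uncontrolled baseline."""
    def mean_centroid(path: str) -> float:
        y, _ = librosa.load(path, sr=sr)
        return float(librosa.feature.spectral_centroid(y=y, sr=sr).mean())
    return mean_centroid(styled_wav) - mean_centroid(baseline_wav)
```

A positive delta corresponds to a brighter output (e.g., whisper), a negative delta to a dimmer one (e.g., neutral), matching the sign convention used in the table.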
+ +--- + +## Finding 4: Acoustic Signatures Match Emotional Expectations + +Combined model (v6) acoustic analysis, all deltas from uncontrolled baseline: + +| Style | dF0 Mean | dF0 Std | dF0 Range | dBrightness | Acoustic signature | +|-------|----------|---------|-----------|-------------|-------------------| +| anger | +13.2 | -3.9 | +4.4 | +369 | Sharp, bright, edgy | +| confused | +2.5 | -3.8 | -29.5 | +111 | Hesitant, narrow range | +| disgust | -25.6 | +6.1 | +24.3 | -156 | Low, withdrawn | +| enunciated | -10.5 | -3.9 | +5.2 | +274 | Crisp, bright articulation | +| fear | +17.4 | +4.8 | +80.8 | -155 | Tense, highly variable | +| happy | +26.8 | +33.2 | +240.7 | +84 | Most animated — 2x F0 variation | +| neutral | -46.0 | +6.7 | +38.8 | -533 | Flat, subdued, dimmest | +| sad | -0.3 | +11.2 | +44.3 | -272 | Darker, more varied | +| whisper | -128.6 | -30.8 | -143.6 | +945 | F0 near zero, breathy/airy | + +These patterns align with established prosodic correlates of emotion in the speech science literature. + +--- + +## Finding 5: Embedding Separability Alone Does Not Predict VAE Success + +CREMA-D raw embedding separability ratio (0.54) was *worse* than Expresso (0.62), yet the CREMA-D VAE produced better perceptual results. This is because: + +1. Linear centroid analysis misses nonlinear structure that the VAE encoder learns +2. Speaker diversity matters more than raw separability — with 91 speakers, the VAE has enough variation to disentangle style from identity +3. The VAE's label loss during training is a better predictor of downstream controllability than raw embedding separability + +--- + +## Finding 6: Style Generalizes Across Speakers via Brightness, Not F0 + +Tested on 5 source speakers (1 known male + 4 CREMA-D speakers with 5-8s audio) at noise=0, control_strength=5.0. + +**Spectral brightness (centroid) is direction-consistent across speakers for 7/9 styles:** + +| Style | trump | spk1007 | spk1023 | spk1045 | spk1076 | Consistent? | Takeaway | +|-------|-------|---------|---------|---------|---------|-------------|----------| +| anger | +795 | collapsed | +364 | +561 | +398 | YES | Brighter in every non-collapsed speaker — the one collapse is a speaker-specific edge case, not a control failure. | +| confused | +536 | +84 | -180 | -79 | +71 | NO | "Confused" doesn't have a single brightness signature — speakers express hesitation via different acoustic means (pause structure, pitch contour, etc.), not consistent spectral shift. | +| enunciated | +759 | -412 | -731 | -616 | -658 | NO | Sign flip between trump (+759) and all CREMA-D speakers (negative) — enunciation is likely domain-sensitive; it behaves differently on scripted CREMA-D voices than on the conversational Trump sample. | +| fear | +259 | +580 | +375 | +558 | +106 | YES | Consistently brighter — tense, elevated formants hold across all tested speakers. | +| happy | +550 | +342 | +297 | +191 | +290 | YES | Reliable — happy voices are measurably brighter in every speaker we tested. Matches Finding 4's 2× F0 variation. | +| neutral | -325 | collapsed | collapsed | -43 | -184 | YES | Direction consistent where defined, but two collapses suggest "neutral" can push already-neutral source voices into degenerate territory. | +| sad | +117 | +358 | +21 | +243 | +57 | YES | Surprisingly upward — CREMA-D's "sad" has a tense, pleading quality rather than a mellow one, so it lands as slightly *brighter*, not darker. 
| +| whisper | +973 | +956 | +350 | +720 | +758 | YES | Strongest and most consistent cross-speaker signal of any style — whisper is the one style that generalizes nearly perfectly. | + +**F0 direction is NOT consistent** — only 1/9 styles (happy) showed the same F0 shift direction across all speakers. This is expected: OpenVoice encodes timbre/tonal color in the speaker embedding, not F0 directly. F0 changes are a secondary effect that depends on how each source voice interacts with the modified embedding. + +**Collapses:** 4/45 outputs (9%) had F0=0 (unintelligible). This occurred when the combination of source speaker embedding + style control pushed the reconstruction into a degenerate region. Affects anger and neutral for spk1007, and disgust/neutral for spk1023. + +**At noise=0.1:** Brightness consistency maintained (7/9), but collapses increased to 5/45 (11%). The privacy noise makes some speakers more vulnerable to collapse. + +**Control strength tradeoff:** +- strength=5.0: 7/9 consistent, 4 collapses +- strength=3.0: 3/9 consistent, 2 collapses +- strength=2.0: 4/9 consistent, 2 collapses + +Higher control strength gives better cross-speaker consistency but increases collapse risk. This suggests that the latent dimensions need to be pushed far enough to dominate the source speaker's baseline characteristics, but this can exceed the decoder's valid input range for some speakers. + +**Joe's feedback (April 16 call):** F0 inconsistency is expected and not a problem — different people express emotion through pitch differently. F0 is not a knob we should control; the knobs are the style labels themselves (anger, happy, etc.). Brightness and F0 are measurement metrics, not user-facing controls. Joe also noted that collapses are expected when the VAE goes outside its training distribution and don't require a perfect fix. + +**Implication:** The system works across diverse speakers for the majority of styles, with spectral brightness as the reliable cross-speaker acoustic correlate. Papers should report brightness as the primary measurement metric rather than F0. + +--- + +## Finding 7: Emotion Classification Reveals a Training Gap, Not a Control Gap + +### Methodology + +**What emotion2vec_plus_large is.** A self-supervised universal speech-emotion encoder (Ma et al., ACL 2024; released as `iic/emotion2vec_plus_large` on HuggingFace, loaded here through `funasr`). It takes a raw waveform and produces (1) a fixed-dimensional emotion embedding and (2) softmax probabilities over 9 Chinese-English bilingual labels (`angry`, `disgusted`, `fearful`, `happy`, `neutral`, `sad`, `surprised`, `other`, ``). It is the current state-of-the-art universal emotion encoder, and it is what the EmoVoice paper (arxiv 2504.12867) uses to benchmark emotional TTS — so using it keeps us numerically comparable to the EmoVoice evaluation pipeline. + +**How we score each file.** For every generated `.wav` we run emotion2vec, take the argmax label, strip the bilingual prefix (e.g. `生气/angry` → `angry`), and compare it to our target style. Three of our nine styles (`confused`, `enunciated`, `whisper`) have no emotion2vec counterpart — we report emo_sim only for those, with no Recall Rate. + +**Design choice — `_plus_large` over the base model.** The `_plus` variant is fine-tuned on additional labeled emotion corpora (IEMOCAP-style), which gives higher absolute accuracy on our CREMA-D/Expresso-style inputs than the base SSL model. This is the same variant EmoVoice reports, so the numbers are directly comparable. 
+ +**Primary metric — Recall Rate.** Does the argmax predicted emotion equal the target style? This is the EmoVoice paper's primary controllability metric. Random-chance baseline over 9 classes ≈ 11%. + +**Secondary metric — emo_sim.** Cosine similarity between the emotion2vec embedding of a generated file and the embedding of the *same speaker's uncontrolled baseline*. Measures how far the style control pushes the file through emotion-embedding space, regardless of whether the push lands on the target label. Useful for (a) styles emotion2vec can't label directly, and (b) sanity-checking that "zero recall" doesn't mean "identical to baseline." + +**How this relates to the project.** Recall Rate is the headline number Joe asked us to produce before anything else (April 16 call). Getting it above chance proves the VAE controls emotion; getting it to >50% is the bar for a paper-quality result. emo_sim is our fallback signal when the classifier can't help us. + +### Results — initial 5-speaker run at control_strength=5.0, noise=0.0: + +| Style | Recall Rate | emotion2vec prediction | Takeaway | +|-------|------------|----------------------|----------| +| anger | 2/5 (40%) | mixed: `angry`, `` | Partial success — 2 samples land on target; the rest are off-distribution enough that emotion2vec refuses to label them. | +| disgust | 0/5 (0%) | mostly `disgusted` → `sad`/`` | Classifier *does* see disgust-like features but not strongly enough to clear the argmax threshold. Signal is there, calibration is off. | +| fear | 0/5 (0%) | `worried`/`happy`/other | emotion2vec's "fearful" class is narrower than our "fear" — training-data mismatch, not a control failure. | +| happy | 0/5 (0%) | `disgusted`/`` | The most striking miss. Our "happy" is acoustically real (Finding 4: 2× F0 variation) but emotion2vec labels it as disgust — the latent direction may point at sarcasm/mockery, not joy. | +| neutral | 3/5 (60%) | mostly correct | Neutral is the easiest category for any classifier and the smallest perturbation — expected ceiling. | +| sad | 1/5 (20%) | `neutral`/`disgusted` | Our sad comes out flat enough to read as neutral — our acoustic "sad" doesn't match emotion2vec's prototype. | +| **Overall** | **6/30 (20%)** | | Roughly 2× random chance, well short of the >50% target the paper needs. | + +**Full-corpus run (258 files, 27 speaker-variant configs, strength sweep included):** + +Reproduced via `python examples/eval_emotion.py --input output/diverse_speakers/ --out output/eval_emotion_full.csv`. + +| Style | Recall Rate | Mean emo_sim | emo_sim range | Takeaway | +|-------|-------------|--------------|---------------|----------| +| anger | 8/27 (30%) | 0.937 | 0.851–0.998 | Best recall of any emotion — anger has a recognizable acoustic signature (sharp, bright) that survives the CREMA-D → wild-audio transfer. | +| disgust | 3/27 (11%) | 0.970 | 0.939–0.994 | Recall ≈ random chance — acoustic signature exists but is too subtle for emotion2vec to pick out reliably. | +| fear | 0/27 (0%) | 0.960 | 0.893–0.998 | Zero recall despite high emo_sim: we *are* perturbing the embedding, but not in a direction emotion2vec recognizes as fear. Training-distribution mismatch. | +| happy | 0/27 (0%) | 0.961 | 0.849–0.999 | Same pattern — happy is acoustically real (Finding 4) but classifier-invisible. The strongest evidence that recall is a training-coverage gap, not an architecture gap. 
| +| neutral | 18/27 (67%) | 0.976 | 0.889–0.998 | Strongest category — validates that the pipeline works end-to-end when the target's acoustic signature is broad enough for emotion2vec. | +| sad | 7/27 (26%) | 0.966 | 0.898–0.998 | Above chance, below ceiling — similar training-gap story as disgust. | +| **Overall** | **36/162 (22.2%)** | — | — | Consistent with the 20% 5-speaker run — not a sample-size problem. Motivates CommonVoice pre-training (Phase 1.5). | +| confused | n/a (no e2v class) | 0.943 | 0.888–0.990 | Embedding is clearly shifted from baseline — confused is a real style even without a recall number. | +| enunciated | n/a (no e2v class) | 0.959 | 0.925–0.995 | Least embedding-disruptive of the non-emotional styles — closest to baseline in emotion-space. | +| whisper | n/a (no e2v class) | **0.875** | **0.616–0.994** | Most embedding-disruptive style of any kind. Validates emo_sim as a useful probe for non-labeled controls: it reveals that whisper genuinely perturbs the emotional signature, even though no classifier class exists for it. | + +Per-style emo_sim ranges and the 20% vs 22.2% difference are both consistent with a training-coverage issue rather than a measurement artifact. + +**Control strength test on a single speaker (Trump), noise=0.0:** + +| Style | s=5.0 | s=10.0 | s=20.0 | Takeaway | +|-------|-------|--------|--------|----------| +| anger | `` | `` | `` | Strength doesn't help — the Trump-speaker "anger" region is simply outside emotion2vec's label space at every intensity. | +| disgust | `neutral` | `neutral` | `disgusted` ✓ | Strength *does* help — disgust reaches target at s=20. Latent direction is correct but weakly encoded; pushing harder works. | +| fear | `disgusted` | `disgusted` | `disgusted` | Latent direction is *wrong* (not weak) — pushing harder just makes it "more disgusted." A strength dial cannot fix a direction error. | +| happy | `disgusted` | `disgusted` | `disgusted` | Same story as fear — the axis we labeled "happy" points into emotion2vec's disgust region. Evidence the VAE learned *something*, but mislabeled. | +| neutral | `sad` | `sad` | `sad` | Odd — controlled "neutral" reads as sad. Likely the label encoding pulls toward low-energy regions during training. | +| sad | `neutral` | `sad` ✓ | `sad` ✓ | Recoverable with strength — sad is weakly encoded but directionally correct. | + +**Interpretation of the strength sweep:** the three outcomes — "strength fixes it" (disgust, sad), "strength doesn't help" (anger), "strength makes it more wrong" (fear, happy) — tell us exactly where the training gap lives. Anger/fear/happy need *better training data*, not a bigger strength dial. This is the direct motivation for Phase 1.5 (CommonVoice pre-training). + +**Interpretation:** Increasing control strength does help some styles (sad, disgust reach their target at higher values), but doesn't fix anger, happy, or fear. This is not a strength problem — the latent dimensions for those styles don't map cleanly to the emotion2vec categories the classifier was trained on. The VAE learned acoustic features for emotion, but the specific acoustic signature it produces for e.g. "angry" may differ from what emotion2vec considers "angry." + +**Root cause hypothesis:** The model was trained on CREMA-D (scripted emotional speech, 91 speakers) and Expresso (expressive storytelling speech, 3 speakers). Both datasets have different recording conditions from natural conversational speech. 
The emotion representations the VAE learns may be dataset-specific rather than universal. Training on a larger and more diverse base (e.g., CommonVoice for acoustic pre-training) should improve cross-domain generalisation. + +**emo_sim scores (0.87–0.97) are consistently high** — every output is emotionally coherent and close in embedding space to baseline speech. This is expected: we're controlling a few latent dimensions, not rewriting the full embedding. + +**Whisper is the most embedding-disruptive style (new observation, April 17).** Across all 15 whisper outputs, emo_sim mean = 0.875 and min = 0.616 — substantially lower than every other style. This validates emo_sim as a secondary signal for styles that emotion2vec cannot classify directly (whisper, confused, enunciated): even when Recall Rate is undefined, emo_sim reveals whether the style is actually perturbing the emotional signature. Whisper perturbs it the most, which matches its acoustic profile (near-zero F0, breathy/airy texture). + +**Implication:** The evaluation pipeline is working — emotion2vec is sensitive enough to detect the failures. The next step is improving training coverage, not changing the architecture. CommonVoice pre-training (Phase 1.5) is directly motivated by this finding. + +--- + +## Finding 8: Style Control Preserves Intelligibility (WER Sanity Check) + +### Methodology + +**What WER is.** Word Error Rate is the standard ASR evaluation metric: (substitutions + deletions + insertions) / reference words, after normalization. Lower is better; 0 means the hypothesis and reference transcripts are identical. We compute WER with `jiwer`, which handles the alignment plus a normalization chain (lowercase, strip punctuation, collapse whitespace) so that `"Hello, world."` and `"hello world"` score as identical. + +**What we use as the ASR.** OpenAI Whisper `base` model — 74M params, multilingual. We chose `base` over `large` because this is a **sanity check**, not a transcription benchmark: we only need it to be sensitive enough to detect intelligibility loss caused by our control knob. `base` is fast (~real-time on CPU) and its error modes are well-understood. + +**Design choice — drift-from-baseline, not absolute WER.** For each `_