Training custom models in 2026: dependency fixes + Dockerized and Native solution #317

@briankelley

Description

(edited title to indicate a native script is included in the referenced repo.)

First off, thank you for building this. OpenWakeWord is the reason I have a working custom wake word system running on my machine right now instead of being stuck with a cloud service or someone else's trigger phrase. The project is genuinely useful and I'm grateful it exists.

That said, I spent the better part of a week getting the training pipeline to work on a modern Linux system (Ubuntu-based, Python 3.12 default, CUDA 12.x). The dependency stack has aged out of compatibility with current distros and package versions, and there's no documentation for what breaks or how to fix it. I wanted to share what I found in case it helps others (or you, if you ever revisit this project).

Every issue below is in the upstream dependency stack, not in your code:

  1. torch==1.13.1 - No wheels exist for Python 3.12+. Requires Python 3.10 or 3.11 (via deadsnakes PPA on Ubuntu).
  2. pyarrow - Newer versions broke the datasets library API (PyExtensionType removed). Fix: pin pyarrow<15.0.0.
  3. fsspec - Newer versions broke datasets glob patterns. Fix: pin fsspec<2024.1.0.
  4. webrtcvad - Requires C compilation but build-essential and python3.X-dev aren't listed as dependencies.
  5. python3.10-venv - Package name is version-specific on Ubuntu. Scripts that install python3-venv get the wrong one.
  6. HuggingFace .cache directories - hf_hub_download leaves .cache/ directories alongside downloaded files. The training code tries to load them as audio and crashes.
  7. MIT RIR 16khz/ subdirectory - Downloaded RIR files are nested in a subdirectory the training code doesn't expect.
  8. MIT RIR sample rate - Original files from MIT are 32kHz. Training expects 16kHz. Requires conversion with ffmpeg.
  9. Docker shared memory - PyTorch DataLoader needs --shm-size=32g or workers get killed with "No space left on device."
  10. HuggingFace rate limiting - Downloading training data as individual files (tens of thousands of requests) triggers rate limits. Solved by packaging as a single tarball.
  11. Training segfault on cleanup - train.py segfaults after saving the model. The model file is fine - the crash happens during cleanup. Harmless but scary.
  12. Python output buffering in Docker - Progress output doesn't appear without PYTHONUNBUFFERED=1.
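
Items 6 and 7 both come down to the training code globbing a download directory and trusting everything it finds. A defensive file collector sidesteps both the stray `.cache/` directories and the unexpected `16khz/` nesting. This is a hypothetical helper to illustrate the idea, not code from either repo:

```python
from pathlib import Path

AUDIO_EXTS = {".wav", ".flac", ".mp3"}

def collect_audio_files(root):
    """Recursively collect audio files, skipping HuggingFace .cache dirs.

    Walking with rglob also picks up files nested in subdirectories
    (e.g. the MIT RIR 16khz/ folder), so unexpected nesting no longer
    matters.
    """
    files = []
    for path in Path(root).rglob("*"):
        if ".cache" in path.parts:
            continue  # hf_hub_download leaves .cache/ next to the files
        if path.is_file() and path.suffix.lower() in AUDIO_EXTS:
            files.append(path)
    return sorted(files)
```

Filtering on `path.parts` (rather than the filename) catches files at any depth under a `.cache/` directory, which is where the "tries to load them as audio and crashes" failure came from.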

What I built

Rather than just documenting the fixes, I packaged the entire training pipeline both as a Docker container and as a native Linux install script, so the fragile dependency stack is frozen and isolated:

  • Dockerfile with every package pinned to known working versions (torch 1.13.1+cu117, tensorflow-cpu 2.8.1, datasets 2.14.4, etc.)
  • Interactive wrapper script that walks users through wake word selection, training settings, and launches the container
  • Custom wake word support - any phrase, not hardcoded
  • Training data hosted on HuggingFace as a single ~20GB tarball (avoids rate limiting)
  • Empirical testing data from multiple training runs comparing sample counts, augmentation rounds, neuron depth, and single vs. two-word phrases
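
The single-tarball approach (item 10 above) can be sketched like this. The repo id and filename below are placeholders, not the actual dataset coordinates, and the downloader is injectable so the extraction step can be tested offline:

```python
import tarfile
from pathlib import Path

def fetch_training_data(dest, download=None):
    """Fetch the training data as one tarball and unpack it.

    One tarball means one HTTP request instead of tens of thousands,
    which is what avoids the HuggingFace rate limits.
    """
    if download is None:
        from huggingface_hub import hf_hub_download
        download = lambda: hf_hub_download(
            repo_id="example/training-data",      # placeholder repo id
            filename="training_data.tar.gz",      # placeholder filename
            repo_type="dataset",
        )
    tarball = download()
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tarball) as tf:
        tf.extractall(dest)
    return dest
```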

The repo is here: https://github.com/briankelley/atlas-voice-training/

The Dockerfile pins your repo to commit 368c037 and piper-sample-generator to commit f1988a4.
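
For the native install path, a quick sanity check that the environment actually matches the pins can save a debugging round. This is a hypothetical helper (not part of either repo) using only the standard library; note that torch's local version tag (`+cu117`) has to be stripped before comparing:

```python
from importlib import metadata

# Pins matching the Dockerfile
PINS = {
    "torch": "1.13.1",
    "tensorflow-cpu": "2.8.1",
    "datasets": "2.14.4",
}

def check_pins(pins, get_version=metadata.version):
    """Return a list of (package, expected, found) mismatches.

    `found` is None when the package is not installed at all.
    """
    mismatches = []
    for pkg, expected in pins.items():
        try:
            found = get_version(pkg)
        except metadata.PackageNotFoundError:
            found = None
        # Ignore local version tags like "+cu117"
        if found is None or found.split("+")[0] != expected:
            mismatches.append((pkg, expected, found))
    return mismatches

if __name__ == "__main__":
    for pkg, expected, found in check_pins(PINS):
        print(f"{pkg}: expected {expected}, found {found}")
```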

Empirical findings (probably not useful to you, since the default options already produced the highest-quality models; I'm sure that wasn't an accident)

I ran 5+ training configurations on both an RTX 3060 laptop and an RTX 4090 desktop:

| Wake Word | Samples | Config | Accuracy | Recall | FP/hr |
|---|---|---|---|---|---|
| "Hey Atlas" | 50k | 2 aug, 32n | 81.10% | 62.48% | 2.12 |
| "Hey Atlas" | 100k | 2 aug, 32n | 77.47% | 55.08% | 0.62 |
| "Atlas" | 50k | 3 aug, 32n | 71.64% | 43.54% | 2.57 |
| "Atlas" | 50k | 2 aug, 64n | 71.94% | 44.04% | 2.48 |
| "Globe Master" | 50k | 2 aug, 32n | 81.07% | 62.20% | 1.24 |

Not asking for anything specific

I understand projects age and maintainers move on. I'm not requesting changes - just sharing what I ran into and what I did about it in case it saves someone else the same week of debugging. If any of this is useful to you or the project, happy to help however I can.

Thanks again for OpenWakeWord. It's a great piece of work that got me started on the implementation I'm using now.
