Comprehensive reference for the vsep audio stem separation library. This document covers the Python API, command-line interface (CLI), configuration system, ensemble system, remote API client, and architecture comparison.
- Overview
- Separator Class
- CLI Reference
- Configuration Reference
- Ensemble System
- Remote API Client
- Architecture Comparison
vsep is an AI-powered audio stem separator that supports multiple neural network architectures for splitting audio into individual components such as vocals, drums, bass, and other instruments. It provides both a Python API for programmatic use and a command-line interface for batch processing.
| Architecture | Description | Model Format | Backend |
|---|---|---|---|
| MDX-Net | Open-unmix based architecture using multi-band decomposition | .onnx | ONNX Runtime |
| VR Band Split | Vision-Roadmap band-split RNN model | .onnx | ONNX Runtime |
| Demucs v4 | Facebook Research hybrid transformer model (v4 only) | .th + .yaml | PyTorch |
| MDXC / Roformer | MDX23C and Roformer attention-based models | .ckpt + .yaml | PyTorch |
from separator import Separator
# Create a separator instance with default model
separator = Separator(output_dir="./output", output_format="FLAC")
# Load the default model (downloads automatically on first use)
separator.load_model("model_bs_roformer_ep_317_sdr_12.9755.ckpt")
# Separate an audio file
output_files = separator.separate("song.mp3")
print(f"Output files: {output_files}")

# CLI quick start
python utils/cli.py song.mp3 -m model_bs_roformer_ep_317_sdr_12.9755.ckpt --output_format FLAC

For more details on installation and usage, see the main README.
The Separator class is the primary entry point for audio separation. It manages model loading, hardware configuration, and the separation pipeline.
Import:
from separator import Separator

Separator(
log_level=logging.INFO,
log_formatter=None,
model_file_dir="/tmp/audio-separator-models/",
output_dir=None,
output_format="WAV",
output_bitrate=None,
normalization_threshold=0.9,
amplification_threshold=0.0,
output_single_stem=None,
invert_using_spec=False,
sample_rate=44100,
use_soundfile=False,
use_autocast=False,
use_directml=False,
chunk_duration=None,
mdx_params={...},
vr_params={...},
demucs_params={...},
mdxc_params={...},
ensemble_algorithm=None,
ensemble_weights=None,
ensemble_preset=None,
info_only=False,
)

| Parameter | Type | Default | Description |
|---|---|---|---|
| log_level | int | logging.INFO | Logging level (e.g., logging.DEBUG, logging.INFO, logging.WARNING) |
| log_formatter | logging.Formatter | None | Custom log formatter. If None, uses %(asctime)s - %(levelname)s - %(module)s - %(message)s |
| model_file_dir | str | "/tmp/audio-separator-models/" | Directory where model files are stored. Overridden by the VSEP_MODEL_DIR or AUDIO_SEPARATOR_MODEL_DIR environment variable if set |
| output_dir | str or None | None | Directory for output files. If None, uses the current working directory |
| output_format | str | "WAV" | Output audio format. Common values: "WAV", "FLAC", "MP3", "OGG" |
| output_bitrate | str or None | None | Output bitrate for lossy formats (e.g., "320k" for MP3). Only used when output_format is a lossy format |
| normalization_threshold | float | 0.9 | Max peak amplitude to normalize audio to. Must be in range (0, 1] |
| amplification_threshold | float | 0.0 | Min peak amplitude to amplify audio to. Must be in range [0, 1]. Disabled by default |
| output_single_stem | str or None | None | If set, only output this stem (e.g., "Instrumental", "Vocals", "Drums") |
| invert_using_spec | bool | False | If True, invert the secondary stem using spectrogram instead of waveform. Slightly slower but may improve quality |
| sample_rate | int | 44100 | Output sample rate in Hz. Must be a positive integer less than 12,800,000 |
| use_soundfile | bool | False | If True, use soundfile for audio writing instead of pydub. Can help with OOM issues |
| use_autocast | bool | False | If True, use PyTorch autocast for faster inference. Do not use for CPU inference |
| use_directml | bool | False | If True, attempt to use DirectML acceleration (Windows only, requires torch_directml package) |
| chunk_duration | float or None | None | Split audio into chunks of this duration in seconds. Chunks are concatenated without overlap/crossfade. Useful for processing very long audio files on systems with limited memory |
| ensemble_algorithm | str or None | None | Algorithm for ensembling multiple models. Defaults to "avg_wave" if not set. See Ensemble System for all options |
| ensemble_weights | list or None | None | Per-model weights for ensembling. Must match the number of models. Equal weights used if None |
| ensemble_preset | str or None | None | Named ensemble preset (e.g., "vocal_balanced"). Presets define models, algorithm, and optional weights |
| info_only | bool | False | If True, skip hardware setup and initialization logging. Useful for listing models without loading GPU |
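How normalization_threshold and amplification_threshold interact can be pictured with a small sketch. This is an illustration, not the library's code: the helper name `apply_peak_thresholds` is hypothetical, audio is a plain list of floats, and it assumes normalization is checked first and that an amplification threshold of 0.0 leaves quiet audio untouched, as the table describes.

```python
def apply_peak_thresholds(samples, normalization_threshold=0.9, amplification_threshold=0.0):
    """Scale samples so the peak respects both thresholds (hypothetical helper)."""
    if not 0 < normalization_threshold <= 1:
        raise ValueError("normalization_threshold must be in (0, 1]")
    if not 0 <= amplification_threshold <= 1:
        raise ValueError("amplification_threshold must be in [0, 1]")

    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale

    if peak > normalization_threshold:
        gain = normalization_threshold / peak   # too loud: scale down
    elif peak < amplification_threshold:
        gain = amplification_threshold / peak   # too quiet: scale up
    else:
        gain = 1.0
    return [s * gain for s in samples]
```

With the defaults, a signal peaking at 1.0 is scaled down to peak at 0.9, while a signal peaking at 0.2 is returned unchanged.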
These parameters are passed as dictionaries to the constructor:
MDX Parameters (mdx_params)

| Key | Type | Default | Description |
|---|---|---|---|
| hop_length | int | 1024 | Hop length for STFT. Usually called stride in neural networks |
| segment_size | int | 256 | Segment size for processing. Larger consumes more resources but may give better results |
| overlap | float | 0.25 | Amount of overlap between prediction windows, range 0.001-0.999. Higher is better but slower |
| batch_size | int | 1 | Batch size for processing. Larger consumes more RAM but may be slightly faster |
| enable_denoise | bool | False | Enable denoising during separation |
VR Parameters (vr_params)

| Key | Type | Default | Description |
|---|---|---|---|
| batch_size | int | 1 | Number of batches to process at a time |
| window_size | int | 512 | Window size. Balance quality and speed: 1024 = fast but lower quality, 320 = slower but better |
| aggression | int | 5 | Intensity of primary stem extraction, range -100 to 100. Typically 5 for vocals and instrumentals |
| enable_tta | bool | False | Enable Test-Time Augmentation. Slow but improves quality |
| enable_post_process | bool | False | Identify leftover artifacts within vocal output. May improve separation for some songs |
| post_process_threshold | float | 0.2 | Threshold for post-processing feature, range 0.1-0.3 |
| high_end_process | bool | False | Mirror the missing frequency range of the output |
Demucs Parameters (demucs_params)

| Key | Type | Default | Description |
|---|---|---|---|
| segment_size | str | "Default" | Size of segments for processing, range 1-100. Higher = slower but better quality |
| shifts | int | 2 | Number of predictions with random shifts. Higher = slower but better quality |
| overlap | float | 0.25 | Overlap between prediction windows, range 0.001-0.999. Higher = slower but better quality |
| segments_enabled | bool | True | Enable segment-wise processing |
MDXC Parameters (mdxc_params)

| Key | Type | Default | Description |
|---|---|---|---|
| segment_size | int | 256 | Segment size for processing. Larger consumes more resources but may give better results |
| override_model_segment_size | bool | False | Override the model's default segment size instead of using the value stored in the model config |
| batch_size | int | 1 | Batch size for processing. Larger consumes more RAM but may be slightly faster |
| overlap | int | 8 | Overlap between prediction windows, range 2-50. Higher is better but slower |
| pitch_shift | int | 0 | Shift audio pitch by this many semitones while processing. May improve output for deep/high vocals |

| Variable | Description |
|---|---|
| VSEP_MODEL_DIR | Override model_file_dir parameter. Path to model storage directory |
| AUDIO_SEPARATOR_MODEL_DIR | Legacy equivalent of VSEP_MODEL_DIR (still supported) |
Perform audio source separation on one or more audio files.
def separate(self, audio_file_path, custom_output_names=None) -> list[str]

| Parameter | Type | Default | Description |
|---|---|---|---|
| audio_file_path | str or list[str] | required | Path to an audio file, a directory of audio files, or a list of paths |
| custom_output_names | dict[str, str] or None | None | Mapping of stem names to custom output filenames (e.g., {"Vocals": "my_vocals"}) |
Returns: list[str] -- List of file paths to the separated audio stem files.
Raises: ValueError if model not loaded or initialization failed.
Supported audio formats: .wav, .flac, .mp3, .ogg, .opus, .m4a, .aiff, .ac3
When audio_file_path is a directory, all audio files within it (recursively) are processed.
When chunk_duration is set and the file exceeds that duration, the audio is automatically split into chunks, processed separately, and merged back together.
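The chunking behaviour described above can be sketched in a few lines. This is a simplified illustration under assumptions, not the library's implementation: the helper names `chunk_bounds` and `process_in_chunks` are hypothetical, audio is a plain list of samples, and `process` stands in for whatever separation step runs on each chunk.

```python
def chunk_bounds(num_samples, sample_rate, chunk_duration):
    """Return (start, end) sample index pairs covering the audio without overlap."""
    chunk_len = int(chunk_duration * sample_rate)
    if chunk_len <= 0:
        raise ValueError("chunk_duration must cover at least one sample")
    return [(start, min(start + chunk_len, num_samples))
            for start in range(0, num_samples, chunk_len)]


def process_in_chunks(samples, sample_rate, chunk_duration, process):
    """Apply `process` to each chunk and concatenate the results in order."""
    if chunk_duration is None or len(samples) <= chunk_duration * sample_rate:
        return process(samples)  # short enough: no chunking needed
    merged = []
    for start, end in chunk_bounds(len(samples), sample_rate, chunk_duration):
        merged.extend(process(samples[start:end]))  # plain concatenation, no crossfade
    return merged
```

Because the chunks are joined without overlap or crossfade, any per-chunk processing that depends on surrounding context may behave differently near chunk boundaries.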
# Separate a single file
output_files = separator.separate("input/song.mp3")
# Separate multiple files
output_files = separator.separate(["song1.mp3", "song2.flac"])
# Separate a directory of audio files
output_files = separator.separate("input/album/")
# With custom output names
output_files = separator.separate(
"song.mp3",
custom_output_names={"Vocals": "lead_vocal", "Instrumental": "backing_track"}
)

Download (if needed) and load a separation model into memory.

def load_model(self, model_filename="model_bs_roformer_ep_317_sdr_12.9755.ckpt") -> None

| Parameter | Type | Default | Description |
|---|---|---|---|
| model_filename | str or list[str] | "model_bs_roformer_ep_317_sdr_12.9755.ckpt" | Model filename or list of model filenames for ensembling |
Returns: None
Raises: ValueError if model file not found in supported models list, or model type not supported. Exception if using Demucs with Python < 3.10.
When a list of filenames is provided (more than one model), the separator operates in ensemble mode. Each model is loaded and run sequentially, and the results are combined using the configured ensemble algorithm.
If an ensemble preset was configured and no explicit model list is given, the preset's models are used automatically.
# Load default model
separator.load_model()
# Load a specific model
separator.load_model("MDX23C-8KFFT-InstVoc_HQ.ckpt")
# Load multiple models for ensembling
separator.load_model([
"model_bs_roformer_ep_317_sdr_12.9755.ckpt",
"MDX23C-8KFFT-InstVoc_HQ.ckpt"
])

Download model files and associated data without loading the model into memory.

def download_model_and_data(self, model_filename) -> None

| Parameter | Type | Default | Description |
|---|---|---|---|
| model_filename | str | required | Filename of the model to download |
Returns: None
# Pre-download a model for later use
separator.download_model_and_data("htdemucs_ft.yaml")

Download the model files for a given model filename.

def download_model_files(self, model_filename) -> tuple

| Parameter | Type | Default | Description |
|---|---|---|---|
| model_filename | str | required | Filename of the model to download |
Returns: tuple[str, str, str, str, str or None] -- A tuple of (model_filename, model_type, model_friendly_name, model_path, yaml_config_filename).
- model_type is one of "MDX", "VR", "Demucs", or "MDXC"
- yaml_config_filename is None for MDX and VR models (which use hash-based parameter lookup)
Raises: ValueError if the model filename is not found in the supported model list.
Files are downloaded in parallel (up to 4 concurrent workers) with automatic fallback to an alternate repository if the primary source fails.
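A minimal sketch of this download pattern, using only the standard library, might look as follows. The function names and the `fetch` callable are hypothetical stand-ins (the real downloader is not shown in this document); only the worker limit of 4 and the primary-then-fallback order come from the description above.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_DOWNLOAD_WORKERS = 4  # matches the documented worker limit


def fetch_with_fallback(filename, fetch, primary_url, fallback_url):
    """Try the primary repository first; on any failure, retry against the fallback."""
    try:
        return fetch(f"{primary_url}/{filename}")
    except Exception:
        return fetch(f"{fallback_url}/{filename}")


def download_all(filenames, fetch, primary_url, fallback_url):
    """Download several files concurrently and return {filename: result}."""
    with ThreadPoolExecutor(max_workers=MAX_DOWNLOAD_WORKERS) as pool:
        futures = {
            name: pool.submit(fetch_with_fallback, name, fetch, primary_url, fallback_url)
            for name in filenames
        }
        # .result() re-raises if both the primary and the fallback attempt failed
        return {name: future.result() for name, future in futures.items()}
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library and makes the fallback path easy to exercise with a fake.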
List all supported model files with performance scores and download information.
def list_supported_model_files(self) -> dict

Returns: dict -- A nested dictionary grouped by architecture type. Each model entry contains:

| Key | Type | Description |
|---|---|---|
| filename | str | Primary model filename |
| scores | dict | Performance scores (SDR, SIR, SAR, ISR) per stem |
| stems | list[str] | List of output stems this model produces |
| target_stem | str | The primary target stem |
| download_files | list[str] | List of filenames or URLs to download |
The returned dict is keyed by architecture: "VR", "MDX", "Demucs", "MDXC".
models = separator.list_supported_model_files()
for arch_type, arch_models in models.items():
    print(f"\n{arch_type} Models:")
    for name, info in arch_models.items():
        print(f"  {name}: {info['filename']}")
        if info['scores']:
            for stem, scores in info['scores'].items():
                print(f"    {stem} SDR: {scores.get('SDR', 'N/A')}")

List all available ensemble presets.

def list_ensemble_presets(self) -> dict

Returns: dict -- A dictionary mapping preset IDs to their full preset data. Each preset contains name, description, models, algorithm, weights (optional), and contributor.
presets = separator.list_ensemble_presets()
for preset_id, preset in presets.items():
    print(f"{preset_id}: {preset['description']} "
          f"({len(preset['models'])} models, algorithm: {preset['algorithm']})")

Calculate the MD5 hash of a model file.

def get_model_hash(self, model_path) -> str

| Parameter | Type | Default | Description |
|---|---|---|---|
| model_path | str | required | Path to the model file |
Returns: str -- The MD5 hash of the model file (hex digest).
For files larger than 10 MB, only the last 10 MB are hashed (seeking to the end minus 10 MB). This is the same hashing strategy used by UVR to identify model parameters.
Raises: FileNotFoundError if the model file does not exist.
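The tail-hashing strategy can be sketched with the standard library. This is an illustration rather than the library's code: the function name `tail_md5` is hypothetical, and the exact byte count used for "10 MB" (here 10 * 1024 * 1024) is an assumption.

```python
import hashlib
import os

TAIL_BYTES = 10 * 1024 * 1024  # assumed size of the hashed tail ("10 MB")


def tail_md5(model_path, tail_bytes=TAIL_BYTES):
    """Return the MD5 hex digest of the file, or of its last `tail_bytes` if it is larger."""
    file_size = os.path.getsize(model_path)  # raises FileNotFoundError if missing
    with open(model_path, "rb") as f:
        if file_size > tail_bytes:
            f.seek(file_size - tail_bytes)  # skip everything except the tail
        return hashlib.md5(f.read()).hexdigest()
```

Hashing only the tail keeps the lookup fast for multi-gigabyte checkpoints while still distinguishing models whose final bytes differ.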
hash_value = separator.get_model_hash("/tmp/vsep-models/model.ckpt")
print(f"Model hash: {hash_value}")

Configure hardware acceleration for PyTorch and ONNX Runtime.

def setup_accelerated_inferencing_device(self) -> None

Returns: None
This method is called automatically during initialization (unless info_only=True). It probes the system for available acceleration backends in this order:
- CUDA (NVIDIA GPU) -- sets torch_device to "cuda" and ONNX provider to CUDAExecutionProvider
- MPS/CoreML (Apple Silicon, ARM only) -- sets torch_device to "mps" and ONNX provider to CoreMLExecutionProvider
- DirectML (Windows, if use_directml=True) -- sets torch_device to DirectML device and ONNX provider to DmlExecutionProvider
- CPU (fallback) -- sets torch_device to "cpu" and ONNX provider to CPUExecutionProvider
After calling this method, self.torch_device and self.onnx_execution_provider are populated with the selected device and provider.
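The probe order amounts to a simple priority chain. The sketch below models it as a pure function so the ordering is easy to see. It is an assumption-laden illustration: `select_device` and its boolean inputs are hypothetical, and the string "directml" is a placeholder for the DirectML device object the real method would store.

```python
def select_device(has_cuda, has_mps, has_directml, use_directml=False):
    """Return (torch_device, onnx_provider) following the documented probe order."""
    if has_cuda:
        return "cuda", "CUDAExecutionProvider"
    if has_mps:
        return "mps", "CoreMLExecutionProvider"
    if use_directml and has_directml:
        # Placeholder label; the real code would hold a DirectML device object
        return "directml", "DmlExecutionProvider"
    return "cpu", "CPUExecutionProvider"
```

Note that DirectML is only considered when it was explicitly requested, so a Windows machine with DirectML available still falls back to CPU unless use_directml=True.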
Return a simplified, user-friendly model list with sorting and filtering.
def get_simplified_model_list(self, filter_sort_by=None) -> dict

| Parameter | Type | Default | Description |
|---|---|---|---|
| filter_sort_by | str or None | None | Sort/filter criteria: "name", "filename", or a stem name like "vocals", "drums", etc. |
Returns: dict -- Dictionary keyed by model filename, with values containing Name, Type, Stems (with SDR scores), and SDR dictionary.
When filtering by a stem name, only models that produce that stem are returned, sorted by SDR score (highest first).
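The stem filter can be pictured as "keep the models that have a score for that stem, then sort by it". The sketch below assumes a simplified data shape in which each entry's "SDR" value maps stem names to scores; that shape, the helper name `filter_by_stem`, and the case-insensitive match are assumptions made for illustration.

```python
def filter_by_stem(models, stem):
    """Keep models that produce `stem`, ordered by that stem's SDR (highest first)."""
    wanted = stem.lower()
    matches = {}
    for filename, info in models.items():
        # Normalise stem names so "Vocals" and "vocals" compare equal
        sdr_by_stem = {name.lower(): score for name, score in info["SDR"].items()}
        if wanted in sdr_by_stem:
            matches[filename] = (info, sdr_by_stem[wanted])
    ranked = sorted(matches.items(), key=lambda item: item[1][1], reverse=True)
    return {filename: info for filename, (info, _) in ranked}
```

Since Python dictionaries preserve insertion order, iterating over the result yields the best-scoring model first.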
# Get all models sorted by name
models = separator.get_simplified_model_list(filter_sort_by="name")
# Get only vocal models sorted by vocal SDR
vocal_models = separator.get_simplified_model_list(filter_sort_by="vocals")

These methods are used internally but may be useful for advanced users:

| Method | Description |
|---|---|
| get_system_info() | Log and return system information (OS, CPU, Python version) |
| check_ffmpeg_installed() | Verify FFmpeg is installed and log its version |
| log_onnxruntime_packages() | Log installed ONNX Runtime packages (GPU, Silicon, CPU, DirectML) |
| get_package_distribution(package_name) | Return package distribution object if installed, None otherwise |
| download_file_if_not_exists(url, output_path) | Download a file with resume support and progress bar |
| load_model_data_from_yaml(yaml_config_filename) | Load model parameters from a YAML config file |
| load_model_data_using_hash(model_path) | Load model parameters by computing file hash and looking up in UVR data |
The CLI is invoked via python utils/cli.py and provides full access to all separation capabilities from the command line.
usage: vsep [-h] [audio_files ...]
| Argument | Short | Type | Default | Description |
|---|---|---|---|---|
| --version | -v | flag | -- | Show program version number and exit |
| --debug | -d | flag | False | Enable debug logging (equivalent to --log_level=debug) |
| --env_info | -e | flag | False | Print environment information and exit |
| --list_models | -l | flag | False | List all supported models and exit |
| --log_level | -- | str | "info" | Log level: info, debug, warning |
| --list_filter | -- | str | None | Filter/sort model list by name, filename, or any stem name |
| --list_limit | -- | int | None | Limit the number of models shown |
| --list_format | -- | str | "pretty" | Format for listing models: pretty or json |
# List all models as JSON
python utils/cli.py --list_models --list_format json
# Show top 10 vocal models sorted by SDR
python utils/cli.py --list_models --list_filter vocals --list_limit 10
# Print environment info (GPU, ONNX Runtime, etc.)
python utils/cli.py --env_info

| Argument | Short | Type | Default | Description |
|---|---|---|---|---|
| --model_filename | -m | str | "model_bs_roformer_ep_317_sdr_12.9755.ckpt" | Primary model to use for separation |
| --extra_models | -- | list[str] | None | Additional models for ensembling. Requires -m for the primary model |
| --output_format | -- | str | "FLAC" | Output format: WAV, FLAC, MP3, OGG, etc. |
| --output_bitrate | -- | str | None | Output bitrate for lossy formats (e.g., 320k) |
| --output_dir | -- | str | None | Output directory (default: current directory) |
| --model_file_dir | -- | str | "/tmp/vsep-models/" | Model files directory (overridden by VSEP_MODEL_DIR env var) |
| --download_model_only | -- | flag | False | Download model only, without performing separation |
# Download a model for later use
python utils/cli.py --download_model_only -m MDX23C-8KFFT-InstVoc_HQ.ckpt
# Specify output directory and format
python utils/cli.py song.mp3 -m model.ckpt --output_dir ./separated --output_format MP3 --output_bitrate 320k

| Argument | Type | Default | Description |
|---|---|---|---|
| --invert_spect | flag | False | Invert secondary stem using spectrogram |
| --normalization | float | 0.9 | Max peak amplitude for normalization (range: 0-1) |
| --amplification | float | 0.0 | Min peak amplitude for amplification (range: 0-1) |
| --single_stem | str | None | Output only a single stem: Instrumental, Vocals, Drums, Bass, Guitar, Piano, Other |
| --sample_rate | int | 44100 | Output sample rate in Hz |
| --use_soundfile | flag | False | Use soundfile for audio output (can help with OOM) |
| --use_autocast | flag | False | Use PyTorch autocast for faster inference (GPU only) |
| --chunk_duration | float | None | Split into chunks of N seconds (e.g., 600 for 10-min chunks) |
| --ensemble_algorithm | str | None | Ensemble algorithm. Choices: avg_wave, median_wave, min_wave, max_wave, avg_fft, median_fft, min_fft, max_fft, uvr_max_spec, uvr_min_spec, ensemble_wav |
| --ensemble_weights | list[float] | None | Per-model weights for ensembling (must match model count) |
| --ensemble_preset | str | None | Named ensemble preset (e.g., vocal_balanced, karaoke) |
| --list_presets | flag | False | List all available ensemble presets and exit |
| --custom_output_names | JSON str | None | Custom names for output files (e.g., '{"Vocals": "my_vocals"}') |
# Extract only vocals
python utils/cli.py song.mp3 --single_stem Vocals
# Use ensemble with preset
python utils/cli.py song.mp3 --ensemble_preset vocal_balanced
# Use ensemble with custom models and weights
python utils/cli.py song.mp3 -m model1.ckpt --extra_models model2.onnx model3.ckpt \
    --ensemble_algorithm avg_wave --ensemble_weights 1.0 0.5 0.5

| Argument | Type | Default | Description |
|---|---|---|---|
| --mdx_segment_size | int | 256 | Segment size. Larger = more resources, potentially better results |
| --mdx_overlap | float | 0.25 | Overlap between prediction windows (0.001-0.999). Higher = slower but better |
| --mdx_batch_size | int | 1 | Batch size. Larger = more RAM, slightly faster |
| --mdx_hop_length | int | 1024 | Hop length (stride). Only change if you know what you are doing |
| --mdx_enable_denoise | flag | False | Enable denoising during separation |
python utils/cli.py song.mp3 -m MDX-Net_Model.onnx --mdx_segment_size 512 --mdx_enable_denoise

| Argument | Type | Default | Description |
|---|---|---|---|
| --vr_batch_size | int | 1 | Batches to process at a time |
| --vr_window_size | int | 512 | Window size: 1024 = fast/low quality, 320 = slow/high quality |
| --vr_aggression | int | 5 | Intensity of primary stem extraction (-100 to 100) |
| --vr_enable_tta | flag | False | Enable Test-Time Augmentation (slow but better quality) |
| --vr_high_end_process | flag | False | Mirror the missing frequency range |
| --vr_enable_post_process | flag | False | Identify leftover artifacts in vocal output |
| --vr_post_process_threshold | float | 0.2 | Post-process threshold (0.1-0.3) |
python utils/cli.py song.mp3 -m VR_Model.onnx --vr_aggression 2 --vr_window_size 320

| Argument | Type | Default | Description |
|---|---|---|---|
| --demucs_segment_size | str | "Default" | Segment size (1-100). Higher = slower but better quality |
| --demucs_shifts | int | 2 | Number of predictions with random shifts. Higher = slower but better |
| --demucs_overlap | float | 0.25 | Overlap between prediction windows (0.001-0.999) |
| --demucs_segments_enabled | bool | True | Enable segment-wise processing |
python utils/cli.py song.mp3 -m htdemucs_ft.yaml --demucs_shifts 4 --demucs_overlap 0.35

| Argument | Type | Default | Description |
|---|---|---|---|
| --mdxc_segment_size | int | 256 | Segment size. Larger = more resources, potentially better results |
| --mdxc_override_model_segment_size | flag | False | Override the model's built-in segment size |
| --mdxc_overlap | int | 8 | Overlap between prediction windows (2-50). Higher = better but slower |
| --mdxc_batch_size | int | 1 | Batch size. Larger = more RAM, slightly faster |
| --mdxc_pitch_shift | int | 0 | Shift pitch by N semitones. May help with deep/high vocals |
python utils/cli.py song.mp3 -m MDX23C-8KFFT-InstVoc_HQ.ckpt --mdxc_pitch_shift 2 --mdxc_overlap 16

Configuration is defined in config/variables.py and re-exported via config/__init__.py.

| Variable | Type | Value | Description |
|---|---|---|---|
| UVR_PUBLIC_REPO_URL | str | "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models" | Primary UVR model repository (public models) |
| UVR_VIP_REPO_URL | str | "https://github.com/Anjok0109/ai_magic/releases/download/v5" | VIP models repository (Anjok07's paid subscriber models) |
| AUDIO_SEPARATOR_REPO_URL | str | "https://github.com/nomadkaraoke/python-audio-separator/releases/download/model-configs" | vsep-specific models and config fallback repository |
| UVR_MODEL_DATA_URL_PREFIX | str | "https://raw.githubusercontent.com/TRvlvr/application_data/main" | Base URL for UVR model parameter data files |
| UVR_VR_MODEL_DATA_URL | str | "{prefix}/vr_model_data/model_data_new.json" | URL for VR model parameter lookup data |
| UVR_MDX_MODEL_DATA_URL | str | "{prefix}/mdx_model_data/model_data_new.json" | URL for MDX model parameter lookup data |

| Variable | Type | Value | Description |
|---|---|---|---|
| MDXC_YAML_PATH_PREFIX | str | "mdx_model_data/mdx_c_configs" | Path prefix for MDXC YAML config files within the UVR repository |

| Variable | Type | Value | Description |
|---|---|---|---|
| MAX_DOWNLOAD_WORKERS | int | 4 | Maximum number of parallel download threads |
| DOWNLOAD_CHUNK_SIZE | int | 262144 | Download chunk size in bytes (256 KB) |
| DOWNLOAD_TIMEOUT | int | 300 | HTTP request timeout in seconds (5 minutes) |
| HTTP_POOL_CONNECTIONS | int | 10 | Number of connection pool connections |
| HTTP_POOL_MAXSIZE | int | 10 | Maximum size of the connection pool |

| Variable | Type | Value | Description |
|---|---|---|---|
| VR_MODEL_DATA_FILENAME | str | "vr_model_data.json" | Local filename for VR model parameter data |
| MDX_MODEL_DATA_FILENAME | str | "mdx_model_data.json" | Local filename for MDX model parameter data |
Returns the appropriate repository URL based on model type.
from config import get_repo_url
# Public model URL
url = get_repo_url(is_vip=False)
# "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models"
# VIP model URL
url = get_repo_url(is_vip=True)
# "https://github.com/Anjok0109/ai_magic/releases/download/v5"

| Parameter | Type | Default | Description |
|---|---|---|---|
| is_vip | bool | False | If True, return the VIP repository URL |
Returns: str -- The repository URL.
Constructs the full URL for an MDXC YAML config file.
from config import get_mdx_yaml_url
url = get_mdx_yaml_url("model_2_stem_full_band_8k.yaml")
# "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/mdx_model_data/mdx_c_configs/model_2_stem_full_band_8k.yaml"

| Parameter | Type | Default | Description |
|---|---|---|---|
| filename | str | required | YAML config filename |
Returns: str -- Full URL to the YAML file.
Returns the fallback URL from the audio-separator repository.
from config import get_fallback_url
url = get_fallback_url("some_model.onnx")
# "https://github.com/nomadkaraoke/python-audio-separator/releases/download/model-configs/some_model.onnx"

| Parameter | Type | Default | Description |
|---|---|---|---|
| filename | str | required | Model or config filename |
Returns: str -- Fallback URL string.
The ensemble system combines the outputs of multiple separation models, which can produce better results than any single model alone. Instead of relying on one model, you run several models on the same audio and merge their outputs using a configurable algorithm.
- Multiple models are specified (either explicitly or via a preset)
- Each model separates the audio independently
- The intermediate stems are grouped by stem name (e.g., all "Vocals" outputs together)
- The grouped waveforms are merged using the selected ensemble algorithm
- The final merged stems are written as output
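The grouping and merging steps above can be sketched as follows. This is a simplified illustration, not the Ensembler class itself: the function names are hypothetical, each model's output is assumed to be a dict mapping stem names to waveforms (plain lists here), and `merge` stands in for the selected ensemble algorithm.

```python
from collections import defaultdict


def group_by_stem(per_model_outputs):
    """Collect each model's stems into {stem_name: [waveform, ...]}."""
    grouped = defaultdict(list)
    for stems in per_model_outputs:              # one dict per model
        for stem_name, waveform in stems.items():
            grouped[stem_name].append(waveform)
    return dict(grouped)


def ensemble(per_model_outputs, merge):
    """Merge every group of same-named stems with the supplied algorithm."""
    grouped = group_by_stem(per_model_outputs)
    return {stem: merge(waves) for stem, waves in grouped.items()}
```

A stem produced by only one model still passes through `merge`, which for an averaging algorithm simply returns that model's waveform unchanged.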
All 11 supported ensemble algorithms are defined in the Ensembler class (separator/ensembler.py):
| Algorithm | Category | Description | Supports Weights |
|---|---|---|---|
| avg_wave | Wave | Weighted average of waveforms in the time domain | Yes |
| median_wave | Wave | Median of waveforms (robust to outliers) | No (ignored) |
| min_wave | Wave | Element-wise minimum absolute amplitude (conservative) | No (ignored) |
| max_wave | Wave | Element-wise maximum absolute amplitude (aggressive) | No (ignored) |
| avg_fft | FFT | Weighted average of spectrograms (frequency domain) | Yes |
| median_fft | FFT | Median of the real and imaginary parts of the spectrograms, taken separately | No (ignored) |
| min_fft | FFT | Minimum magnitude spectrogram (conservative in frequency domain) | No (ignored) |
| max_fft | FFT | Maximum magnitude spectrogram (aggressive in frequency domain) | No (ignored) |
| uvr_max_spec | UVR | UVR's maximum spectrogram ensembling algorithm | No (via UVR) |
| uvr_min_spec | UVR | UVR's minimum spectrogram ensembling algorithm | No (via UVR) |
| ensemble_wav | UVR | UVR's legacy waveform ensembling algorithm | No (via UVR) |
Default algorithm: avg_wave
Wave domain algorithms operate directly on the audio waveform samples. They are generally faster than FFT-based methods because they avoid the STFT/ISTFT transform overhead.
- avg_wave: Computes a weighted sum of all model waveforms and divides by the total weight. This is the most commonly used algorithm and produces a balanced blend of all model outputs.
- median_wave: Takes the median value at each sample position across all models. This is robust to outlier models that produce very different results from the consensus.
- min_wave: At each sample position, selects the waveform with the smallest absolute amplitude. This produces a conservative output that retains only what all models agree on.
- max_wave: At each sample position, selects the waveform with the largest absolute amplitude. This is aggressive and can retain more detail but may include artifacts.
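The four wave-domain operations are small enough to sketch in pure Python. This is an illustration of the arithmetic only, not the library's implementation: waveforms are plain lists of equal length (one per model), and the function names simply mirror the algorithm names.

```python
from statistics import median


def avg_wave(waves, weights=None):
    """Weighted sum of the waveforms divided by the total weight."""
    weights = weights or [1.0] * len(waves)
    total = sum(weights)
    return [sum(w * s for w, s in zip(weights, column)) / total
            for column in zip(*waves)]


def median_wave(waves):
    """Per-sample median across models."""
    return [median(column) for column in zip(*waves)]


def min_wave(waves):
    """Per-sample value with the smallest absolute amplitude."""
    return [min(column, key=abs) for column in zip(*waves)]


def max_wave(waves):
    """Per-sample value with the largest absolute amplitude."""
    return [max(column, key=abs) for column in zip(*waves)]
```

Note that min_wave and max_wave select by absolute value but return the original signed sample, so the polarity of the chosen model's output is preserved.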
FFT domain algorithms transform waveforms to the frequency domain via STFT, perform the ensemble operation on the complex spectrograms, then convert back via ISTFT. These can be more musically accurate because they preserve phase relationships.
- avg_fft: Weighted average of complex spectrograms. Phase-aware blending.
- median_fft: Median of real and imaginary parts of the spectrograms separately.
- min_fft: Selects the spectrogram bin with the minimum magnitude at each frequency/time position.
- max_fft: Selects the spectrogram bin with the maximum magnitude at each frequency/time position.
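The per-bin operations can be sketched on already-transformed data. To stay self-contained, the sketch below skips the STFT and ISTFT steps entirely and treats each spectrogram as a flat list of complex bins of equal length; that simplification and the function bodies are assumptions for illustration, not the library's code.

```python
from statistics import median


def avg_fft(specs, weights=None):
    """Weighted average of complex bins (magnitude and phase blend together)."""
    weights = weights or [1.0] * len(specs)
    total = sum(weights)
    return [sum(w * b for w, b in zip(weights, bins)) / total for bins in zip(*specs)]


def median_fft(specs):
    """Median of the real parts and of the imaginary parts, taken separately."""
    return [complex(median(b.real for b in bins), median(b.imag for b in bins))
            for bins in zip(*specs)]


def min_fft(specs):
    """Per-bin value with the smallest magnitude."""
    return [min(bins, key=abs) for bins in zip(*specs)]


def max_fft(specs):
    """Per-bin value with the largest magnitude."""
    return [max(bins, key=abs) for bins in zip(*specs)]
```

For complex numbers, Python's built-in abs returns the magnitude, which is why the same `key=abs` selection used for waveforms also works for spectrogram bins.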
These algorithms use UVR's built-in spectrogram utilities for ensembling, originally from the Ultimate Vocal Remover project.
- uvr_max_spec: Maximum spectrogram ensembling using UVR's MAX_SPEC algorithm.
- uvr_min_spec: Minimum spectrogram ensembling using UVR's MIN_SPEC algorithm.
- ensemble_wav: Legacy UVR waveform ensembling.
Weights let you give certain models more influence in the ensemble. Only the avg_wave and avg_fft algorithms use weights; the other algorithms ignore them.
# Give model1 double the influence of model2
separator = Separator(
ensemble_algorithm="avg_wave",
ensemble_weights=[2.0, 1.0]
)
separator.load_model(["model1.ckpt", "model2.ckpt"])

# CLI equivalent
python utils/cli.py song.mp3 -m model1.ckpt --extra_models model2.ckpt \
    --ensemble_algorithm avg_wave --ensemble_weights 2.0 1.0

If weights are not specified, equal weights (all 1.0) are used. If the number of weights does not match the number of models, a warning is logged and equal weights are used instead. Non-finite weights (NaN, infinity) or weights summing to zero also trigger a fallback to equal weights.
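The fallback rules for weights can be summarised in a short validation sketch. The helper name `resolve_weights` is hypothetical and the warning that the library logs is reduced to a comment; only the four fallback conditions come from the description above.

```python
import math


def resolve_weights(weights, num_models):
    """Return usable weights, falling back to equal weights when the input is invalid."""
    equal = [1.0] * num_models
    if weights is None:
        return equal
    if len(weights) != num_models:
        return equal  # count mismatch (the library also logs a warning here)
    if not all(math.isfinite(w) for w in weights):
        return equal  # NaN or infinity present
    if sum(weights) == 0:
        return equal  # would divide by zero when averaging
    return [float(w) for w in weights]
```

Falling back rather than raising means a malformed weight list degrades to a plain average instead of aborting a long-running separation.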
Presets bundle a curated selection of models with an appropriate algorithm and optional weights. Presets are defined in ensemble_presets.json (bundled with the package).
# Use a preset
separator = Separator(ensemble_preset="vocal_balanced")
separator.load_model() # Loads the preset's models automatically
output_files = separator.separate("song.mp3")

# List available presets
python utils/cli.py --list_presets
# Use a preset
python utils/cli.py song.mp3 --ensemble_preset vocal_balanced

Each preset contains:

| Field | Type | Required | Description |
|---|---|---|---|
| name | str | Yes | Human-readable preset name |
| description | str | Yes | Brief description of the preset's purpose |
| models | list[str] | Yes | List of model filenames (minimum 2) |
| algorithm | str | Yes | Ensemble algorithm to use |
| weights | list[float] or None | No | Optional per-model weights |
| contributor | str | No | Preset author attribution |
Explicit user arguments always take priority over preset defaults. If you specify --ensemble_algorithm alongside --ensemble_preset, your explicit algorithm is used.
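The priority rule can be sketched as a simple merge in which any explicitly supplied value overrides the preset. The function name `resolve_ensemble_settings` is hypothetical, and the preset is assumed to be a dict with the fields listed in the table above.

```python
def resolve_ensemble_settings(preset, algorithm=None, weights=None, models=None):
    """Explicit arguments win; anything left unset falls back to the preset."""
    return {
        "models": models if models is not None else preset["models"],
        "algorithm": algorithm if algorithm is not None else preset["algorithm"],
        # weights are optional in a preset, so use .get() to default to None
        "weights": weights if weights is not None else preset.get("weights"),
    }
```

Comparing against None rather than relying on truthiness keeps an explicitly passed empty value from being silently replaced by the preset's.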
The AudioSeparatorAPIClient class provides a Python client for interacting with a remotely deployed vsep API server. This enables offloading separation work to a remote machine with GPU resources.
Import:
from remote.api_client import AudioSeparatorAPIClient

AudioSeparatorAPIClient(api_url: str, logger: logging.Logger)

| Parameter | Type | Description |
|---|---|---|
| api_url | str | Base URL of the remote vsep API server (e.g., "https://api.example.com") |
| logger | logging.Logger | Logger instance for client-side logging |
The client maintains a persistent requests.Session for connection pooling.
Submit an audio separation job and wait for completion. This is the primary convenience method for most use cases, handling the full workflow: upload, poll, and download.
```python
def separate_audio_and_wait(
    self,
    file_path: str,
    model: Optional[str] = None,
    models: Optional[List[str]] = None,
    preset: Optional[str] = None,
    timeout: int = 600,
    poll_interval: int = 10,
    download: bool = True,
    output_dir: Optional[str] = None,
    output_format: str = "flac",
    output_bitrate: Optional[str] = None,
    normalization_threshold: float = 0.9,
    amplification_threshold: float = 0.0,
    output_single_stem: Optional[str] = None,
    invert_using_spec: bool = False,
    sample_rate: int = 44100,
    use_soundfile: bool = False,
    use_autocast: bool = False,
    custom_output_names: Optional[Dict[str, str]] = None,
    mdx_segment_size: int = 256,
    mdx_overlap: float = 0.25,
    mdx_batch_size: int = 1,
    mdx_hop_length: int = 1024,
    mdx_enable_denoise: bool = False,
    vr_batch_size: int = 1,
    vr_window_size: int = 512,
    vr_aggression: int = 5,
    vr_enable_tta: bool = False,
    vr_high_end_process: bool = False,
    vr_enable_post_process: bool = False,
    vr_post_process_threshold: float = 0.2,
    demucs_segment_size: str = "Default",
    demucs_shifts: int = 2,
    demucs_overlap: float = 0.25,
    demucs_segments_enabled: bool = True,
    mdxc_segment_size: int = 256,
    mdxc_override_model_segment_size: bool = False,
    mdxc_overlap: int = 8,
    mdxc_batch_size: int = 1,
    mdxc_pitch_shift: int = 0,
) -> dict
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `file_path` | `str` | required | Path to the local audio file to upload |
| `model` | `str` or `None` | `None` | Single model filename |
| `models` | `list[str]` or `None` | `None` | List of model filenames for ensembling |
| `preset` | `str` or `None` | `None` | Ensemble preset name |
| `timeout` | `int` | `600` | Maximum wait time for job completion in seconds |
| `poll_interval` | `int` | `10` | Seconds between status polls |
| `download` | `bool` | `True` | Whether to automatically download result files |
| `output_dir` | `str` or `None` | `None` | Directory to save downloaded files |
Plus all architecture-specific parameters (same as the Separator class).
Returns: dict with keys:
| Key | Type | Description |
|---|---|---|
| `task_id` | `str` | The job task ID |
| `status` | `str` | "completed", "error", or "timeout" |
| `files` | `list` or `dict` | Output filenames (list) or hash-to-filename mapping (dict) |
| `downloaded_files` | `list[str]` | Local file paths of downloaded files (if download=True) |
| `error` | `str` | Error message (if status is "error" or "timeout") |
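One way to consume the returned dictionary is to treat anything other than "completed" as a failure. The helper below is a minimal sketch, not part of the client: `summarize_result` is a hypothetical name, and it relies only on the keys listed in the table above.

```python
from typing import Any, Dict, List


def summarize_result(result: Dict[str, Any]) -> List[str]:
    """Return downloaded file paths, or raise if the job did not complete.

    Hypothetical helper built on the documented result keys.
    """
    status = result.get("status")
    if status != "completed":
        # "error" and "timeout" results carry a message under the "error" key
        message = result.get("error", "no error message provided")
        raise RuntimeError(
            f"Job {result.get('task_id')} ended with status {status!r}: {message}"
        )
    # "downloaded_files" is only populated when download=True was used
    return list(result.get("downloaded_files", []))
```

Raising on failure keeps the calling code from silently continuing with an empty file list after a timeout.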
```python
import logging

from remote.api_client import AudioSeparatorAPIClient

logger = logging.getLogger(__name__)
client = AudioSeparatorAPIClient("https://my-vsep-api.example.com", logger)

result = client.separate_audio_and_wait(
    file_path="song.mp3",
    model="model_bs_roformer_ep_317_sdr_12.9755.ckpt",
    output_format="flac",
    output_dir="./output",
    timeout=300,
)
print(f"Task {result['task_id']}: {result['status']}")
print(f"Files: {result.get('downloaded_files', [])}")
```

Submit an audio separation job without waiting for completion (asynchronous).

```python
def separate_audio(self, file_path: str, ...) -> dict
```

Parameters are the same as separate_audio_and_wait() minus timeout, poll_interval, download, and output_dir.
Returns: dict with key task_id -- use this to poll status with get_job_status().
```python
result = client.separate_audio("song.mp3", model="model.ckpt")
task_id = result["task_id"]
print(f"Job submitted: {task_id}")

# Later, check status:
status = client.get_job_status(task_id)
```

Poll the status of a submitted job.

```python
def get_job_status(self, task_id: str) -> dict
```

| Parameter | Type | Description |
|---|---|---|
| `task_id` | `str` | The task ID returned by separate_audio() |
Returns: dict with keys:
| Key | Type | Description |
|---|---|---|
| `status` | `str` | "completed", "processing", "error", etc. |
| `progress` | `int` | Progress percentage (0-100) |
| `current_model_index` | `int` | Current model being processed (for ensembles) |
| `total_models` | `int` | Total models to process (for ensembles) |
| `files` | `list` or `dict` | Output files (when status is "completed") |
| `error` | `str` | Error message (when status is "error") |
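When a job is submitted with separate_audio(), the caller is responsible for polling. A manual loop might look like the sketch below. `wait_for_job` is a hypothetical helper, and it assumes that "completed" and "error" are the only terminal statuses; the table lists "etc.", so a real server may report others.

```python
import time
from typing import Any, Dict


def wait_for_job(
    client: Any, task_id: str, timeout: float = 600, poll_interval: float = 10
) -> Dict[str, Any]:
    """Poll get_job_status() until the job finishes or the timeout elapses.

    Hypothetical helper; `client` is any object exposing get_job_status(task_id).
    """
    deadline = time.monotonic() + timeout
    while True:
        status = client.get_job_status(task_id)
        # Assumption: these two statuses mean the job will not change further
        if status.get("status") in ("completed", "error"):
            return status
        if time.monotonic() >= deadline:
            return {
                "status": "timeout",
                "error": f"Job {task_id} did not finish within {timeout} seconds",
            }
        print(f"Progress: {status.get('progress', 0)}%")
        time.sleep(poll_interval)
```

Using `time.monotonic()` for the deadline keeps the timeout unaffected by system clock adjustments.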
List available models on the remote server.
```python
def list_models(self, format_type: str = "pretty", filter_by: Optional[str] = None) -> dict
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `format_type` | `str` | `"pretty"` | "pretty" for formatted text, "json" for raw JSON |
| `filter_by` | `str` or `None` | `None` | Filter/sort criteria (same as CLI --list_filter) |
Returns: dict -- If format_type="json", returns the parsed model list. If "pretty", returns {"text": "..."} with the formatted output.
Download a result file from a completed job using its hash identifier.
```python
def download_file_by_hash(self, task_id: str, file_hash: str, filename: str, output_path: Optional[str] = None) -> str
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `task_id` | `str` | required | The job task ID |
| `file_hash` | `str` | required | Hash identifier of the file (from get_job_status response) |
| `filename` | `str` | required | Original filename (used for local save path) |
| `output_path` | `str` or `None` | `None` | Local save path. If None, uses filename |
Returns: str -- The local file path of the downloaded file.
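Because `files` in a status response can be either a list of filenames or a hash-to-filename mapping, a caller that downloads results manually has to handle both shapes. The sketch below shows one way to do that. `download_all` is a hypothetical helper, and it assumes `output_path` is a full file path rather than a directory.

```python
import os
from typing import Any, Dict, List


def download_all(
    client: Any, task_id: str, status: Dict[str, Any], output_dir: str = "."
) -> List[str]:
    """Download every output file listed in a completed job's status.

    Hypothetical helper; `client` must expose download_file_by_hash()
    and download_file() with the documented signatures.
    """
    files = status.get("files") or []
    saved: List[str] = []
    if isinstance(files, dict):
        # Mapping form: {file_hash: filename}
        for file_hash, filename in files.items():
            target = os.path.join(output_dir, filename)
            saved.append(
                client.download_file_by_hash(task_id, file_hash, filename, target)
            )
    else:
        # List form: plain filenames, fetched through the filename-based method
        for filename in files:
            target = os.path.join(output_dir, filename)
            saved.append(client.download_file(task_id, filename, target))
    return saved
```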
Legacy method to download a file from a completed job by filename (backward compatibility).
```python
def download_file(self, task_id: str, filename: str, output_path: Optional[str] = None) -> str
```

This method URL-encodes the filename and uses it directly in the download URL. Prefer download_file_by_hash() for newer servers.
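To illustrate what URL-encoding a filename involves, the sketch below uses the standard library's `urllib.parse.quote`. The `/download/<task_id>/<filename>` route and the `build_download_url` helper are assumptions made for this example; the actual server route is not documented here.

```python
from urllib.parse import quote


def build_download_url(api_url: str, task_id: str, filename: str) -> str:
    """Build a filename-based download URL (the route shown is hypothetical)."""
    # safe="" also encodes "/" so the filename stays a single path segment
    encoded = quote(filename, safe="")
    return f"{api_url.rstrip('/')}/download/{task_id}/{encoded}"


url = build_download_url("https://api.example.com", "abc123", "song (Vocals).flac")
print(url)
```

Output filenames often contain spaces and parentheses, which is one reason a hash identifier is more robust than embedding the filename in the URL.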
Get the version string of the remote vsep API server.
```python
def get_server_version(self) -> str
```

Returns: str -- The server version (e.g., "0.25.0") or "unknown" if unavailable.

```python
version = client.get_server_version()
print(f"Remote server version: {version}")
```

The following table provides a detailed comparison of all four supported architectures.
| Feature | MDX-Net | VR Band Split | Demucs v4 | MDXC / Roformer |
|---|---|---|---|---|
| Full Name | Multi-Decoder X-Net | Vision-Roadmap Band Split RNN | Hybrid Transformer Demucs v4 | MDX23C / Roformer |
| Backend | ONNX Runtime | ONNX Runtime | PyTorch | PyTorch |
| Model Format | `.onnx` | `.onnx` | `.th` + `.yaml` | `.ckpt` + `.yaml` |
| Parameter Lookup | MD5 hash from UVR data | MD5 hash from UVR data | YAML config file | YAML config file |
| Typical Output Stems | 2 (vocals + instrumental) | 2 (vocals + instrumental) | 4 (vocals, drums, bass, other) | 2 (vocals + instrumental) or more |
| Min Python Version | 3.8+ | 3.8+ | 3.10+ | 3.8+ |
| GPU Acceleration | CUDA, CoreML, DirectML | CUDA, CoreML, DirectML | CUDA, MPS, DirectML | CUDA, MPS, DirectML |
| Segment Size Param | `mdx_segment_size` (int) | `vr_window_size` (int) | `demucs_segment_size` (str) | `mdxc_segment_size` (int) |
| Default Segment Size | 256 | 512 | "Default" | 256 |
| Overlap Param | `mdx_overlap` (float) | N/A | `demucs_overlap` (float) | `mdxc_overlap` (int) |
| Default Overlap | 0.25 | N/A | 0.25 | 8 |
| Batch Size | `mdx_batch_size` | `vr_batch_size` | N/A | `mdxc_batch_size` |
| Default Batch Size | 1 | 1 | N/A | 1 |
| Denoise Support | Yes (`--mdx_enable_denoise`) | No | No | No |
| Pitch Shift | No | No | No | Yes (`--mdxc_pitch_shift`) |
| TTA Support | No | Yes (`--vr_enable_tta`) | Yes (`--demucs_shifts`) | No |
| Post-Processing | No | Yes (`--vr_enable_post_process`) | No | No |
| Autocast Support | N/A (ONNX) | N/A (ONNX) | Yes | Yes |
| High End Processing | No | Yes (`--vr_high_end_process`) | No | No |
| Aggression Control | No | Yes (`--vr_aggression`, -100 to 100) | No | No |
| Model Override Segment Size | N/A | N/A | N/A | Yes (`--mdxc_override_model_segment_size`) |
| Ensemble Support | Yes | Yes | Yes | Yes |
| Chunk Duration Support | Yes | Yes | Yes | Yes |
- Best vocal quality: Roformer models (MDXC architecture) typically achieve the highest SDR scores for vocal separation
- Best multi-stem separation: Demucs v4 is the only architecture that natively produces 4+ stems (vocals, drums, bass, other)
- Fastest inference on CPU: MDX-Net and VR models using ONNX Runtime are typically faster on CPU
- Best on GPU: Demucs v4 and MDXC/Roformer models benefit most from GPU acceleration via PyTorch
- Fine-tuned control: VR architecture offers the most tunable parameters (aggression, TTA, post-processing, high end processing)
- Best overall balance: MDXC with Roformer models (e.g., `model_bs_roformer_ep_317_sdr_12.9755.ckpt`) provides the best quality-to-speed ratio for 2-stem separation
| Pattern | Architecture | Example |
|---|---|---|
| `*.onnx` | MDX-Net or VR | MDX-Net_Model.onnx, 5_HP-Karaoke-UVR.pth.onnx |
| `htdemucs*.yaml` | Demucs v4 | htdemucs_ft.yaml, htdemucs.yaml |
| `MDX23C-*.ckpt` | MDXC | MDX23C-8KFFT-InstVoc_HQ.ckpt |
| `model_*_roformer_*.ckpt` | MDXC (Roformer) | model_bs_roformer_ep_317_sdr_12.9755.ckpt |
| `Mel-Band-Roformer*.ckpt` | MDXC (Mel-Band Roformer) | Mel-Band-Roformer-Karaoke-Run1.ckpt |
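The patterns in the table above are ordinary shell-style globs, so they can be checked with Python's `fnmatch` module. The sketch below is a hypothetical helper for quick filename triage, not how vsep itself selects an architecture. A filename alone cannot tell MDX-Net from VR, since both use `.onnx`; the comparison table lists an MD5 hash lookup for those.

```python
from fnmatch import fnmatchcase
from typing import Optional

# (pattern, architecture) pairs copied from the table above.
# More specific patterns come first; "*.onnx" is the ambiguous catch-all.
FILENAME_PATTERNS = [
    ("htdemucs*.yaml", "Demucs v4"),
    ("MDX23C-*.ckpt", "MDXC"),
    ("model_*_roformer_*.ckpt", "MDXC (Roformer)"),
    ("Mel-Band-Roformer*.ckpt", "MDXC (Mel-Band Roformer)"),
    ("*.onnx", "MDX-Net or VR"),
]


def guess_architecture(filename: str) -> Optional[str]:
    """Return the architecture label for a model filename, or None if unknown."""
    for pattern, architecture in FILENAME_PATTERNS:
        # fnmatchcase is case-sensitive on every platform
        if fnmatchcase(filename, pattern):
            return architecture
    return None


print(guess_architecture("model_bs_roformer_ep_317_sdr_12.9755.ckpt"))
print(guess_architecture("htdemucs_ft.yaml"))
```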