Segment Anything Model for Audio
SAM-Audio is a foundation model for isolating any sound in audio using text, visual, or temporal prompts. It can separate specific sounds from complex audio mixtures based on natural language descriptions, visual cues from video, or time spans.
Requirements:
- Python >= 3.10
- CUDA-compatible GPU (recommended)
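The requirements above can be checked before installing anything. The following is a minimal sketch, not part of the `sam_audio` package: `check_environment` is a hypothetical helper name, and the CPU fallback only reflects that a GPU is listed as recommended rather than required. Whether CPU inference is practical for a given checkpoint is not stated here.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 10)):
    """Report whether the interpreter and GPU match the stated requirements.

    Hypothetical helper for illustration; not provided by sam_audio.
    """
    python_ok = sys.version_info >= min_python

    # torch is imported only if it is installed, so this check can also run
    # before the dependencies are in place.
    cuda_available = False
    if importlib.util.find_spec("torch") is not None:
        import torch
        cuda_available = torch.cuda.is_available()

    return {
        "python_ok": python_ok,
        "cuda_available": cuda_available,
        # Assumption: fall back to the CPU when no CUDA device is visible.
        "device": "cuda" if cuda_available else "cpu",
    }
```

Calling `check_environment()["device"]` gives a string that could be passed to `.to(...)` in place of a hard-coded `"cuda"`.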
Install dependencies:
pip install .
To download the checkpoints, authenticate with Hugging Face first (e.g. run `hf auth login` after generating an access token).
from sam_audio import SAMAudio, SAMAudioProcessor
import torchaudio
model = SAMAudio.from_pretrained("facebook/sam-audio-large")
processor = SAMAudioProcessor.from_pretrained("facebook/sam-audio-large")
model = model.eval().cuda()
file = "<audio file>" # audio file path or torch tensor
description = "<description>"
batch = processor(
    audios=[file],
    descriptions=[description],
).to("cuda")
result = model.separate(batch)
# Save separated audio
sample_rate = processor.audio_sampling_rate
torchaudio.save("target.wav", result.target.cpu(), sample_rate) # The isolated sound
torchaudio.save("residual.wav", result.residual.cpu(), sample_rate) # Everything else

SAM-Audio supports three types of prompts:
- Text Prompting: Describe the sound you want to isolate using natural language.
  processor(audios=[audio], descriptions=["A man speaking"])
- Visual Prompting: Use video frames and masks to isolate sounds associated with visual objects.
  processor(audios=[video], descriptions=[""], masked_videos=processor.mask_videos([frames], [mask]))
- Span Prompting: Specify the time ranges where the target sound occurs.
  processor(audios=[audio], descriptions=["A horn honking"], anchors=[[["+", 6.3, 7.0]]])
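The nested `anchors` argument in the span prompting example is easy to get wrong by one level of brackets. The sketch below builds it from plain `(start, end)` pairs. `build_anchors` is a hypothetical helper, not part of `sam_audio`. It assumes, based only on the shape of the example above, that times are in seconds, that `"+"` marks a span where the target sound is present, and that the outer list holds one entry per audio in the batch.

```python
def build_anchors(spans, sign="+"):
    """Convert (start, end) pairs into anchor entries for a single audio.

    Hypothetical helper; the ["+", start, end] layout is copied from the
    span prompting example, and its interpretation is an assumption.
    """
    anchors = []
    for start, end in spans:
        start, end = float(start), float(end)
        # Reject negative or empty spans early rather than inside the model.
        if start < 0 or end <= start:
            raise ValueError(f"invalid span: ({start}, {end})")
        anchors.append([sign, start, end])
    return anchors


# One inner list per audio in the batch, matching the example above.
batch_anchors = [build_anchors([(6.3, 7.0)])]
```

The resulting `batch_anchors` has the same structure as the literal in the example and could be passed as `anchors=batch_anchors`.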
See the examples directory for more detailed examples.
The table below lists each released model along with its overall subjective evaluation scores.
| Model | General SFX | Speech | Speaker | Music | Instr(wild) | Instr(pro) |
|---|---|---|---|---|---|---|
| sam-audio-small | 3.62 | 3.99 | 3.12 | 4.11 | 3.56 | 4.24 |
| sam-audio-base | 3.28 | 4.25 | 3.57 | 3.87 | 3.66 | 4.27 |
| sam-audio-large | 3.50 | 4.03 | 3.60 | 4.22 | 3.66 | 4.49 |
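No single model leads every column, so the best choice may depend on the target sound category. The sketch below copies the scores from the table and looks up the top model per category. The helper names are hypothetical, and the unweighted mean is only a convenience for comparison, not a metric reported with the models.

```python
CATEGORIES = ["General SFX", "Speech", "Speaker", "Music", "Instr(wild)", "Instr(pro)"]

# Scores copied from the table above, in column order.
SCORES = {
    "sam-audio-small": [3.62, 3.99, 3.12, 4.11, 3.56, 4.24],
    "sam-audio-base": [3.28, 4.25, 3.57, 3.87, 3.66, 4.27],
    "sam-audio-large": [3.50, 4.03, 3.60, 4.22, 3.66, 4.49],
}


def best_models(category):
    """Return every model that reaches the top score in a category."""
    idx = CATEGORIES.index(category)
    top = max(scores[idx] for scores in SCORES.values())
    # A list is returned because two models can tie for the top score.
    return [name for name, scores in SCORES.items() if scores[idx] == top]


def mean_score(model):
    """Unweighted mean of a model's six category scores."""
    scores = SCORES[model]
    return sum(scores) / len(scores)
```

For example, `best_models("Speech")` returns only the base model, while `best_models("Instr(wild)")` returns both the base and large models because they tie.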
We also release a second variant of each size that performs better specifically on visual prompting.
See the eval directory for instructions and scripts to reproduce results from the paper.
See contributing and code of conduct for more information.
This project is licensed under the SAM License - see the LICENSE file for details.
