
SAM-Audio


Segment Anything Model for Audio

SAM-Audio is a foundation model for isolating any sound in audio using text, visual, or temporal prompts. It can separate specific sounds from complex audio mixtures based on natural language descriptions, visual cues from video, or time spans. This repository provides code for running inference with SAM-Audio, links for downloading the trained model checkpoints, and example notebooks that show how to use the model.

Setup

Requirements:

  • Python >= 3.10
  • CUDA-compatible GPU (recommended)

Install dependencies:

pip install .
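
As a quick sanity check after installing, you can confirm that the package imports and that PyTorch sees your GPU (a minimal sketch; torch is assumed to be pulled in as a dependency):

import torch
import sam_audio  # should import cleanly after `pip install .`

print(torch.cuda.is_available())  # True if a CUDA-compatible GPU is visible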

Usage

⚠️ Before using SAM-Audio, please request access to the checkpoints on the SAM-Audio Hugging Face repo. Once access is granted, you need to authenticate before downloading the checkpoints, e.g. by generating an access token on Hugging Face and running hf auth login.
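
If you prefer to authenticate from Python rather than the CLI, the huggingface_hub library (which from_pretrained relies on) provides an equivalent login helper; a minimal sketch:

from huggingface_hub import login

# Paste the access token generated on the Hugging Face website when prompted,
# or pass it directly (e.g. read from an environment variable).
login()
# login(token="hf_...")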

Basic Text Prompting

from sam_audio import SAMAudio, SAMAudioProcessor
import torchaudio

model = SAMAudio.from_pretrained("facebook/sam-audio-large")
processor = SAMAudioProcessor.from_pretrained("facebook/sam-audio-large")
model = model.eval().cuda()

file = "<audio file>" # audio file path or torch tensor
description = "<description>"

batch = processor(
    audios=[file],
    descriptions=[description],
).to("cuda")

result = model.separate(batch)

# Save separated audio
sample_rate = processor.audio_sampling_rate
torchaudio.save("target.wav", result.target.cpu(), sample_rate)      # The isolated sound
torchaudio.save("residual.wav", result.residual.cpu(), sample_rate)  # Everything else

Prompting Methods

SAM-Audio supports three types of prompts:

  1. Text Prompting: Describe the sound you want to isolate using natural language

    processor(audios=[audio], descriptions=["A man speaking"])
  2. Visual Prompting: Use video frames and masks to isolate sounds associated with visual objects

    processor(audios=[video], descriptions=[""], masked_videos=processor.mask_videos([frames], [mask]))
  3. Span Prompting: Specify time ranges where the target sound occurs

    processor(audios=[audio], descriptions=["A horn honking"], anchors=[[["+", 6.3, 7.0]]])

See the examples directory for more detailed examples.

Models

Below is a table of the models we released, along with their overall subjective evaluation scores:

Model            General SFX  Speech  Speaker  Music  Instr(wild)  Instr(pro)
sam-audio-small  3.62         3.99    3.12     4.11   3.56         4.24
sam-audio-base   3.28         4.25    3.57     3.87   3.66         4.27
sam-audio-large  3.50         4.03    3.60     4.22   3.66         4.49
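
Any of these sizes can be loaded by swapping the checkpoint name (assuming each is published under the same facebook organization as the large checkpoint used above):

model = SAMAudio.from_pretrained("facebook/sam-audio-small")                # smaller, faster variant
processor = SAMAudioProcessor.from_pretrained("facebook/sam-audio-small")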

We additionally release a variant of each size that performs better specifically on visual prompting.

Evaluation

See the eval directory for instructions and scripts to reproduce results from the paper.

Contributing

See contributing and code of conduct for more information.

License

This project is licensed under the SAM License - see the LICENSE file for details.
