Collection of open-source multimodal models with audio support, focusing on models that can process audio tokens and understand them in conjunction with text prompts.
This repository catalogs and analyzes a relatively small but significant subclass of multimodal models: those with native audio support, i.e., models that take audio (and text) as input and produce text output. Another category of models potentially in scope is any-to-any: models which, as the name suggests, are built to handle any input and output pairing.
As of December 7, 2025, these models are classified on Hugging Face under the "multimodal" category rather than the "audio" category—an interesting distinction that reflects their fundamentally different architecture from traditional ASR models.
While the primary focus is open-source models, closed-source providers are included for completeness given the relatively small size of this emerging field.
The focus of this resource list maps to these two tasks in Hugging Face's (current) classification system for AI tasks:
| Task | Description | Links |
|---|---|---|
| audio-text-to-text | Models that accept audio + text input and produce text output | |
| any-to-any | Omni-modal models handling any input/output pairing (subsumes audio) | |
| Resource | Link |
|---|---|
| Task Overview | audio-text-to-text |
| Models (Trending) | Browse models |
| Datasets | Browse datasets |
| Resource | Link |
|---|---|
| Task Overview | any-to-any |
| Models (Trending) | Browse models |
| Datasets | Browse datasets |
| Document | Description |
|---|---|
| models/index.md | Complete index of all audio multimodal models |
| models.md | Featured open-source audio multimodal models with detailed profiles |
| companies.md | Companies developing audio multimodal models (open source focus) |
| providers.md | Organizations developing audio multimodal models (open & closed source) |
| benchmarks.md | Evaluation frameworks and leaderboards |
| scope.md | Definition of what "audio multimodal" means in this context |
| Location | Description |
|---|---|
| notes/ | Personal notes on nomenclature, parameters, and reference links |
| notes/nomenclature.md | Terminology and naming conventions |
| notes/parameters.md | Model parameter sizes for deployment planning |
| notes/ref.md | Quick reference links (HuggingFace task pages) |
The ask-ai/ directory contains AI-assisted research outputs:
| Document | Description |
|---|---|
| ask-ai/prompt.md | The prompt used to generate the analysis |
| ask-ai/outputs/models.md | Comprehensive model list beyond featured models |
| ask-ai/outputs/nomenclature.md | Terminology analysis across vendors and research |
| ask-ai/outputs/benchmarks.md | Extended benchmark coverage by workflow type |
| ask-ai/outputs/pros-cons.md | Comparison of STT vs pipeline vs multimodal approaches |
| ask-ai/outputs/redundancy-analysis.md | Will multimodal ASR make traditional STT redundant? |
| ask-ai/outputs/ecosystem.md | Ecosystem overview and emerging trends |
| Location | Description |
|---|---|
| data/ | Raw exports from Hugging Face API (CSV/JSON) |
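Such exports can be driven by the Hugging Face Hub's public REST API, which exposes a model-listing endpoint filterable by pipeline tag. The sketch below only builds the query URL (the endpoint path and parameter names reflect the public `/api/models` route; verify against the current Hub API docs before relying on them):

```python
from urllib.parse import urlencode

# Base endpoint of the public Hugging Face Hub model-listing API.
BASE_URL = "https://huggingface.co/api/models"

def hub_query_url(pipeline_tag: str, sort: str = "downloads", limit: int = 50) -> str:
    """Return the API URL listing models for one pipeline tag."""
    params = {"pipeline_tag": pipeline_tag, "sort": sort, "limit": limit}
    return f"{BASE_URL}?{urlencode(params)}"

# Query for each of the two tasks this repository tracks.
print(hub_query_url("audio-text-to-text"))
print(hub_query_url("any-to-any"))
```

Fetching each URL (e.g. with `requests.get`) returns a JSON array of model records that can be flattened into the CSV/JSON files kept here.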
| Document | Description |
|---|---|
| resource-lists.md | Curated awesome-lists for multimodal AI |
| models-hf.md | GitHub repositories for audio multimodal models |
| papers.md | Research papers and academic resources |
| tooling.md | Data pipeline and processing tools |
| eval-tools.md | Evaluation frameworks and test prompts |
| inference-tools.md | Tools for running inference at scale |
| demos-and-starters.md | Example implementations and starter projects |
| github-tags.md | GitHub topic pages for discovery |
A custom evaluation framework for testing true audio understanding capabilities—what separates audio multimodal models from traditional STT.
| Location | Description |
|---|---|
| evaluations/README.md | Evaluation framework overview and methodology |
| evaluations/test-prompts/ | Complete test prompt library |
Human-Authored Prompts (by-daniel/):
| Prompt | Tests |
|---|---|
| accent-identification.md | Regional accent detection with grounded examples |
| guess-my-mood.md | Emotional analysis, fatigue detection, word-tone dissonance |
| non-verbal-context.md | Multi-speaker interpersonal dynamics, pauses as communication |
| parameters.md | Vocal frequency analysis for audio engineering (EQ recommendations) |
| who-is-this.md | Speaker identification/recognition |
AI-Generated Prompts (ai-generated/): Extended benchmark covering additional audio understanding dimensions.
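A minimal harness for running these prompt files against a model might look like the following sketch. `run_model` is a hypothetical stand-in for whatever inference backend is under test; the directory layout mirrors this repository's `evaluations/test-prompts/` structure:

```python
from pathlib import Path

def collect_prompts(prompt_dir: str) -> list[Path]:
    """Return all Markdown prompt files under a test-prompt directory, sorted by name."""
    return sorted(Path(prompt_dir).glob("*.md"))

def evaluate(prompt_dir: str, audio_path: str, run_model) -> dict[str, str]:
    """Run every prompt against one audio clip; map prompt name to model response.

    `run_model(audio_path, prompt_text)` is a placeholder for the actual
    inference call (e.g. a local transformers pipeline or a hosted API).
    """
    return {
        p.stem: run_model(audio_path, p.read_text())
        for p in collect_prompts(prompt_dir)
    }
```

Keeping the harness backend-agnostic makes it easy to run the same prompt library across several audio multimodal models and diff their responses.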
The audio category on Hugging Face includes ASR (Automatic Speech Recognition) models like Whisper, Parakeet, and Wav2Vec, along with supporting components (diarization, VAD, punctuation restoration). These are powerful but follow a traditional pipeline approach.
Audio multimodal models are fundamentally different:
- Native audio understanding: Process audio tokens directly alongside text prompts
- Unified inference: Single API call handles transcription, formatting, and summarization
- Prompt-guided processing: Can be instructed to analyze accents, describe voices, or format output
Instead of chaining: Whisper → GPT-4 → Formatting
Audio multimodal enables: Single API call with system prompt → Formatted output
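Concretely, the single call bundles the system prompt, the audio, and the task instruction into one chat request. The sketch below shows that message shape; it mirrors the chat format used by models such as Qwen2-Audio in the `transformers` library, but exact field names vary by model, so treat them as illustrative:

```python
# Illustrative single-call request: system prompt (formatting instructions),
# audio content, and the user's task travel together in one message list.
def build_request(audio_url: str, instruction: str, system_prompt: str) -> list:
    """Assemble one chat request combining audio and text content."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [
            {"type": "audio", "audio_url": audio_url},
            {"type": "text", "text": instruction},
        ]},
    ]

messages = build_request(
    audio_url="journal-entry.wav",  # hypothetical local recording
    instruction="Transcribe this entry and format it as dated bullet points.",
    system_prompt="You are a voice-journal assistant. Output clean Markdown.",
)
```

One such request replaces the entire Whisper → LLM → formatting chain: the model transcribes, cleans up, and formats in a single forward pass.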
Use cases:
- Voice journals with structured formatting
- Conference call summarization
- Accent/voice analysis
- Long-form audio processing (tested with 1-hour recordings)
See models/ for detailed profiles:
| Model | Developer | Parameters | License |
|---|---|---|---|
| Qwen Omni | Alibaba | 7B-35B | Apache 2.0 |
| Gemma 3n | Google | 2B-4B effective | Gemma |
| Macaw-LLM | Chenyang Lyu et al. | 7B-13B | Apache 2.0 |
| Model | Developer | Parameters | License |
|---|---|---|---|
| Audio Flamingo 3 | NVIDIA | 8B | Non-commercial |
| BuboGPT | ByteDance | 7B-13B | BSD 3-Clause |
| Kimi-Audio | Moonshot AI | 10B | MIT/Apache 2.0 |
| OmniAudio | NexaAI | 2.6B | Apache 2.0 |
| Phi-4-Multimodal | Microsoft | 5.6B | MIT |
| Qwen2-Audio | Alibaba | 8B | Apache 2.0 |
| SALMONN | ByteDance/Tsinghua | 7B-13B | Apache 2.0 |
| Soundwave | FreedomIntelligence | 9B | Apache 2.0 |
| Step-Audio-Chat | StepFun | 130B | Apache 2.0 |
| Step-Audio-R1 | StepFun | 33B | Apache 2.0 |
| Ultravox | Fixie.ai | 8B-70B | MIT |
| Voxtral | Mistral AI | 5B-24B | Apache 2.0 |
See providers.md for the full list, or companies.md for a company-to-models mapping:
- Open Source: Alibaba, ByteDance, Fixie.ai, FreedomIntelligence, Google DeepMind, Microsoft, Mistral AI, Moonshot AI, NexaAI, NVIDIA, StepFun
- Closed Source: Google (Gemini), OpenAI (GPT-4o), Anthropic (Claude), Reka AI
See benchmarks.md for full coverage of evaluation frameworks and leaderboards.
| Benchmark | Developer | Focus | Links |
|---|---|---|---|
| MSEB | Google Research | Sound embedding evaluation | GitHub · Blog |
| UltraEval-Audio | OpenBMB | Speech understanding & generation | GitHub |
| lmms-eval | EvolvingLMMs Lab | 100+ multimodal tasks | GitHub |
| VERSA | WavLab Speech | 90+ speech/audio metrics | GitHub |
| AudioBench | AudioLLMs | Comprehensive audio LLM | GitHub · Leaderboard |
Leaderboards: AudioBench · Open ASR
- Awesome-Audio-LLM - Curated list of audio LLM research
- Hugging Face audio-text-to-text models - Browse latest models
- Hugging Face ASR models - Traditional ASR for comparison
Audio multimodal models may well be the successors to first-wave STT models. The ability to handle transcription, cleanup, and formatting in a single unified inference pass, without the complexity of VAD, punctuation restoration, and post-processing chains, makes this an elegant and powerful approach to voice AI.
This repository will be periodically updated as the field evolves. Given the rapid pace of AI development, timestamps are included throughout.
Created: December 7, 2025 | Updated: December 8, 2025