
Multimodal AI - Audio-Text-To-Text Modality (Resources, Notes)

Collection of open-source multimodal models with audio support, focusing on models that can process audio tokens and understand them in conjunction with text prompts.

Overview

This repository catalogs and analyzes a relatively small but significant subclassification of multimodal models: those with native audio support, i.e., audio-text-to-text models. A second category potentially in scope is any-to-any: models which, as the name suggests, are built to handle any input/output pairing.

As of December 7, 2025, these models are classified on Hugging Face under the "multimodal" category rather than the "audio" category—an interesting distinction that reflects their fundamentally different architecture from traditional ASR models.

While the primary focus is open-source models, closed-source providers are included for completeness given the relatively small size of this emerging field.

Hugging Face Task Classification Mapping

This resource list maps to two tasks in Hugging Face's current classification system for AI tasks:

| Task | Description | Links |
| --- | --- | --- |
| audio-text-to-text | Models that accept audio + text input and produce text output | Task · Models |
| any-to-any | Omni-modal models handling any input/output pairing (subsumes audio) | Task · Models |
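
For programmatic discovery, both task slugs can be queried through the Hugging Face Hub client. A minimal sketch, assuming a recent huggingface_hub release in which list_models accepts a task filter:

```python
from huggingface_hub import list_models

# Browse the two tasks this repository covers, most-downloaded first.
# Assumes a recent huggingface_hub release; `task` takes the same slugs
# used on the Hub's task pages.
for task in ("audio-text-to-text", "any-to-any"):
    print(f"--- {task} ---")
    for model in list_models(task=task, sort="downloads", limit=10):
        print(model.id)
```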

Hugging Face Resources

Audio Text To Text

| Resource | Link |
| --- | --- |
| Task Overview | audio-text-to-text |
| Models (Trending) | Browse models |
| Datasets | Browse datasets |

Omni / All-Modality Multimodal

| Resource | Link |
| --- | --- |
| Task Overview | any-to-any |
| Models (Trending) | Browse models |
| Datasets | Browse datasets |

Repository Index

Core Documentation

| Document | Description |
| --- | --- |
| models/index.md | Complete index of all audio multimodal models |
| models.md | Featured open-source audio multimodal models with detailed profiles |
| companies.md | Companies developing audio multimodal models (open-source focus) |
| providers.md | Organizations developing audio multimodal models (open and closed source) |
| benchmarks.md | Evaluation frameworks and leaderboards |
| scope.md | Definition of what "audio multimodal" means in this context |

Notes & Research

| Location | Description |
| --- | --- |
| notes/ | Personal notes on nomenclature, parameters, and reference links |
| notes/nomenclature.md | Terminology and naming conventions |
| notes/parameters.md | Model parameter sizes for deployment planning |
| notes/ref.md | Quick reference links (Hugging Face task pages) |

AI-Generated Analysis

The ask-ai/ directory contains AI-assisted research outputs:

| Document | Description |
| --- | --- |
| ask-ai/prompt.md | The prompt used to generate the analysis |
| ask-ai/outputs/models.md | Comprehensive model list beyond the featured models |
| ask-ai/outputs/nomenclature.md | Terminology analysis across vendors and research |
| ask-ai/outputs/benchmarks.md | Extended benchmark coverage by workflow type |
| ask-ai/outputs/pros-cons.md | Comparison of STT vs. pipeline vs. multimodal approaches |
| ask-ai/outputs/redundancy-analysis.md | Will multimodal ASR make traditional STT redundant? |
| ask-ai/outputs/ecosystem.md | Ecosystem overview and emerging trends |

Data

| Location | Description |
| --- | --- |
| data/ | Raw exports from the Hugging Face API (CSV/JSON) |

Resources & Links

| Document | Description |
| --- | --- |
| resource-lists.md | Curated awesome-lists for multimodal AI |
| models-hf.md | GitHub repositories for audio multimodal models |
| papers.md | Research papers and academic resources |
| tooling.md | Data pipeline and processing tools |
| eval-tools.md | Evaluation frameworks and test prompts |
| inference-tools.md | Tools for running inference at scale |
| demos-and-starters.md | Example implementations and starter projects |
| github-tags.md | GitHub topic pages for discovery |

Evaluations & Benchmarking

A custom evaluation framework for testing true audio understanding capabilities—what separates audio multimodal models from traditional STT.

| Location | Description |
| --- | --- |
| evaluations/README.md | Evaluation framework overview and methodology |
| evaluations/test-prompts/ | Complete test prompt library |

Test Prompt Categories

Human-Authored Prompts (by-daniel/):

| Prompt | Tests |
| --- | --- |
| accent-identification.md | Regional accent detection with grounded examples |
| guess-my-mood.md | Emotional analysis, fatigue detection, word-tone dissonance |
| non-verbal-context.md | Multi-speaker interpersonal dynamics, pauses as communication |
| parameters.md | Vocal frequency analysis for audio engineering (EQ recommendations) |
| who-is-this.md | Speaker identification/recognition |

AI-Generated Prompts (ai-generated/): Extended benchmark covering additional audio understanding dimensions.


Why Audio Multimodal Matters

Classic STT vs. Audio Multimodal

The audio category on Hugging Face includes ASR (Automatic Speech Recognition) models like Whisper, Parakeet, and Wav2Vec, along with supporting components (diarization, VAD, punctuation restoration). These are powerful but follow a traditional pipeline approach.

Audio multimodal models are fundamentally different:

  • Native audio understanding: Process audio tokens directly alongside text prompts
  • Unified inference: Single API call handles transcription, formatting, and summarization
  • Prompt-guided processing: Can be instructed to analyze accents, describe voices, or format output
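
To make the contrast concrete, here is a minimal sketch of prompt-guided audio processing, modeled on the published Qwen2-Audio usage pattern (one of the featured models below). Exact method names may vary across transformers versions, and the audio path is a placeholder:

```python
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# A single conversation turn combines an audio clip with a text instruction:
# no separate ASR, diarization, or formatting stages.
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "recording.wav"},  # placeholder path
        {"type": "text", "text": "Transcribe this, then describe the speaker's accent and mood."},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audio, _ = librosa.load("recording.wav", sr=processor.feature_extractor.sampling_rate)

inputs = processor(text=prompt, audios=[audio], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
reply = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(reply)
```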

Practical Advantages

Instead of chaining: Whisper → GPT-4 → Formatting
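
As a point of reference, here is a sketch of that chained approach using the transformers pipeline API (the model choices and audio path are illustrative, not a specific recommendation):

```python
from transformers import pipeline

# Stage 1: speech-to-text (illustrative choice: Whisper via the ASR pipeline).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
transcript = asr("recording.wav")["text"]  # hypothetical audio file

# Stage 2: a separate text-only LLM cleans up and formats the transcript.
# (Illustrative model; in practice this step is often a hosted API call.)
llm = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")
prompt = f"Format this transcript as a structured journal entry:\n\n{transcript}"
formatted = llm(prompt, max_new_tokens=256)[0]["generated_text"]
print(formatted)

# Note what the handoff loses: the text model never hears tone, accent,
# pauses, or speaker identity; only the words survive stage 1.
```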

Audio multimodal enables: Single API call with system prompt → Formatted output

Use cases:

  • Voice journals with structured formatting
  • Conference call summarization
  • Accent/voice analysis
  • Long-form audio processing (tested with 1-hour recordings)

Featured Models

See models/ for detailed profiles:

Any-to-Any (Omni-Modal)

| Model | Developer | Parameters | License |
| --- | --- | --- | --- |
| Qwen Omni | Alibaba | 7B-35B | Apache 2.0 |
| Gemma 3n | Google | 2B-4B (effective) | Gemma |
| Macaw-LLM | Chenyang Lyu et al. | 7B-13B | Apache 2.0 |

Audio-Text-to-Text

| Model | Developer | Parameters | License |
| --- | --- | --- | --- |
| Audio Flamingo 3 | NVIDIA | 8B | Non-commercial |
| BuboGPT | ByteDance | 7B-13B | BSD 3-Clause |
| Kimi-Audio | Moonshot AI | 10B | MIT/Apache 2.0 |
| OmniAudio | NexaAI | 2.6B | Apache 2.0 |
| Phi-4-Multimodal | Microsoft | 5.6B | MIT |
| Qwen2-Audio | Alibaba | 8B | Apache 2.0 |
| SALMONN | ByteDance/Tsinghua | 7B-13B | Apache 2.0 |
| Soundwave | FreedomIntelligence | 9B | Apache 2.0 |
| Step-Audio-Chat | StepFun | 130B | Apache 2.0 |
| Step-Audio-R1 | StepFun | 33B | Apache 2.0 |
| Ultravox | Fixie.ai | 8B-70B | MIT |
| Voxtral | Mistral AI | 5B-24B | Apache 2.0 |

Providers

See providers.md for the full list, or companies.md for a company-to-models mapping:

  • Open Source: Alibaba, ByteDance, Fixie.ai, FreedomIntelligence, Google DeepMind, Microsoft, Mistral AI, Moonshot AI, NexaAI, NVIDIA, StepFun
  • Closed Source: Google (Gemini), OpenAI (GPT-4o), Anthropic (Claude), Reka AI

Benchmarks

See benchmarks.md for full coverage of evaluation frameworks and leaderboards.

| Benchmark | Developer | Focus | Links |
| --- | --- | --- | --- |
| MSEB | Google Research | Sound embedding evaluation | GitHub · Blog |
| UltraEval-Audio | OpenBMB | Speech understanding & generation | GitHub |
| lmms-eval | EvolvingLMMs Lab | 100+ multimodal tasks | GitHub |
| VERSA | WavLab | 90+ speech/audio metrics | GitHub |
| AudioBench | AudioLLMs | Comprehensive audio LLM evaluation | GitHub · Leaderboard |

Leaderboards: AudioBench · Open ASR




Future of Voice AI

Audio multimodal models may well be the successor to first-wave STT models. The ability to handle transcription, cleanup, and formatting in a single unified inference process—without the complexity of VAD, punctuation restoration, and post-processing chains—makes this an elegant and powerful approach to voice AI.

Updates

This repository will be periodically updated as the field evolves. Given the rapid pace of AI development, timestamps are included throughout.


Created: December 7, 2025 | Updated: December 8, 2025
