Multimodal Module

gitpavleenbali edited this page Feb 17, 2026 · 2 revisions

The Multimodal module enables AI agents to process and generate content across multiple modalities: images, audio, and video.

Overview

from pyai.multimodal import Image, Audio, Video, MultimodalContent

Key Components

Component           Description
ImageContent        Image processing and analysis
AudioContent        Audio file handling
VideoContent        Video processing
MultimodalContent   Mixed content container

Quick Start

Image Analysis

from pyai import ask
from pyai.multimodal import Image

# Analyze an image
image = Image.from_file("photo.jpg")
response = ask("What's in this image?", images=[image])
print(response)

Multiple Images

images = [
    Image.from_file("before.jpg"),
    Image.from_file("after.jpg")
]

response = ask(
    "Compare these two images and describe the differences",
    images=images
)

From URL

image = Image.from_url("https://example.com/image.jpg")
response = ask("Describe this image", images=[image])

Base64 Encoded

import base64

with open("image.png", "rb") as f:
    data = base64.b64encode(f.read()).decode()

image = Image.from_base64(data, media_type="image/png")
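When the media type is not known in advance, it can usually be inferred from the file name with the standard library. The sketch below is illustrative only: `encode_image` is a hypothetical helper and not part of pyai, and it assumes `Image.from_base64` accepts a base64 string and a `media_type` as shown above.

```python
import base64
import mimetypes
from pathlib import Path


def encode_image(path: str) -> tuple[str, str]:
    """Return (base64_string, media_type) for an image file.

    Hypothetical helper for illustration; not part of pyai.
    """
    # Infer the media type from the file extension, e.g. ".png" -> "image/png"
    media_type, _ = mimetypes.guess_type(path)
    if media_type is None or not media_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image media type for {path!r}")
    # Read the raw bytes and encode them as an ASCII base64 string
    data = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return data, media_type
```

The returned pair could then be passed on as `Image.from_base64(data, media_type=media_type)`.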

MultimodalContent

Combine multiple types of content:

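from pyai import Agent
from pyai.multimodal import MultimodalContent, Image, Audio

# Build one message that mixes text, an image, and audio
content = MultimodalContent()
content.add_text("Please analyze this meeting recording and slides:")
content.add_image(Image.from_file("slides.png"))
content.add_audio(Audio.from_file("meeting.mp3"))

# Example agent (name and instructions are illustrative); the model must accept every modality used
agent = Agent(name="MeetingAnalyzer", instructions="You summarize meetings.", model="gpt-4o")
response = agent.run(content)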

With Agents

from pyai import Agent
from pyai.multimodal import Image

agent = Agent(
    name="ImageAnalyzer",
    instructions="You are an expert at analyzing images.",
    model="gpt-4o"  # Vision-capable model
)

image = Image.from_file("diagram.png")
result = agent.run("Explain this diagram", images=[image])

Supported Formats

Images

  • PNG, JPEG, GIF, WebP
  • Maximum file size varies by model (typically 20 MB)
  • Auto-resizing available
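A quick pre-flight check can catch unsupported or oversized files before they are sent to a model. The following is a minimal sketch using only the standard library; `check_image` and its constants are hypothetical names, and the extension list and 20 MB default simply mirror the list above rather than any limit enforced by pyai.

```python
from pathlib import Path

# Assumed values taken from the list above; real limits depend on the model.
SUPPORTED_IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".webp"}
MAX_IMAGE_BYTES = 20 * 1024 * 1024  # "typically 20 MB"


def check_image(path: str, max_bytes: int = MAX_IMAGE_BYTES) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    p = Path(path)
    if p.suffix.lower() not in SUPPORTED_IMAGE_EXTENSIONS:
        problems.append(f"unsupported extension: {p.suffix or '(none)'}")
    if not p.is_file():
        problems.append("file does not exist")
    elif p.stat().st_size > max_bytes:
        problems.append(f"file is {p.stat().st_size} bytes, limit is {max_bytes}")
    return problems
```

Returning a list instead of raising on the first failure lets the caller report every issue at once.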

Audio

  • MP3, WAV, M4A, FLAC, OGG
  • Transcription integration

Video

  • MP4, MOV, WebM
  • Frame extraction for analysis
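Frame extraction usually means sending a handful of representative frames rather than the whole video. How pyai selects frames is not documented here, so the sketch below shows only one common approach, evenly spaced sampling; `sample_frame_indices` is a hypothetical helper and not a pyai function.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick up to num_samples frame indices spread evenly across a video."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    # Never ask for more frames than the video contains
    n = min(num_samples, total_frames)
    step = total_frames / n
    # Take the middle frame of each of the n equal-length segments
    return [int(step * i + step / 2) for i in range(n)]
```

For a 100-frame clip and four samples this yields the midpoints of each quarter of the video.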

Image Processing

from pyai.multimodal import Image

image = Image.from_file("large_photo.jpg")

# Resize for API limits
image = image.resize(max_width=1024, max_height=1024)

# Convert format
image = image.convert(format="jpeg", quality=85)

# Get dimensions
print(f"Size: {image.width}x{image.height}")
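To estimate the dimensions a bounded resize would produce, the arithmetic can be sketched on its own. This assumes `resize(max_width=..., max_height=...)` preserves aspect ratio and never upscales, which the page does not confirm; `fit_within` is a hypothetical helper, not part of pyai.

```python
def fit_within(width: int, height: int, max_width: int, max_height: int) -> tuple[int, int]:
    """Scale (width, height) down to fit a bounding box, keeping aspect ratio."""
    # Use the tighter of the two constraints; cap at 1.0 so small images are not enlarged
    scale = min(max_width / width, max_height / height, 1.0)
    # Round to whole pixels and keep each side at least 1 px
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4000, 3000, 1024, 1024))  # (1024, 768)
```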

Provider Support

Provider             Images  Audio  Video
OpenAI GPT-4o        ✅      ✅     ✅
Anthropic Claude 3   ✅      ❌     ❌
Google Gemini        ✅      ✅     ✅
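The table above can be encoded as a small lookup so that unsupported modalities are caught before a request is made. This is a sketch only: the provider keys and the `unsupported_modalities` helper are hypothetical, and the data is copied from the table rather than queried from pyai.

```python
# Capability matrix copied from the table above; keys are illustrative labels.
PROVIDER_SUPPORT = {
    "openai": {"image", "audio", "video"},
    "anthropic": {"image"},
    "google": {"image", "audio", "video"},
}


def unsupported_modalities(provider: str, modalities: set[str]) -> set[str]:
    """Return the requested modalities that the provider does not support."""
    # Unknown providers are treated as supporting nothing
    return modalities - PROVIDER_SUPPORT.get(provider, set())


print(unsupported_modalities("anthropic", {"image", "audio"}))  # {'audio'}
```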
