Multimodal Module

The Multimodal module enables AI agents to process and generate content across multiple modalities: images, audio, and video.

Overview

from pyai.multimodal import Image, Audio, Video, MultimodalContent

Key Components

Component	Description
ImageContent	Image processing and analysis
AudioContent	Audio file handling
VideoContent	Video processing
MultimodalContent	Mixed content container

Quick Start

Image Analysis

from pyai import ask
from pyai.multimodal import Image

# Analyze an image
image = Image.from_file("photo.jpg")
response = ask("What's in this image?", images=[image])
print(response)

Multiple Images

images = [
    Image.from_file("before.jpg"),
    Image.from_file("after.jpg")
]

response = ask(
    "Compare these two images and describe the differences",
    images=images
)

From URL

image = Image.from_url("https://example.com/image.jpg")
response = ask("Describe this image", images=[image])

Base64 Encoded

import base64

with open("image.png", "rb") as f:
    data = base64.b64encode(f.read()).decode()

image = Image.from_base64(data, media_type="image/png")

MultimodalContent

Combine multiple types of content:

from pyai.multimodal import MultimodalContent, Image, Audio

content = MultimodalContent()
content.add_text("Please analyze this meeting recording and slides:")
content.add_image(Image.from_file("slides.png"))
content.add_audio(Audio.from_file("meeting.mp3"))

response = agent.run(content)

With Agents

from pyai import Agent
from pyai.multimodal import Image

agent = Agent(
    name="ImageAnalyzer",
    instructions="You are an expert at analyzing images.",
    model="gpt-4o"  # Vision-capable model
)

image = Image.from_file("diagram.png")
result = agent.run("Explain this diagram", images=[image])

Supported Formats

Images

PNG, JPEG, GIF, WebP
Max size varies by model (typically 20MB)
Auto-resizing available

Audio

MP3, WAV, M4A, FLAC, OGG
Transcription integration

Video

MP4, MOV, WebM
Frame extraction for analysis

Image Processing

from pyai.multimodal import Image

image = Image.from_file("large_photo.jpg")

# Resize for API limits
image = image.resize(max_width=1024, max_height=1024)

# Convert format
image = image.convert(format="jpeg", quality=85)

# Get dimensions
print(f"Size: {image.width}x{image.height}")

Provider Support

Provider	Images	Audio	Video
OpenAI GPT-4o	✅	✅	✅
Anthropic Claude 3	✅	❌	❌
Google Gemini	✅	✅	✅

🧠 PYAI Wiki

Home

🚀 Getting Started

💡 Core Concepts

🎯 One-Liner APIs

🤖 Agent Framework

🔗 Multi-Agent

🛠️ Tools & Skills

🔒 Security

📚 Reference

_{Intelligence, Embedded.}

Multimodal Module

Multimodal Module

Overview

Key Components

Quick Start

Image Analysis

Multiple Images

From URL

Base64 Encoded

MultimodalContent

With Agents

Supported Formats

Images

Audio

Video

Image Processing

Provider Support

See Also

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!