Multimodal Module
gitpavleenbali edited this page Feb 17, 2026 · 2 revisions
The Multimodal module enables AI agents to process and generate content across multiple modalities: images, audio, and video.

```python
from pyai.multimodal import Image, Audio, Video, MultimodalContent
```

| Component | Description |
|---|---|
| ImageContent | Image processing and analysis |
| AudioContent | Audio file handling |
| VideoContent | Video processing |
| MultimodalContent | Mixed content container |

```python
from pyai import ask
from pyai.multimodal import Image

# Analyze an image
image = Image.from_file("photo.jpg")
response = ask("What's in this image?", images=[image])
print(response)
```

```python
# Compare multiple images in a single request
images = [
    Image.from_file("before.jpg"),
    Image.from_file("after.jpg"),
]

response = ask(
    "Compare these two images and describe the differences",
    images=images,
)
```

```python
# Load an image from a URL
image = Image.from_url("https://example.com/image.jpg")
response = ask("Describe this image", images=[image])
```

```python
import base64

# Load an image from base64-encoded data
with open("image.png", "rb") as f:
    data = base64.b64encode(f.read()).decode()

image = Image.from_base64(data, media_type="image/png")
```

Combine multiple types of content:

```python
from pyai.multimodal import MultimodalContent, Image, Audio

content = MultimodalContent()
content.add_text("Please analyze this meeting recording and slides:")
content.add_image(Image.from_file("slides.png"))
content.add_audio(Audio.from_file("meeting.mp3"))

# `agent` is an existing Agent whose model accepts image and audio input
response = agent.run(content)
```

```python
from pyai import Agent
from pyai.multimodal import Image

agent = Agent(
    name="ImageAnalyzer",
    instructions="You are an expert at analyzing images.",
    model="gpt-4o",  # Vision-capable model
)

image = Image.from_file("diagram.png")
result = agent.run("Explain this diagram", images=[image])
```

Supported image formats:

- PNG, JPEG, GIF, WebP
- Max size varies by model (typically 20 MB)
- Auto-resizing available

Supported audio formats:

- MP3, WAV, M4A, FLAC, OGG
- Transcription integration

Supported video formats:

- MP4, MOV, WebM
- Frame extraction for analysis
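
The format lists above can be turned into a quick pre-flight check before a file is handed to the library. The following is a minimal sketch using only the standard library; `detect_modality`, `check_media`, and `MAX_IMAGE_BYTES` are hypothetical names for illustration and are not part of pyai, and the 20 MB figure is only the "typical" limit mentioned above, so real limits may differ by model.

```python
from pathlib import Path

# Extension sets taken from the supported-format lists in this page.
# ".jpg" is included as the common alternative spelling of ".jpeg".
SUPPORTED_EXTENSIONS = {
    "image": {".png", ".jpg", ".jpeg", ".gif", ".webp"},
    "audio": {".mp3", ".wav", ".m4a", ".flac", ".ogg"},
    "video": {".mp4", ".mov", ".webm"},
}

# Assumption: treat the "typically 20 MB" image limit as 20 * 1024 * 1024 bytes.
MAX_IMAGE_BYTES = 20 * 1024 * 1024


def detect_modality(path):
    """Return "image", "audio", or "video" from the file extension, else None."""
    suffix = Path(path).suffix.lower()
    for modality, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return modality
    return None


def check_media(path, size_bytes):
    """Return (modality, problems); problems is a list of readable issues."""
    problems = []
    modality = detect_modality(path)
    if modality is None:
        suffix = Path(path).suffix or "(none)"
        problems.append(f"unsupported file type: {suffix}")
    elif modality == "image" and size_bytes > MAX_IMAGE_BYTES:
        problems.append(
            f"image is {size_bytes} bytes, above the assumed "
            f"{MAX_IMAGE_BYTES}-byte limit"
        )
    return modality, problems
```

For a file on disk, the size can be read with `Path(path).stat().st_size` and passed as `size_bytes`; taking the size as an argument keeps the check independent of the filesystem.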

```python
from pyai.multimodal import Image

image = Image.from_file("large_photo.jpg")

# Resize for API limits
image = image.resize(max_width=1024, max_height=1024)

# Convert format
image = image.convert(format="jpeg", quality=85)

# Get dimensions
print(f"Size: {image.width}x{image.height}")
```

| Provider | Images | Audio | Video |
|---|---|---|---|
| OpenAI GPT-4o | β | β | β |
| Anthropic Claude 3 | β | β | β |
| Google Gemini | β | β | β |
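
When sizing images for a provider, it can help to know what dimensions a bounding-box resize such as `resize(max_width=1024, max_height=1024)` would produce. This page does not say whether pyai preserves aspect ratio or avoids upscaling, so the sketch below assumes both; `fit_within` is a hypothetical helper, not a pyai function.

```python
def fit_within(width, height, max_width, max_height):
    """Return (new_width, new_height) that fits the box and keeps aspect ratio."""
    if min(width, height, max_width, max_height) <= 0:
        raise ValueError("all dimensions must be positive")
    # Pick the tighter of the two constraints; cap at 1.0 so small images
    # are never upscaled (an assumption, not documented pyai behavior).
    scale = min(max_width / width, max_height / height, 1.0)
    # Round to whole pixels and keep at least one pixel per side.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Under these assumptions, a 4000x3000 photo resized into a 1024x1024 box comes out as 1024x768, while an 800x600 image is left unchanged.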
- ImageContent - Image handling
- AudioContent - Audio handling
- VideoContent - Video handling