I've searched existing issues and I confirm this is not a duplicate.
Description
Support multimodal input by forwarding Discord message attachments (images, documents, voice messages, etc.) to downstream ACP agents.
Currently, discord.rs only extracts msg.content (text) and ignores msg.attachments entirely. The ACP prompt in connection.rs is hardcoded to "type": "text" only:
"prompt": [{"type": "text", "text": prompt}],
This means any image, file, or voice attachment sent by users in Discord is silently dropped.
Proposed Changes
- Discord handler (src/discord.rs): Parse msg.attachments, download or extract URLs for images/files/voice.
- ACP prompt (src/acp/connection.rs): Extend the prompt array to include additional content types (e.g. "type": "image", "type": "file") alongside the existing text.
- Validation: Check whether the downstream ACP agent supports multimodal content types and gracefully degrade to text-only if not.
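A rough sketch of how the three changes above could fit together, using only the standard library. Everything here is an assumption for illustration: `Attachment` is a simplified stand-in rather than the actual Discord library type, `ContentBlock`, `block_for`, and `build_prompt` are hypothetical names, and the `agent_supports_multimodal` flag stands in for whatever capability check the ACP agent actually exposes. Voice messages are treated as generic files, and the fallback note in text-only mode is one possible way to degrade, not a settled design.

```rust
/// Hypothetical, simplified view of a Discord attachment (not the real library type).
#[derive(Debug, Clone)]
struct Attachment {
    url: String,
    content_type: Option<String>,
}

/// Hypothetical content block for the ACP prompt array.
/// The variants mirror the "text", "image", and "file" types named in this issue.
#[derive(Debug, Clone, PartialEq)]
enum ContentBlock {
    Text(String),
    Image { url: String },
    File { url: String },
}

/// Classify an attachment by MIME type; anything that is not an image
/// (documents, voice messages, unknown types) falls back to a file block.
fn block_for(att: &Attachment) -> ContentBlock {
    match att.content_type.as_deref() {
        Some(ct) if ct.starts_with("image/") => ContentBlock::Image { url: att.url.clone() },
        _ => ContentBlock::File { url: att.url.clone() },
    }
}

/// Build the prompt array: the text always comes first, attachments follow
/// only when the agent is assumed to support multimodal content.
fn build_prompt(
    text: &str,
    attachments: &[Attachment],
    agent_supports_multimodal: bool,
) -> Vec<ContentBlock> {
    let mut blocks = vec![ContentBlock::Text(text.to_string())];
    if agent_supports_multimodal {
        blocks.extend(attachments.iter().map(block_for));
    } else if !attachments.is_empty() {
        // Text-only fallback (assumed behaviour): mention the attachments
        // so they are not silently dropped.
        let urls: Vec<&str> = attachments.iter().map(|a| a.url.as_str()).collect();
        blocks.push(ContentBlock::Text(format!(
            "[attachments not forwarded: {}]",
            urls.join(", ")
        )));
    }
    blocks
}
```

With a multimodal agent, a message with one PNG and one voice clip would yield three blocks (text, image, file); with a text-only agent it would yield two text blocks, the second listing the attachment URLs.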
Use Case
Users on Discord frequently share screenshots, error logs, documents, and voice messages when asking for help. Without multimodal support, the agent cannot see or process these attachments, limiting its usefulness to text-only interactions.