Multimodal support for image understanding

## Goal
Allow the agent to consume and reason over images as first-class inputs.

## Proposed Scope
- Add image ingestion, vision-capable model routing, and multimodal prompt/tool plumbing.
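
A minimal sketch of how ingestion and vision-capable routing could fit together, to make the scope above concrete. Every name here (`ImagePart`, `Message`, `ModelSpec`, `ingest_image`, `route_model`) is hypothetical and not taken from the existing codebase. It also assumes that images arrive as raw bytes and that each configured model declares whether it supports vision.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ImagePart:
    """Hypothetical container for one ingested image."""
    data: bytes
    mime_type: str


@dataclass
class Message:
    """Hypothetical user message: text plus zero or more images."""
    text: str
    images: List[ImagePart] = field(default_factory=list)


@dataclass
class ModelSpec:
    """Hypothetical model registry entry."""
    name: str
    supports_vision: bool


def sniff_mime_type(data: bytes) -> Optional[str]:
    """Detect a few common image formats from their magic bytes."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "image/gif"
    return None


def ingest_image(data: bytes) -> ImagePart:
    """Validate raw bytes and wrap them as an ImagePart."""
    mime = sniff_mime_type(data)
    if mime is None:
        raise ValueError("unsupported or unrecognized image format")
    return ImagePart(data=data, mime_type=mime)


def route_model(message: Message, models: List[ModelSpec]) -> ModelSpec:
    """Pick the first configured model able to handle the message.

    Messages with images require a vision-capable model; text-only
    messages can go to any model.
    """
    needs_vision = bool(message.images)
    for model in models:
        if model.supports_vision or not needs_vision:
            return model
    raise LookupError("no vision-capable model is configured")
```

With a registry ordered as `[text-only, vision]`, a plain-text message would route to the first entry and a message carrying an ingested image to the second. Failing loudly when no vision-capable model exists, rather than silently dropping the image, is one possible design choice and not a settled requirement.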

## Acceptance Criteria
- Users can provide images, and the agent's responses and actions are grounded in the visual content of those images.

## Target Date
- 22 Aug 2026 (IST)