AI-powered visual web research agent — speak a task, watch it navigate live sites with Gemini vision, get a spoken briefing and a comparison report.
Track: UI Navigator · Visual UI understanding & interaction
- Demo: See Voyance research, verify, and narrate in a real end-to-end flow.
- Blog: Read the architecture and implementation decisions behind Voyance.
Voyance turns natural language into competitive intelligence in minutes:
| Step | Description |
|---|---|
| 1. You say | What you need — e.g. "Compare pricing for the top 5 CRM tools" |
| 2. The agent | Plans, visits 3–5 live websites, and “reads” pages with Gemini multimodal vision (screenshots only — no DOM scraping) |
| 3. You get | A sortable comparison table, CSV/HTML export, and Vera (ElevenLabs) reading the briefing aloud |
No DOM hacks, no site-specific APIs. Works across site redesigns. Backend is configured for deployment on Google Cloud Run.
- Natural language input — Describe your research task in plain English (e.g. compare pricing, features, or reviews).
- Multi-site research — Agent visits 3–5 live websites per task with no DOM scraping or site-specific APIs.
- Gemini vision — Screenshot-based page understanding; works across redesigns and any site.
- Comparison table — Sortable results with company, segment, pricing, and key details.
- Export — Download results as CSV or HTML.
- Spoken briefing (Vera) — ElevenLabs TTS reads the summary aloud.
- Interrupt + replan — During a run, you can submit a redirect instruction (text/voice); the agent queues it and replans on the next loop iteration.
- Fact verification — Perplexity-backed claim checks where relevant.
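The screenshot-only page understanding above can be sketched with the Google GenAI SDK. This is a minimal illustration, not Voyance's actual code: `build_vision_request` is a hypothetical helper and the prompt wording is an assumption.

```python
# Sketch: ask Gemini to "read" a page from a screenshot alone (no DOM access).
# build_vision_request is a hypothetical helper; the real prompt differs.
def build_vision_request(png_bytes: bytes, task: str) -> list:
    """Assemble multimodal parts: the screenshot plus an extraction instruction."""
    return [
        {"mime_type": "image/png", "data": png_bytes},
        f"From this screenshot, extract details relevant to: {task}. "
        "Reply as JSON with company, segment, pricing, and notes.",
    ]

# With the SDK (google-generativeai), the call would look like:
#   import google.generativeai as genai
#   genai.configure(api_key=os.environ["GEMINI_API_KEY"])
#   model = genai.GenerativeModel("gemini-2.0-flash")
#   response = model.generate_content(build_vision_request(screenshot, task))
```

Because the model only ever sees pixels, the same request works unchanged after a site redesign.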
Hero — Enter your research query and start the agent.
Output — Comparison table, CSV/HTML export, and Listen to Vera.
| Requirement | Voyance |
|---|---|
| Gemini model | Gemini 2.0 Flash (planning, screenshot analysis, synthesis) |
| Google GenAI SDK / ADK | Google GenAI SDK (google-generativeai): Gemini for planning, vision, synthesis. Custom agent loop (plan → navigate → extract → verify), not the ADK library. |
| Google Cloud service | Backend deployment target is Google Cloud Run (infra/cloudbuild.yaml, infra/main.tf) |
| UI Navigator | Screenshots analyzed by Gemini vision; agent outputs navigation and extraction actions |
Third-party: ElevenLabs (Vera TTS), Firecrawl (extraction), Perplexity (fact verification).
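The custom plan → navigate → extract → verify loop, including the redirect queue, can be sketched as follows. All function names here are illustrative stubs, not the contents of the real agent loop:

```python
# Sketch of the agent loop: replan at the top of each iteration if the user
# queued a redirect instruction mid-run. Step functions are injected stubs.
from collections import deque

def run_agent(task, plan_fn, step_fns, redirects: deque, max_iters=5):
    """Run planned sites through each step; adopt queued redirects between iterations."""
    results = []
    plan = plan_fn(task)
    for _ in range(max_iters):
        if redirects:                   # user interrupted with a new instruction
            task = redirects.popleft()  # adopt it and replan before continuing
            plan = plan_fn(task)
        if not plan:
            break
        record = plan.pop(0)            # next site to visit
        for step in step_fns:           # navigate, extract, verify
            record = step(record)
        results.append(record)
    return results
```

Checking the queue only at the loop boundary keeps each site visit atomic: a redirect never tears down a navigation in progress, it just changes what happens next.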
- Live backend URL: voyance-backend-712979751443.us-central1.run.app
- Judge artifact: Google-Cloud-Logs-Voyance.png (Cloud Run logs screenshot)
- Node.js 18+
- Python 3.10+
- API keys: Google AI Studio (Gemini), ElevenLabs, Firecrawl, Perplexity — see backend/.env.example
git clone https://github.com/ibtisamafzal/voyance.git
cd voyance
npm install
cd backend
pip install -r requirements.txt
playwright install chromium
cp .env.example .env
# Edit .env with your API keys
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
| Service | URL |
|---|---|
| Backend | http://localhost:8000 |
| API docs | http://localhost:8000/api/docs |
From the repo root (new terminal):
npm run dev
Frontend: http://localhost:5173
- Enter a query in the hero (e.g. "Compare pricing for top 5 CRM tools").
- Click Research — the agent plans, navigates, extracts, and verifies.
- In the Output section: sort the table, export CSV or HTML, and click Listen to Vera for the spoken briefing.
| Layer | Technology |
|---|---|
| AI & vision | Gemini 2.0 Flash |
| Browser | Playwright (headless Chromium), screenshot-based only |
| Extraction | Firecrawl API → Gemini vision fallback |
| Verification | Perplexity API |
| Voice | ElevenLabs TTS (Vera) |
| Backend | FastAPI, WebSockets; Google Cloud Run deployment target |
| Frontend | React, Vite, Tailwind |
| Infra | Docker, Cloud Build, Terraform (infra/) |
User and frontend → backend (Cloud Run target) → Gemini, Playwright, Firecrawl, Perplexity, ElevenLabs.
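The Firecrawl-first extraction with Gemini vision fallback can be sketched with injected callables. The function and field names are assumptions for illustration, not the actual service code:

```python
# Sketch: try fast structured extraction (Firecrawl) first; on failure or an
# empty result, fall back to screenshot analysis with Gemini vision.
def extract(url, firecrawl_fn, vision_fn):
    """Return extracted data tagged with which path produced it."""
    try:
        data = firecrawl_fn(url)
        if data:
            return {"source": "firecrawl", "data": data}
    except Exception:
        pass  # API error: fall through to the vision path
    return {"source": "vision", "data": vision_fn(url)}
```

Tagging the source makes it easy to audit how often the cheaper path succeeded versus how often the agent had to fall back to vision.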
From idea to implementation at a glance.
This mind map captures the core of Voyance for the Gemini Live Agent Challenge — from the problem and solution, through key features and technical stack, to user personas and submission requirements.
Copy backend/.env.example to backend/.env and set:
| Variable | Purpose |
|---|---|
| GEMINI_API_KEY | Google AI Studio |
| ELEVENLABS_API_KEY | Vera TTS |
| FIRECRAWL_API_KEY | Fast extraction |
| PERPLEXITY_API_KEY | Fact verification |
| GOOGLE_CLOUD_PROJECT | Optional (Firestore); in-memory fallback if unset |
| CONTACT_EMAIL | Contact form recipient email (set in server env) |
| CONTACT_EMAIL_APP_PASSWORD | Gmail App Password for SMTP contact form sending |
- Backend: Google Cloud Run. Deploy with `infra/cloudbuild.yaml` from the repo root: `gcloud builds submit --config=infra/cloudbuild.yaml .` Default: 1 GiB memory, 1 CPU (increase to 2 GiB if needed for Playwright).
- Frontend: Host on Vercel or any static host; set `VITE_API_URL` to your Cloud Run URL (no trailing slash).
Troubleshooting: Stuck on "Connecting…" → set VITE_API_URL on your host. WebSocket 403 → ensure no trailing slash in VITE_API_URL. OOM → increase memory in cloudbuild.yaml.
├── src/app/ # React frontend
│ ├── components/ # HeroSection, ResearchOutputSection, Navbar, etc.
│ └── context/ # ResearchContext (shared state)
├── backend/ # FastAPI backend
│ ├── app/
│ │ ├── agent.py # Research loop (plan → navigate → extract → verify)
│ │ ├── routers/ # Research, voice, health, sessions
│ │ └── services/ # Gemini, Firecrawl, Perplexity, Playwright, ElevenLabs
│ └── main.py
└── infra/ # GCP automation
├── cloudbuild.yaml # Build & deploy to Cloud Run
└── main.tf # Terraform
- Deep-dive blog: How We Built Voyance (DEV.to)
- Reddit build log: How we built Voyance — an AI agent that researches the web by “seeing” it
- Hackathon submission: Gemini Live Agent Challenge — UI Navigator track
- Source code: Voyance on GitHub
- GDG profile: g.dev/IbtisamAfzal
- Contact: Use the in-app contact form (/contact)
- LinkedIn: linkedin.com/in/ibtisamafzal
Blog: How We Built Voyance (DEV) · Hackathon: Gemini Live Agent Challenge (see Devpost for current schedule)
License: MIT




