The World's First Native Live-Agent Browser
Voice-controlled autonomous web agent powered by Gemini Live API + Google ADK
Gemini Live Agent Challenge • Category: UI Navigator
Shows real-time voice conversation, multi-tab autonomous browsing, and task completion — all hands-free.
Lobster is an AI-native web browser with a built-in live voice agent. You talk to it, it talks back — and it autonomously controls the browser to complete tasks for you.
Unlike browser extensions or copilots, Lobster is built from scratch as an Electron desktop app where the AI agent is a first-class citizen — it has its own tabs, its own vision, and works independently of what you're looking at.
Core innovation: Two-Brain Architecture combining Gemini Live API (real-time bidirectional voice) with Google ADK (autonomous browser executor) — the Conductor hears and speaks, the Executor sees and acts.
- Always-on voice conversation — speak naturally, no push-to-talk, barge-in support
- Autonomous browser control — agent navigates, clicks, types, scrolls in its own tabs
- Vision-based understanding — screenshots + numbered DOM element map (100% accurate clicking)
- Multi-tab parallel execution — "Compare prices on Amazon, eBay, and Walmart" → 3 tabs simultaneously
- Background operation — agent works in background tabs while you browse freely
- Scheduled tasks (Cron) — "Check Reddit every 5 minutes for new posts"
- ReAct reasoning — see the agent's thought process in real-time
```
┌─────────────────────────────────────────────────────────────┐
│ ELECTRON DESKTOP APP │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ User Tab │ │ Agent Tab│ │ Agent Tab│ ... │
│ │ (active) │ │ (task 1) │ │ (task 2) │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │ │
│ ┌────┴──────────────┴──────────────┴─────────────────┐ │
│ │ React Chrome Bar │ │
│ │ [Tabs] [URL] [Voice Orb] [Chat] [Tasks] │ │
│ └──────────────────┬─────────────────────────────────┘ │
│ │ WebSocket │
└─────────────────────┼───────────────────────────────────────┘
│
┌─────────────────────┼───────────────────────────────────────┐
│ FASTAPI BACKEND (Google Cloud Run) │
│ │ │
│ ┌──────────────────┴────────────────────────┐ │
│ │ 🧠 CONDUCTOR (Brain 1) │ │
│ │ Gemini Live API (Bidirectional) │ │
│ │ gemini-2.5-flash-native-audio │ │
│ │ │ │
│ │ • Real-time voice conversation │ │
│ │ • Personality + memory │ │
│ │ • Task decomposition & delegation │ │
│ └─────────────────┬──────────────────────────┘ │
│ │ execute_user_intent() │
│ ┌─────────────────┴──────────────────────────┐ │
│ │ 👁️ EXECUTOR (Brain 2) │ │
│ │ Google ADK + Gemini Vision │ │
│ │ gemini-2.5-flash │ │
│ │ │ │
│ │ • Screenshot + element map analysis │ │
│ │ • Multi-step browser automation │ │
│ │ • Visual verification of results │ │
│ │ • Reports back to Conductor │ │
│ └─────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ ☁️ GOOGLE CLOUD SERVICES │ │
│ │ • Cloud Run — serverless backend hosting │ │
│ │ • Firestore — session & conversation memory │ │
│ │ • Cloud Storage — screenshot archival │ │
│ │ • Vertex AI — Gemini model access (prod) │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
```
| | Conductor | Executor |
|---|---|---|
| Model | `gemini-2.5-flash-native-audio` | `gemini-2.5-flash` |
| SDK | Google GenAI SDK (Live API) | Google ADK (Agent Development Kit) |
| Mode | Bidirectional streaming | Request-response with vision |
| Role | Voice conversation + task routing | Browser automation + verification |
| Latency | Real-time (~200ms) | Per-step (~1-3s) |
| Input | User's voice audio stream | Screenshots + DOM element map |
| Output | Speech audio + tool calls | Browser actions + status updates |
The Conductor maintains a live voice conversation while the Executor works autonomously in background tabs. You can keep browsing while the agent works.
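A minimal sketch of that hand-off, with assumed wiring (the real version lives in `backend/main.py`): the Conductor exposes `execute_user_intent()` as a tool — that name comes from Lobster — while `executor_run()` here is a stand-in for the ADK Executor, launched as a background `asyncio` task so the voice stream is never blocked.

```python
import asyncio

async def executor_run(task: str) -> str:
    """Stand-in for the ADK Executor: browse, act, verify, summarize."""
    await asyncio.sleep(0)  # the real version performs multi-step browser actions
    return f"done: {task}"

async def execute_user_intent(intent: str) -> str:
    """Tool the Conductor's Live API session calls when it hears a task."""
    task = asyncio.create_task(executor_run(intent))  # runs off the voice path
    return await task  # a real build could stream progress instead of awaiting

result = asyncio.run(execute_user_intent("Send a message to John on LinkedIn"))
```

Creating the Executor as a separate task is what lets the Conductor keep talking while browser work happens in background tabs.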
| Service | Purpose | Why |
|---|---|---|
| Cloud Run | Backend hosting | WebSocket support, session affinity, auto-scaling, serverless |
| Firestore | Session persistence | Conversation history, page memory for stateful browsing |
| Cloud Storage | Screenshot archive | Persists visual context across sessions |
| Vertex AI | Gemini model access | Production-grade Gemini 2.5 Flash + Native Audio |
| Artifact Registry | Container images | Docker image storage for Cloud Run |
| Cloud Build | CI/CD | Automated container builds on deploy |
Don't want to build from source? Download the standalone `.exe` from the Releases tab, add your API key, and run.
- Node.js 18+ and npm
- Python 3.12+
- Google AI API key — get one at aistudio.google.com
```bash
git clone https://github.com/ma1orek/Lobster.git
cd Lobster
```

Set up the backend:

```bash
cd backend
pip install -r requirements.txt
```

Create `backend/.env`:

```
GOOGLE_API_KEY=your-gemini-api-key-here
```

Start the backend:

```bash
uvicorn main:app --host 0.0.0.0 --port 8080
```

Start the Electron app:

```bash
cd electron
npm install
npm start
```

The browser window opens. Allow microphone access and start talking to Lobster!
Double-click `start.bat` in the project root — it launches both the backend and the Electron app.
Lobster includes full infrastructure-as-code for deploying the backend to Google Cloud Run.
```bash
cd deploy
chmod +x deploy.sh
./deploy.sh
```

This script:
- Enables required GCP APIs (Cloud Run, Firestore, Storage, Vertex AI)
- Creates Firestore database for session persistence
- Creates Cloud Storage bucket for screenshot archival
- Builds and deploys the backend container to Cloud Run
- Outputs the WebSocket URL to configure in the Electron app
```bash
cd terraform
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your GCP project ID and API key
terraform init
terraform apply
```

| Setting | Value |
|---|---|
| Memory | 1 Gi |
| CPU | 1 |
| Min instances | 1 (always warm) |
| Max instances | 3 |
| Timeout | 3600s (1 hour — long-lived WebSocket) |
| Session affinity | Enabled |
When you say "Send a message to John on LinkedIn":
1. Conductor hears your voice via Gemini Live API bidirectional streaming
2. Conductor calls `execute_user_intent("Send a message to John on LinkedIn")`
3. Backend creates a dedicated agent tab and navigates to LinkedIn
4. Executor receives a screenshot (vision) + element map → plans steps
5. Executor calls tools: `click_by_ref(ref=5)`, `type_into_ref(ref=12, text="Hey John!")`
6. After each action, a new screenshot + fresh element map is captured
7. Executor verifies completion visually and calls `done(summary="Message sent")`
8. Conductor speaks the result: "Done! Message sent to John on LinkedIn!"
Instead of fragile CSS selectors, Lobster uses a numbered element reference system:
```
PAGE ELEMENTS:
#0 BTN "Send Message"
#1 INPUT "Search..." (placeholder)
#2 LINK "John Smith" href="/in/johnsmith"
#3 TEXTAREA "Write a message..." (contenteditable)
```
Each interactive DOM element gets a `data-lobster-id` attribute. The agent uses `click_by_ref(ref=0)` — 100% accurate, no coordinate guessing.
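A sketch of how a numbered ref could resolve to a DOM query through that `data-lobster-id` attribute — the exact selector and injection shape here are assumptions, not Lobster's verbatim implementation:

```python
def selector_for_ref(ref: int) -> str:
    """CSS selector targeting the element tagged with this map ID."""
    return f'[data-lobster-id="{ref}"]'

def click_js_for_ref(ref: int) -> str:
    """JavaScript the backend could inject into a tab to click element #ref."""
    return f"document.querySelector('{selector_for_ref(ref)}').click()"
```

Because the attribute is stamped onto the live DOM at map-build time, the ref stays valid until the page changes — unlike hand-written CSS selectors that break when class names shift.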
| Tool | Description |
|---|---|
| `click_by_ref(ref)` | Click element by map ID (primary) |
| `type_into_ref(ref, text)` | Type into input by map ID |
| `navigate(url)` | Open a URL |
| `execute_js(code)` | Run JavaScript on page |
| `wait_for(condition, target)` | Smart wait (`page_load`, `network_idle`, `element_visible`) |
| `click(x, y)` | Click by coordinates (fallback for canvas) |
| `scroll(direction)` | Scroll page |
| `drag(from, to)` | Drag between coordinates |
| `draw_path(points)` | Draw on canvas |
| `done(summary)` | Signal task completion |
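When the model emits one of these tool calls, the backend has to route it to the right handler. A hedged sketch of that dispatch (in the real project the tools are registered with Google ADK, so the registry shape here is an assumption):

```python
def click_by_ref(ref: int) -> str:
    return f"clicked #{ref}"            # stand-in for the real browser action

def type_into_ref(ref: int, text: str) -> str:
    return f"typed {text!r} into #{ref}"

def navigate(url: str) -> str:
    return f"navigated to {url}"

# Name -> handler registry; ADK builds the equivalent from tool declarations.
TOOLS = {
    "click_by_ref": click_by_ref,
    "type_into_ref": type_into_ref,
    "navigate": navigate,
}

def dispatch(call: dict) -> str:
    """Route one model-emitted tool call to its handler by name."""
    return TOOLS[call["tool"]](**call["args"])
```

For example, `dispatch({"tool": "click_by_ref", "args": {"ref": 5}})` runs the click handler and returns its status string, which is fed back to the model as the tool result.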
Lobster captures screenshots from agent tabs even when not visible to the user:
- Active tabs: standard Electron `capturePage()` API
- Background tabs: Chrome DevTools Protocol `Page.captureScreenshot` — no flickering
- All tabs set `backgroundThrottling: false` to prevent Chromium from pausing rendering
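For the background-tab path, `Page.captureScreenshot` is a real CDP method; a simplified sketch of the JSON message a backend could send over the DevTools WebSocket, and of decoding the reply (the session plumbing is omitted):

```python
import base64
import json

def capture_screenshot_message(msg_id: int = 1) -> str:
    """Build the CDP request for a PNG screenshot of the current viewport."""
    return json.dumps({
        "id": msg_id,
        "method": "Page.captureScreenshot",
        "params": {"format": "png", "captureBeyondViewport": False},
    })

def decode_screenshot(response: str) -> bytes:
    """CDP returns the image base64-encoded in result.data."""
    return base64.b64decode(json.loads(response)["result"]["data"])
```

Because CDP talks to the renderer directly, the tab never has to be focused or painted on screen — which is why this path avoids the flicker that focus-and-capture approaches cause.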
| Technology | Purpose |
|---|---|
| Electron 40 | Desktop browser shell with WebContentsView |
| React 19 | UI components (chrome bar, voice orb, task panel) |
| TypeScript | Type safety |
| Framer Motion | Animations and transitions |
| Tailwind CSS 4 | Styling |
| Technology | Purpose |
|---|---|
| FastAPI | Web framework + WebSocket server |
| Google ADK | Agent Development Kit — Executor agent |
| Google GenAI SDK | Gemini Live API — Conductor voice |
| Pillow | Screenshot image processing |
| NumPy | Audio signal processing |
| Model | Role |
|---|---|
| `gemini-2.5-flash-native-audio-latest` | Conductor — real-time voice (Live API) |
| `gemini-2.5-flash` | Executor — browser automation with vision |
```
Lobster/
├── backend/
│   ├── main.py           # FastAPI server, Conductor + Executor
│   ├── skills/           # Modular browser automation skills
│   ├── requirements.txt
│   ├── Dockerfile        # Cloud Run container
│   └── .env              # GOOGLE_API_KEY (create this)
├── electron/
│   ├── src/
│   │   ├── index.ts      # Main process: tabs, screenshots, IPC
│   │   ├── App.tsx       # React UI: chrome bar, voice, chat
│   │   ├── preload.ts    # Context bridge
│   │   └── components/   # Aurora, VoiceOrb, TaskPanel, etc.
│   ├── package.json
│   └── forge.config.ts   # Electron Forge build config
├── terraform/            # GCP infrastructure-as-code
│   ├── main.tf
│   ├── variables.tf
│   └── outputs.tf
├── deploy/
│   ├── deploy.sh         # One-click Cloud Run deploy
│   └── cloudbuild.yaml
├── public/               # Static assets (lobster.svg)
├── start.bat             # Windows quick-start
└── LICENSE               # Apache 2.0
```
| Variable | Required | Description |
|---|---|---|
| `GOOGLE_API_KEY` | Yes | Gemini API key from AI Studio |
| `GCP_PROJECT_ID` | For deploy | Google Cloud project ID |
| `FIRESTORE_DB` | Optional | Firestore database name |
| `GCS_BUCKET` | Optional | Cloud Storage bucket for screenshots |
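A sketch of how the backend might read this configuration at startup — the variable names come from the table above, but the defaults and the hard failure on a missing key are assumptions, not Lobster's exact behavior:

```python
import os

def load_config() -> dict:
    """Read Lobster's environment variables; only GOOGLE_API_KEY is required."""
    key = os.environ.get("GOOGLE_API_KEY")
    if not key:
        raise RuntimeError("GOOGLE_API_KEY is required (see backend/.env)")
    return {
        "api_key": key,
        "project_id": os.environ.get("GCP_PROJECT_ID"),        # deploy only
        "firestore_db": os.environ.get("FIRESTORE_DB", "(default)"),
        "gcs_bucket": os.environ.get("GCS_BUCKET"),            # None = no archive
    }
```

Failing fast on the missing required key gives a clearer error than letting the first Gemini call reject an empty credential.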
```bash
cd electron
npm run make
```

This produces a Windows `.exe` installer in `electron/out/make/`. For a portable ZIP:

```bash
npm run package
# Output in electron/out/lobster-browser-win32-x64/
```

Apache License 2.0 — Copyright 2026 Bartosz Idzik (@ma1orek)
Built with Gemini Live API + Google ADK for the Gemini Live Agent Challenge
#GeminiLiveAgentChallenge