Lobster Logo

Lobster

The World's First Native Live-Agent Browser
Voice-controlled autonomous web agent powered by Gemini Live API + Google ADK

Gemini Live Agent Challenge • Category: UI Navigator

Gemini Live API • Google ADK • Cloud Run • Firestore • Electron


Demo Video

🎬 Watch the 4-minute demo →

Shows real-time voice conversation, multi-tab autonomous browsing, and task completion — all hands-free.


What is Lobster?

Lobster is an AI-native web browser with a built-in live voice agent. You talk to it, it talks back — and it autonomously controls the browser to complete tasks for you.

Unlike browser extensions or copilots, Lobster is built from scratch as an Electron desktop app where the AI agent is a first-class citizen — it has its own tabs, its own vision, and works independently of what you're looking at.

Core innovation: Two-Brain Architecture combining Gemini Live API (real-time bidirectional voice) with Google ADK (autonomous browser executor) — the Conductor hears and speaks, the Executor sees and acts.

Key Capabilities

  • Always-on voice conversation — speak naturally, no push-to-talk, barge-in support
  • Autonomous browser control — agent navigates, clicks, types, scrolls in its own tabs
  • Vision-based understanding — screenshots + numbered DOM element map (100% accurate clicking)
  • Multi-tab parallel execution — "Compare prices on Amazon, eBay, and Walmart" → 3 tabs simultaneously
  • Background operation — agent works in background tabs while you browse freely
  • Scheduled tasks (Cron) — "Check Reddit every 5 minutes for new posts"
  • ReAct reasoning — see the agent's thought process in real-time

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     ELECTRON DESKTOP APP                     │
│                                                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                  │
│  │ User Tab │  │ Agent Tab│  │ Agent Tab│  ...              │
│  │ (active) │  │ (task 1) │  │ (task 2) │                  │
│  └──────────┘  └──────────┘  └──────────┘                  │
│       │              │              │                        │
│  ┌────┴──────────────┴──────────────┴─────────────────┐    │
│  │              React Chrome Bar                       │    │
│  │   [Tabs] [URL] [Voice Orb] [Chat] [Tasks]         │    │
│  └──────────────────┬─────────────────────────────────┘    │
│                     │ WebSocket                             │
└─────────────────────┼───────────────────────────────────────┘
                      │
┌─────────────────────┼───────────────────────────────────────┐
│          FASTAPI BACKEND (Google Cloud Run)                   │
│                     │                                        │
│  ┌──────────────────┴────────────────────────┐              │
│  │         🧠 CONDUCTOR (Brain 1)             │              │
│  │       Gemini Live API (Bidirectional)      │              │
│  │       gemini-2.5-flash-native-audio        │              │
│  │                                            │              │
│  │  • Real-time voice conversation            │              │
│  │  • Personality + memory                    │              │
│  │  • Task decomposition & delegation         │              │
│  └─────────────────┬──────────────────────────┘              │
│                    │ execute_user_intent()                    │
│  ┌─────────────────┴──────────────────────────┐              │
│  │         👁️ EXECUTOR (Brain 2)               │              │
│  │       Google ADK + Gemini Vision            │              │
│  │       gemini-2.5-flash                      │              │
│  │                                             │              │
│  │  • Screenshot + element map analysis        │              │
│  │  • Multi-step browser automation            │              │
│  │  • Visual verification of results           │              │
│  │  • Reports back to Conductor                │              │
│  └─────────────────────────────────────────────┘              │
│                                                              │
│  ┌──────────────────────────────────────────────┐            │
│  │          ☁️ GOOGLE CLOUD SERVICES              │            │
│  │  • Cloud Run — serverless backend hosting    │            │
│  │  • Firestore — session & conversation memory │            │
│  │  • Cloud Storage — screenshot archival       │            │
│  │  • Vertex AI — Gemini model access (prod)    │            │
│  └──────────────────────────────────────────────┘            │
└──────────────────────────────────────────────────────────────┘

Why Two Brains?

         Conductor                            Executor
Model    gemini-2.5-flash-native-audio        gemini-2.5-flash
SDK      Google GenAI SDK (Live API)          Google ADK (Agent Development Kit)
Mode     Bidirectional streaming              Request-response with vision
Role     Voice conversation + task routing    Browser automation + verification
Latency  Real-time (~200ms)                   Per-step (~1-3s)
Input    User's voice audio stream            Screenshots + DOM element map
Output   Speech audio + tool calls            Browser actions + status updates

The Conductor maintains a live voice conversation while the Executor works autonomously in background tabs. You can keep browsing while the agent works.
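As an illustrative sketch of this handoff (not the project's actual code), the Conductor can expose a single delegation tool in the Gemini function-calling schema shape; the handler below is a stub, and the exact schema, routing, and return values are assumptions:

```python
# Hypothetical sketch: the Conductor's single delegation tool, declared in
# the Gemini function-calling schema shape. The handler is a stub; in the
# real app it would spawn an agent tab and start the Executor's loop.

EXECUTE_USER_INTENT = {
    "name": "execute_user_intent",
    "description": "Delegate a browser task to the Executor, which runs it "
                   "autonomously in a dedicated agent tab.",
    "parameters": {
        "type": "object",
        "properties": {
            "intent": {
                "type": "string",
                "description": "Plain-language task to perform in the browser",
            },
        },
        "required": ["intent"],
    },
}

def handle_tool_call(name: str, args: dict) -> dict:
    """Route a tool call from the Conductor to the Executor (stubbed)."""
    if name == "execute_user_intent":
        # Real version: create an agent tab, kick off the Executor, and
        # stream status updates back so the Conductor can narrate progress.
        return {"status": "started", "intent": args["intent"]}
    return {"status": "unknown_tool"}
```

Because the tool returns immediately with a "started" status, the Conductor can acknowledge out loud while the Executor keeps working in the background.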


Google Cloud Services Used

Service            Purpose              Why
Cloud Run          Backend hosting      WebSocket support, session affinity, auto-scaling, serverless
Firestore          Session persistence  Conversation history, page memory for stateful browsing
Cloud Storage      Screenshot archive   Persists visual context across sessions
Vertex AI          Gemini model access  Production-grade Gemini 2.5 Flash + Native Audio
Artifact Registry  Container images     Docker image storage for Cloud Run
Cloud Build        CI/CD                Automated container builds on deploy

Quick Start

Option A: Download Pre-built Binary

Don't want to build from source? Download the standalone .exe from the Releases tab, add your API key, and run.

Option B: Build from Source

Prerequisites: Python 3 with pip, Node.js with npm, and a Gemini API key from AI Studio.

1. Clone & install

git clone https://github.com/ma1orek/Lobster.git
cd Lobster

2. Backend setup

cd backend
pip install -r requirements.txt

Create backend/.env:

GOOGLE_API_KEY=your-gemini-api-key-here

Start the backend:

uvicorn main:app --host 0.0.0.0 --port 8080

3. Electron app

cd electron
npm install
npm start

The browser window opens. Allow microphone access and start talking to Lobster!

Windows Quick Start

Double-click start.bat in the project root — it launches both backend and Electron app.


Cloud Deployment (Google Cloud Run)

Lobster includes full infrastructure-as-code for deploying the backend to Google Cloud Run.

One-click deploy

cd deploy
chmod +x deploy.sh
./deploy.sh

This script:

  1. Enables required GCP APIs (Cloud Run, Firestore, Storage, Vertex AI)
  2. Creates Firestore database for session persistence
  3. Creates Cloud Storage bucket for screenshot archival
  4. Builds and deploys the backend container to Cloud Run
  5. Outputs the WebSocket URL to configure in the Electron app

Terraform (alternative)

cd terraform
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your GCP project ID and API key
terraform init
terraform apply

Cloud Run Configuration

Setting           Value
Memory            1Gi
CPU               1
Min instances     1 (always warm)
Max instances     3
Timeout           3600s (1 hour — long-lived WebSocket)
Session affinity  Enabled

How It Works

The Agent Loop

When you say "Send a message to John on LinkedIn":

  1. Conductor hears your voice via Gemini Live API bidirectional streaming
  2. Conductor calls execute_user_intent("Send a message to John on LinkedIn")
  3. Backend creates a dedicated agent tab, navigates to LinkedIn
  4. Executor receives screenshot (vision) + element map → plans steps
  5. Executor calls tools: click_by_ref(ref=5), type_into_ref(ref=12, text="Hey John!")
  6. After each action, a new screenshot + fresh element map is captured
  7. Executor verifies completion visually, calls done(summary="Message sent")
  8. Conductor speaks the result: "Done! Message sent to John on LinkedIn!"

Element Map System

Instead of fragile CSS selectors, Lobster uses a numbered element reference system:

PAGE ELEMENTS:
#0  BTN "Send Message"
#1  INPUT "Search..." (placeholder)
#2  LINK "John Smith" href="/in/johnsmith"
#3  TEXTAREA "Write a message..." (contenteditable)

Each interactive DOM element gets a data-lobster-id attribute. The agent uses click_by_ref(ref=0) for 100% accurate clicking, with no coordinate guessing.
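A minimal sketch of how that prompt text could be rendered, assuming the backend receives the tagged elements as (tag, label) pairs in page order (the input shape here is an assumption, not the project's API):

```python
# Illustrative only: render the numbered element map the Executor sees.
# The (tag, label) tuple input is an assumed shape for this sketch.

def format_element_map(elements):
    """Produce lines like  #0  BTN "Send Message"  for the vision prompt."""
    lines = ["PAGE ELEMENTS:"]
    for ref, (tag, label) in enumerate(elements):
        lines.append(f'#{ref}  {tag} "{label}"')
    return "\n".join(lines)
```

For example, `format_element_map([("BTN", "Send Message")])` reproduces the `#0` line from the example above.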

Executor Tools

Tool                         Description
click_by_ref(ref)            Click element by map ID (primary)
type_into_ref(ref, text)     Type into input by map ID
navigate(url)                Open a URL
execute_js(code)             Run JavaScript on page
wait_for(condition, target)  Smart wait (page_load, network_idle, element_visible)
click(x, y)                  Click by coordinates (fallback for canvas)
scroll(direction)            Scroll page
drag(from, to)               Drag between coordinates
draw_path(points)            Draw on canvas
done(summary)                Signal task completion
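As one illustration, the smart-wait tool reduces to a polling loop. The README doesn't show the real implementation, so the function name, signature, and timings below are assumptions (named `wait_until` to avoid clashing with the tool's `wait_for(condition, target)` signature):

```python
import time

# Hypothetical polling core for a wait_for-style tool: block until a
# condition callable returns True, or give up after `timeout` seconds.

def wait_until(check, timeout=10.0, interval=0.25):
    """Poll check() until it succeeds or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False
```

Conditions like page_load or element_visible would each supply their own `check` callable.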

Background Screenshot Capture

Lobster captures screenshots from agent tabs even when not visible to the user:

  • Active tabs: Standard Electron capturePage() API
  • Background tabs: Chrome DevTools Protocol Page.captureScreenshot — no flickering
  • All tabs have backgroundThrottling: false to prevent Chromium from pausing rendering

Tech Stack

Frontend (Electron Desktop App)

Technology      Purpose
Electron 40     Desktop browser shell with WebContentsView
React 19        UI components (chrome bar, voice orb, task panel)
TypeScript      Type safety
Framer Motion   Animations and transitions
Tailwind CSS 4  Styling

Backend (Python)

Technology        Purpose
FastAPI           Web framework + WebSocket server
Google ADK        Agent Development Kit — Executor agent
Google GenAI SDK  Gemini Live API — Conductor voice
Pillow            Screenshot image processing
NumPy             Audio signal processing

AI Models (Gemini)

Model                                 Role
gemini-2.5-flash-native-audio-latest  Conductor — real-time voice (Live API)
gemini-2.5-flash                      Executor — browser automation with vision

Project Structure

Lobster/
├── backend/
│   ├── main.py              # FastAPI server, Conductor + Executor
│   ├── skills/              # Modular browser automation skills
│   ├── requirements.txt
│   ├── Dockerfile           # Cloud Run container
│   └── .env                 # GOOGLE_API_KEY (create this)
├── electron/
│   ├── src/
│   │   ├── index.ts         # Main process: tabs, screenshots, IPC
│   │   ├── App.tsx          # React UI: chrome bar, voice, chat
│   │   ├── preload.ts       # Context bridge
│   │   └── components/      # Aurora, VoiceOrb, TaskPanel, etc.
│   ├── package.json
│   └── forge.config.ts      # Electron Forge build config
├── terraform/               # GCP infrastructure-as-code
│   ├── main.tf
│   ├── variables.tf
│   └── outputs.tf
├── deploy/
│   ├── deploy.sh            # One-click Cloud Run deploy
│   └── cloudbuild.yaml
├── public/                  # Static assets (lobster.svg)
├── start.bat                # Windows quick-start
└── LICENSE                  # Apache 2.0

Environment Variables

Variable        Required    Description
GOOGLE_API_KEY  Yes         Gemini API key from AI Studio
GCP_PROJECT_ID  For deploy  Google Cloud project ID
FIRESTORE_DB    Optional    Firestore database name
GCS_BUCKET      Optional    Cloud Storage bucket for screenshots
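A sketch of how the backend might validate these at startup. Variable names match the table; the function itself and the Firestore default are assumptions, not the project's actual code:

```python
import os

# Hypothetical startup check for the variables in the table above.
# Only GOOGLE_API_KEY is hard-required; the others have optional uses.

def load_config():
    api_key = os.environ.get("GOOGLE_API_KEY")
    if not api_key:
        raise RuntimeError("GOOGLE_API_KEY is required (create one in AI Studio)")
    return {
        "api_key": api_key,
        "project_id": os.environ.get("GCP_PROJECT_ID"),    # only needed for deploys
        "firestore_db": os.environ.get("FIRESTORE_DB", "(default)"),  # assumed default
        "gcs_bucket": os.environ.get("GCS_BUCKET"),        # screenshot archival optional
    }
```

Failing fast on the one required key gives a clearer error than a mid-session Gemini authentication failure.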

Building the Executable

cd electron
npm run make

This produces a Windows .exe installer in electron/out/make/. For a portable ZIP:

npm run package
# Output in electron/out/lobster-browser-win32-x64/

License

Apache License 2.0 — Copyright 2026 Bartosz Idzik (@ma1orek)


Built with Gemini Live API + Google ADK for the Gemini Live Agent Challenge
#GeminiLiveAgentChallenge
