Lobster Logo

Lobster

The World's First Native Live-Agent Browser
Voice-controlled autonomous web agent powered by Gemini Live API + Google ADK

Gemini Live Agent Challenge • Category: UI Navigator

Gemini Live API • Google ADK • Cloud Run • Firestore • Electron


Demo Video

🎬 Watch the 4-minute demo →

Shows real-time voice conversation, multi-tab autonomous browsing, and task completion — all hands-free.


What is Lobster?

Lobster is an AI-native web browser with a built-in live voice agent. You talk to it, it talks back — and it autonomously controls the browser to complete tasks for you.

Unlike browser extensions or copilots, Lobster is built from scratch as an Electron desktop app where the AI agent is a first-class citizen — it has its own tabs, its own vision, and works independently of what you're looking at.

Core innovation: Two-Brain Architecture combining Gemini Live API (real-time bidirectional voice) with Google ADK (autonomous browser executor) — the Conductor hears and speaks, the Executor sees and acts.

Key Capabilities

  • Always-on voice conversation — speak naturally, no push-to-talk, barge-in support
  • Autonomous browser control — agent navigates, clicks, types, scrolls in its own tabs
  • Vision-based understanding — screenshots + numbered DOM element map (100% accurate clicking)
  • Multi-tab parallel execution — "Compare prices on Amazon, eBay, and Walmart" → 3 tabs simultaneously
  • Background operation — agent works in background tabs while you browse freely
  • Scheduled tasks (Cron) — "Check Reddit every 5 minutes for new posts"
  • ReAct reasoning — see the agent's thought process in real-time

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     ELECTRON DESKTOP APP                     │
│                                                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                  │
│  │ User Tab │  │ Agent Tab│  │ Agent Tab│  ...              │
│  │ (active) │  │ (task 1) │  │ (task 2) │                  │
│  └──────────┘  └──────────┘  └──────────┘                  │
│       │              │              │                        │
│  ┌────┴──────────────┴──────────────┴─────────────────┐    │
│  │              React Chrome Bar                       │    │
│  │   [Tabs] [URL] [Voice Orb] [Chat] [Tasks]         │    │
│  └──────────────────┬─────────────────────────────────┘    │
│                     │ WebSocket                             │
└─────────────────────┼───────────────────────────────────────┘
                      │
┌─────────────────────┼───────────────────────────────────────┐
│          FASTAPI BACKEND (Google Cloud Run)                   │
│                     │                                        │
│  ┌──────────────────┴────────────────────────┐              │
│  │         🧠 CONDUCTOR (Brain 1)             │              │
│  │       Gemini Live API (Bidirectional)      │              │
│  │       gemini-2.5-flash-native-audio        │              │
│  │                                            │              │
│  │  • Real-time voice conversation            │              │
│  │  • Personality + memory                    │              │
│  │  • Task decomposition & delegation         │              │
│  └─────────────────┬──────────────────────────┘              │
│                    │ execute_user_intent()                    │
│  ┌─────────────────┴──────────────────────────┐              │
│  │         👁️ EXECUTOR (Brain 2)               │              │
│  │       Google ADK + Gemini Vision            │              │
│  │       gemini-2.5-flash                      │              │
│  │                                             │              │
│  │  • Screenshot + element map analysis        │              │
│  │  • Multi-step browser automation            │              │
│  │  • Visual verification of results           │              │
│  │  • Reports back to Conductor                │              │
│  └─────────────────────────────────────────────┘              │
│                                                              │
│  ┌──────────────────────────────────────────────┐            │
│  │          ☁️ GOOGLE CLOUD SERVICES              │            │
│  │  • Cloud Run — serverless backend hosting    │            │
│  │  • Firestore — session & conversation memory │            │
│  │  • Cloud Storage — screenshot archival       │            │
│  │  • Vertex AI — Gemini model access (prod)    │            │
│  └──────────────────────────────────────────────┘            │
└──────────────────────────────────────────────────────────────┘

Why Two Brains?

         Conductor                            Executor
Model    gemini-2.5-flash-native-audio        gemini-2.5-flash
SDK      Google GenAI SDK (Live API)          Google ADK (Agent Development Kit)
Mode     Bidirectional streaming              Request-response with vision
Role     Voice conversation + task routing    Browser automation + verification
Latency  Real-time (~200ms)                   Per-step (~1-3s)
Input    User's voice audio stream            Screenshots + DOM element map
Output   Speech audio + tool calls            Browser actions + status updates

The Conductor maintains a live voice conversation while the Executor works autonomously in background tabs. You can keep browsing while the agent works.
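As an illustrative sketch of this handoff (not the project's actual code), the Conductor can expose a single delegation tool in the Gemini function-calling schema shape; the handler below is a stub, and the exact schema, routing, and return values are assumptions:

```python
# Hypothetical sketch: the Conductor's single delegation tool, declared in
# the Gemini function-calling schema shape. The handler is a stub; in the
# real app it would spawn an agent tab and start the Executor's loop.

EXECUTE_USER_INTENT = {
    "name": "execute_user_intent",
    "description": "Delegate a browser task to the Executor, which runs it "
                   "autonomously in a dedicated agent tab.",
    "parameters": {
        "type": "object",
        "properties": {
            "intent": {
                "type": "string",
                "description": "Plain-language task to perform in the browser",
            },
        },
        "required": ["intent"],
    },
}

def handle_tool_call(name: str, args: dict) -> dict:
    """Route a tool call from the Conductor to the Executor (stubbed)."""
    if name == "execute_user_intent":
        # Real version: create an agent tab, kick off the Executor, and
        # stream status updates back so the Conductor can narrate progress.
        return {"status": "started", "intent": args["intent"]}
    return {"status": "unknown_tool"}
```

Because the tool returns immediately with a "started" status, the Conductor can acknowledge out loud while the Executor keeps working in the background.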


Google Cloud Services Used

Service            Purpose              Why
Cloud Run          Backend hosting      WebSocket support, session affinity, auto-scaling, serverless
Firestore          Session persistence  Conversation history, page memory for stateful browsing
Cloud Storage      Screenshot archive   Persists visual context across sessions
Vertex AI          Gemini model access  Production-grade Gemini 2.5 Flash + Native Audio
Artifact Registry  Container images     Docker image storage for Cloud Run
Cloud Build        CI/CD                Automated container builds on deploy

Quick Start

Option A: Download Pre-built Binary

Don't want to build from source? Download the standalone .exe from the Releases tab, add your API key, and run.

Option B: Build from Source

Prerequisites: Python 3 with pip, Node.js with npm, and a Gemini API key from AI Studio.

1. Clone & install

git clone https://github.com/ma1orek/Lobster.git
cd Lobster

2. Backend setup

cd backend
pip install -r requirements.txt

Create backend/.env:

GOOGLE_API_KEY=your-gemini-api-key-here

Start the backend:

uvicorn main:app --host 0.0.0.0 --port 8080

3. Electron app

cd electron
npm install
npm start

The browser window opens. Allow microphone access and start talking to Lobster!

Windows Quick Start

Double-click start.bat in the project root — it launches both backend and Electron app.


Cloud Deployment (Google Cloud Run)

Lobster includes full infrastructure-as-code for deploying the backend to Google Cloud Run.

One-click deploy

cd deploy
chmod +x deploy.sh
./deploy.sh

This script:

  1. Enables required GCP APIs (Cloud Run, Firestore, Storage, Vertex AI)
  2. Creates Firestore database for session persistence
  3. Creates Cloud Storage bucket for screenshot archival
  4. Builds and deploys the backend container to Cloud Run
  5. Outputs the WebSocket URL to configure in the Electron app

Terraform (alternative)

cd terraform
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your GCP project ID and API key
terraform init
terraform apply

Cloud Run Configuration

Setting           Value
Memory            1Gi
CPU               1
Min instances     1 (always warm)
Max instances     3
Timeout           3600s (1 hour — long-lived WebSocket)
Session affinity  Enabled

How It Works

The Agent Loop

When you say "Send a message to John on LinkedIn":

  1. Conductor hears your voice via Gemini Live API bidirectional streaming
  2. Conductor calls execute_user_intent("Send a message to John on LinkedIn")
  3. Backend creates a dedicated agent tab, navigates to LinkedIn
  4. Executor receives screenshot (vision) + element map → plans steps
  5. Executor calls tools: click_by_ref(ref=5), type_into_ref(ref=12, text="Hey John!")
  6. After each action, a new screenshot + fresh element map is captured
  7. Executor verifies completion visually, calls done(summary="Message sent")
  8. Conductor speaks the result: "Done! Message sent to John on LinkedIn!"

Element Map System

Instead of fragile CSS selectors, Lobster uses a numbered element reference system:

PAGE ELEMENTS:
#0  BTN "Send Message"
#1  INPUT "Search..." (placeholder)
#2  LINK "John Smith" href="/in/johnsmith"
#3  TEXTAREA "Write a message..." (contenteditable)

Each interactive DOM element gets a data-lobster-id attribute. The agent uses click_by_ref(ref=0) for 100% accurate clicking, with no coordinate guessing.
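A minimal sketch of how that prompt text could be rendered, assuming the backend receives the tagged elements as (tag, label) pairs in page order (the input shape here is an assumption, not the project's API):

```python
# Illustrative only: render the numbered element map the Executor sees.
# The (tag, label) tuple input is an assumed shape for this sketch.

def format_element_map(elements):
    """Produce lines like  #0  BTN "Send Message"  for the vision prompt."""
    lines = ["PAGE ELEMENTS:"]
    for ref, (tag, label) in enumerate(elements):
        lines.append(f'#{ref}  {tag} "{label}"')
    return "\n".join(lines)
```

For example, `format_element_map([("BTN", "Send Message")])` reproduces the `#0` line from the example above.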

Executor Tools

Tool                         Description
click_by_ref(ref)            Click element by map ID (primary)
type_into_ref(ref, text)     Type into input by map ID
navigate(url)                Open a URL
execute_js(code)             Run JavaScript on page
wait_for(condition, target)  Smart wait (page_load, network_idle, element_visible)
click(x, y)                  Click by coordinates (fallback for canvas)
scroll(direction)            Scroll page
drag(from, to)               Drag between coordinates
draw_path(points)            Draw on canvas
done(summary)                Signal task completion
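As one illustration, the smart-wait tool reduces to a polling loop. The README doesn't show the real implementation, so the function name, signature, and timings below are assumptions (named `wait_until` to avoid clashing with the tool's `wait_for(condition, target)` signature):

```python
import time

# Hypothetical polling core for a wait_for-style tool: block until a
# condition callable returns True, or give up after `timeout` seconds.

def wait_until(check, timeout=10.0, interval=0.25):
    """Poll check() until it succeeds or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False
```

Conditions like page_load or element_visible would each supply their own `check` callable.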

Background Screenshot Capture

Lobster captures screenshots from agent tabs even when not visible to the user:

  • Active tabs: Standard Electron capturePage() API
  • Background tabs: Chrome DevTools Protocol Page.captureScreenshot — no flickering
  • All tabs have backgroundThrottling: false to prevent Chromium from pausing rendering

Tech Stack

Frontend (Electron Desktop App)

Technology      Purpose
Electron 40     Desktop browser shell with WebContentsView
React 19        UI components (chrome bar, voice orb, task panel)
TypeScript      Type safety
Framer Motion   Animations and transitions
Tailwind CSS 4  Styling

Backend (Python)

Technology        Purpose
FastAPI           Web framework + WebSocket server
Google ADK        Agent Development Kit — Executor agent
Google GenAI SDK  Gemini Live API — Conductor voice
Pillow            Screenshot image processing
NumPy             Audio signal processing

AI Models (Gemini)

Model                                 Role
gemini-2.5-flash-native-audio-latest  Conductor — real-time voice (Live API)
gemini-2.5-flash                      Executor — browser automation with vision

Project Structure

Lobster/
├── backend/
│   ├── main.py              # FastAPI server, Conductor + Executor
│   ├── skills/              # Modular browser automation skills
│   ├── requirements.txt
│   ├── Dockerfile           # Cloud Run container
│   └── .env                 # GOOGLE_API_KEY (create this)
├── electron/
│   ├── src/
│   │   ├── index.ts         # Main process: tabs, screenshots, IPC
│   │   ├── App.tsx          # React UI: chrome bar, voice, chat
│   │   ├── preload.ts       # Context bridge
│   │   └── components/      # Aurora, VoiceOrb, TaskPanel, etc.
│   ├── package.json
│   └── forge.config.ts      # Electron Forge build config
├── terraform/               # GCP infrastructure-as-code
│   ├── main.tf
│   ├── variables.tf
│   └── outputs.tf
├── deploy/
│   ├── deploy.sh            # One-click Cloud Run deploy
│   └── cloudbuild.yaml
├── public/                  # Static assets (lobster.svg)
├── start.bat                # Windows quick-start
└── LICENSE                  # Apache 2.0

Environment Variables

Variable        Required    Description
GOOGLE_API_KEY  Yes         Gemini API key from AI Studio
GCP_PROJECT_ID  For deploy  Google Cloud project ID
FIRESTORE_DB    Optional    Firestore database name
GCS_BUCKET      Optional    Cloud Storage bucket for screenshots
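A sketch of how the backend might validate these at startup. Variable names match the table; the function itself and the Firestore default are assumptions, not the project's actual code:

```python
import os

# Hypothetical startup check for the variables in the table above.
# Only GOOGLE_API_KEY is hard-required; the others have optional uses.

def load_config():
    api_key = os.environ.get("GOOGLE_API_KEY")
    if not api_key:
        raise RuntimeError("GOOGLE_API_KEY is required (create one in AI Studio)")
    return {
        "api_key": api_key,
        "project_id": os.environ.get("GCP_PROJECT_ID"),    # only needed for deploys
        "firestore_db": os.environ.get("FIRESTORE_DB", "(default)"),  # assumed default
        "gcs_bucket": os.environ.get("GCS_BUCKET"),        # screenshot archival optional
    }
```

Failing fast on the one required key gives a clearer error than a mid-session Gemini authentication failure.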

Building the Executable

cd electron
npm run make

This produces a Windows .exe installer in electron/out/make/. For a portable ZIP:

npm run package
# Output in electron/out/lobster-browser-win32-x64/

License

Apache License 2.0 — Copyright 2026 Bartosz Idzik (@ma1orek)


Built with Gemini Live API + Google ADK for the Gemini Live Agent Challenge
#GeminiLiveAgentChallenge
