OpenBrowser is a multimodal browser agent for real web tasks.
It treats browser automation as a visual and interactive systems problem, not just a DOM parsing problem. Browsers are among the most complex pieces of software most people use every day. Reading the DOM can help, but understanding the DOM is not the same thing as actually operating the page. The long-term direction we believe in is multimodal control, or at least a strongly hybrid approach.
OpenBrowser is built around that view:
- Operate pages visually through screenshots and direct browser actions
- Keep browser execution isolated from the control window
- Evaluate continuously on mocked sites and real workflows
- Treat model cost as a first-class engineering constraint
Note: OpenBrowser currently supports Chrome only, through a Chrome extension. Development and evaluation are mainly done with dashscope/qwen3.5-plus and dashscope/qwen3.5-flash.
This demo is a better representation of what OpenBrowser is trying to do than a benchmark replay. The agent searches Zillow for one-bedroom rentals in Capitol Hill, Seattle, opens and compares multiple listings, judges brightness, cleanliness, practicality, and value from the listing photos, and then produces a shortlist.
Task prompt:
Find the best 3 one-bedroom apartment rentals in Capitol Hill, Seattle on Zillow.
Prioritize places that look bright, clean, practical, and close to everyday city life. Avoid units that look dark, cramped, outdated, cluttered, or overpriced for what they offer.
Browse multiple listings (view at least 10, for better candidates), compare them visually, and return the best 3 choices with:
- a one-sentence reason,
- the rent,
- the listing link.
Watch full video: recording_zillow.webm
What this demo shows:
- Visual judgment, not just text extraction: lighting, cleanliness, layout practicality, and overall value
- Real browser-side interaction: search, open listings, compare candidates, and inspect details
- Multi-step decision making across a larger candidate set
- End-to-end output with reasons, rents, and listing links
The browser is already one of the most complicated software environments in industry: dynamic layouts, asynchronous state, popups, tab switches, scrolling containers, partial rendering, and noisy visual context all show up in routine tasks.
Humans operate browsers by looking at the page and using the mouse and keyboard. Current models still need engineering help to do that reliably, but the native control loop is still visual. That is why OpenBrowser treats screenshots and interaction primitives as central.
DOM-heavy systems such as PinchTab or OpenClaw Browser Relay can work well today, and in some tasks they may be faster or more accurate than a multimodal pipeline. But DOM understanding is not the same as being able to operate a page robustly. Our view is that the best long-term browser agent will be multimodal, or at least strongly hybrid.
OpenBrowser is not iterated by vibe alone. The repo includes mocked websites with event tracking under eval/, and meaningful changes are checked against that evaluation suite. Failed real-world behaviors become new evaluation cases.
Model capability matters, but so does price. We do not assume token costs stay cheap forever. OpenBrowser is developed with that constraint in mind, including separate handling for stronger and cheaper models.
The primary evaluation signal in this repo is the latest checked-in report:
The test set is a series of local mock websites in eval/ that simulate realistic browser tasks and record structured interaction events.
That snapshot was generated on 2026-03-30 11:17:06 and evaluates OpenBrowser on 12 tracked browser tasks across two models. We care about three things first:
- Correctness: pass/fail plus task-score coverage
- Efficiency: average execution time
- Cost: average RMB cost per task
Current snapshot:
- Overall: 24/24 runs passed, 100% pass rate
- dashscope/qwen3.5-flash: 12/12 passed, 68.5/68.5 task score, 114.89s average duration, 0.075442 RMB average cost
- dashscope/qwen3.5-plus: 12/12 passed, 67.5/68.5 task score, 149.63s average duration, 0.291952 RMB average cost
| Model | Correctness | Avg. Time | Avg. Cost (RMB) | Composite Score |
|---|---|---|---|---|
| dashscope/qwen3.5-flash | 12/12 passed, 68.5/68.5 | 114.89s | 0.075442 | 0.9358 |
| dashscope/qwen3.5-plus | 12/12 passed, 67.5/68.5 | 149.63s | 0.291952 | 0.8774 |
On the current suite, qwen3.5-flash is the better efficiency-cost point: it keeps the same 100% pass rate, while being about 23.2% faster and 74.2% cheaper than qwen3.5-plus. qwen3.5-plus still remains useful as a stronger fallback profile for harder visual workflows, but the repo's current default evaluation story is no longer "benchmark comparison against OpenClaw"; it is "how well our latest stack scores on correctness, speed, and cost."
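The two percentages follow directly from the snapshot averages. A minimal sketch of the arithmetic is shown here; the `relative_saving` helper is a hypothetical name introduced for illustration and is not part of the repo.

```python
# Average duration (seconds) and cost (RMB) per task, copied from the snapshot.
flash = {"duration_s": 114.89, "cost_rmb": 0.075442}
plus = {"duration_s": 149.63, "cost_rmb": 0.291952}


def relative_saving(baseline: float, candidate: float) -> float:
    """Return how much smaller `candidate` is than `baseline`, in percent."""
    return (baseline - candidate) / baseline * 100


time_saving = relative_saving(plus["duration_s"], flash["duration_s"])
cost_saving = relative_saving(plus["cost_rmb"], flash["cost_rmb"])

print(f"flash uses {time_saving:.1f}% less time per task")  # 23.2
print(f"flash costs {cost_saving:.1f}% less per task")  # 74.2
```

"Faster" here means a shorter average duration per task. Expressed as a speedup ratio, the same numbers give roughly 149.63 / 114.89 ≈ 1.30x.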
Older side-by-side comparisons with OpenClaw are kept only as archived context:
Those archived results are still useful for historical tradeoff discussion, but they are not the main metric we optimize against now.
# List available tests
python eval/evaluate_browser_agent.py --list
# Set the browser capability token once
export OPENBROWSER_CHROME_UUID=YOUR_BROWSER_UUID
# Run one test with a configured LLM alias
python eval/evaluate_browser_agent.py --test techforum --model-alias default
# Run all tests with multiple configured aliases
python eval/evaluate_browser_agent.py --model-alias plus --model-alias flash
# Or pass the browser UUID explicitly per run
python eval/evaluate_browser_agent.py --test techforum --chrome-uuid YOUR_BROWSER_UUID --model-alias default

--model-alias must match an LLM alias configured in the OpenBrowser web UI, such as default, plus, or flash.
See AGENTS.md for evaluation framework documentation.
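When the evaluation is driven from another script, the command line can be assembled programmatically. The sketch below uses only the flags shown in the commands above (`--test`, `--chrome-uuid`, `--model-alias`); the `build_eval_command` helper itself is hypothetical and does not exist in the repo.

```python
import shlex
import sys
from typing import List, Optional, Sequence

EVAL_SCRIPT = "eval/evaluate_browser_agent.py"


def build_eval_command(
    aliases: Sequence[str],
    test: Optional[str] = None,
    chrome_uuid: Optional[str] = None,
) -> List[str]:
    """Build the argv for one evaluation run (hypothetical helper)."""
    cmd = [sys.executable, EVAL_SCRIPT]
    if test is not None:
        # Without --test, the script runs all tests.
        cmd += ["--test", test]
    if chrome_uuid is not None:
        # If omitted, OPENBROWSER_CHROME_UUID must be set in the environment.
        cmd += ["--chrome-uuid", chrome_uuid]
    for alias in aliases:
        # One --model-alias flag per configured LLM alias.
        cmd += ["--model-alias", alias]
    return cmd


command = build_eval_command(["plus", "flash"], test="techforum")
print(shlex.join(command))
# To execute from the repo root: subprocess.run(command, check=True)
```

Building the argument list rather than a single shell string avoids quoting problems if a UUID or alias ever contains unusual characters.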
# Using uv (recommended)
uv sync
# Or using pip
pip install -e .
# For development (includes dev dependencies like pytest, black, ruff)
uv sync --group dev
# Or with pip
pip install -e ".[dev]"

uv run local-chrome-server serve

The server will start at http://127.0.0.1:8765 (HTTP) and ws://127.0.0.1:8766 (WebSocket).
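To confirm that both ports are accepting connections before opening the web UI, a plain TCP check is enough. This is a minimal sketch using only the Python standard library; the `is_listening` helper is hypothetical and only tests that something is listening, not that it is OpenBrowser.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


# Ports taken from the startup message above.
for name, port in (("HTTP", 8765), ("WebSocket", 8766)):
    state = "up" if is_listening("127.0.0.1", port) else "not reachable"
    print(f"{name} port {port}: {state}")
```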
On first access, you'll be prompted to configure your LLM settings through the web interface:
- Open http://localhost:8765 in your browser
- You'll see the Configuration Page
- Fill in your API details:
  - Model: Default is dashscope/qwen3.5-plus (also supports dashscope/qwen3.5-flash as a cost-effective alternative)
  - Base URL: Default is https://dashscope.aliyuncs.com/compatible-mode/v1
  - API Key: Your API key (required)
- Optionally configure the Default Working Directory (CWD)
- Click Save and then Continue to Main Interface
Note:
- Configuration is stored in ~/.openbrowser/llm_config.json
- You can modify settings anytime by clicking the ⚙️ Settings button in the status bar
- Environment variables (LLM_API_KEY, LLM_MODEL, LLM_BASE_URL) are no longer supported - please use the web UI configuration
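A quick way to check whether the configuration file has been written is to load it directly. The sketch below assumes only the path given above and that the file is valid JSON; the field names inside it are not documented here, so the code prints the top-level keys rather than assuming a schema. `load_llm_config` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Any

CONFIG_PATH = Path.home() / ".openbrowser" / "llm_config.json"


def load_llm_config(path: Path = CONFIG_PATH) -> Any:
    """Return the parsed config, or None if the file has not been created yet."""
    if not path.exists():
        return None
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


config = load_llm_config()
if config is None:
    print(f"No config at {CONFIG_PATH}; open the web UI to create one.")
elif isinstance(config, dict):
    # Print only the keys so that an API key is never echoed to the terminal.
    print("Configured fields:", sorted(config))
```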
cd extension
npm install
npm run build

- Open Chrome and navigate to chrome://extensions/
- Enable Developer mode (toggle in top-right)
- Click Load unpacked
- Select the extension/dist directory
After installation, OpenBrowser will open a browser-internal page that shows this browser's UUID. This UUID is the permission key for controlling that specific browser instance.
Important:
- Anyone who has this UUID can operate that browser through OpenBrowser
- Do not share it casually
- Clicking the extension icon will reopen the UUID page later
By default, Chrome blocks pop-up windows, which can prevent OpenBrowser from opening new tabs when clicking links. You need to configure Chrome to allow pop-ups:
Option A: Allow pop-ups for specific sites (Recommended)
- When a pop-up is blocked, you'll see a blocked pop-up icon (🚫) in the address bar
- Click the icon and select "Always allow pop-ups and redirects from [site]"
- Click Done
Option B: Allow pop-ups globally
- Open Chrome Settings: chrome://settings/content/popups
- Under "Default behavior", select Sites can send pop-ups and use redirects
- Alternatively, add specific sites to the "Allowed to send pop-ups" section
Note: If OpenBrowser clicks a link but no new tab opens, check the address bar for the blocked pop-up icon. This is a common issue for new users.
Open your browser and visit:
http://localhost:8765
You can now interact with the AI Agent through the web interface.
Before sending commands:
- Copy the browser UUID from the extension page
- Paste it into the BROWSER UUID field in the frontend
- Start chatting
The permission flow is:
- The Chrome extension connects to the server through WebSocket
- The server stores a uuid -> websocket mapping for that browser
- The frontend session asks the user for the UUID
- When the user sends a message, the frontend includes that UUID
- The server uses the UUID to route browser commands to the correct extension connection
This means browser control is authorized by possession of the UUID capability token.
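The routing step can be pictured as a dictionary lookup keyed by UUID. The following is an illustrative sketch only, not the server's actual code: the `BrowserRegistry` class, its method names, and the `{"action": "screenshot"}` command shape are all assumptions made for the example, and a plain callable stands in for the WebSocket connection.

```python
from typing import Callable, Dict, List

# A "connection" here is any callable that delivers one command to an extension.
Connection = Callable[[dict], None]


class BrowserRegistry:
    """Maps browser UUIDs to live extension connections (hypothetical sketch)."""

    def __init__(self) -> None:
        self._connections: Dict[str, Connection] = {}

    def register(self, uuid: str, connection: Connection) -> None:
        # Called when an extension connects and announces its UUID.
        self._connections[uuid] = connection

    def unregister(self, uuid: str) -> None:
        # Called when the extension's connection closes.
        self._connections.pop(uuid, None)

    def route(self, uuid: str, command: dict) -> None:
        # Possession of a registered UUID is the only authorization check.
        connection = self._connections.get(uuid)
        if connection is None:
            raise PermissionError("unknown or disconnected browser UUID")
        connection(command)


registry = BrowserRegistry()
received: List[dict] = []
registry.register("example-uuid", received.append)
registry.route("example-uuid", {"action": "screenshot"})
print(received)  # [{'action': 'screenshot'}]
```

Because the lookup is the whole check in this model, anyone who learns the UUID can send commands, which is why the UUID should be treated like a password.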
OpenBrowser ships with skills for both Codex and OpenClaw:
- skill/codex/open-browser
- skill/openclaw/open-browser
They are similar in purpose, but slightly different in workflow:
- The Codex skill is tuned for Codex-style repo workflows and supports either foreground or background task execution.
- The OpenClaw skill is tuned for OpenClaw usage, emphasizes background execution, and frames OpenBrowser as the stronger option for rendered-page and multi-step browser tasks.
Install the one that matches your local agent environment.
OpenBrowser is developed mainly against the Qwen3.5 family because it gives a useful working point on the capability-versus-cost curve for multimodal browser tasks.
In practice:
- qwen3.5-plus is used for harder visual reasoning and more demanding multi-step execution
- qwen3.5-flash is useful when iteration speed and cost matter more than peak capability
- the project treats model choice as an engineering tradeoff, not as the product itself
Learn more about Qwen3.5:
- Qwen3.5: Towards Native Multimodal Agents (Official Blog)
- Qwen3.5: Towards Native Multimodal Agents (Alibaba Cloud)
- Alibaba unveils Qwen3.5 as China's chatbot race shifts to AI agents (CNBC)
- Alibaba unveils new Qwen3.5 model for 'agentic AI era' (Reuters)
- QwenLM/Qwen3.5 (GitHub)
OpenBrowser is built around visual page understanding and direct interaction. Structured signals such as DOM can still be useful, but they are not assumed to be the whole answer.
The browser worker should not dump all state into the control window. OpenBrowser uses an independent execution path so the control model does not carry the entire browser session history.
The repo contains mocked websites, event tracking, and archived comparison runs. The goal is not just to demo well once, but to improve under regression pressure.
Browser agents are only useful if they remain practical to run. OpenBrowser therefore treats pricing and context usage as core design constraints, not afterthoughts.
- Visual AI Automation: See and interact with web pages using AI-powered visual recognition
- Browser Control: Click, type, scroll, and navigate through visual understanding and JavaScript execution
- Tab Management: Open, close, switch, and manage browser tabs with session isolation
- Data Extraction: Scrape and collect data from websites with AI understanding of page structure
- Form Filling & Submission: Automatically fill forms, submit data, and handle multi-step workflows
- Session Persistence: Maintain browser sessions, cookies, and login states across automation tasks
- Multi-Interface Access: REST API, WebSocket, and CLI for programmatic control
┌────────────────────────────────────────────────────────────────┐
│                Qwen3.5 Family (Multimodal LLM)                 │
│    Qwen3.5-Plus (primary) / Qwen3.5-Flash (cost-effective)     │
│     Visual Perception → Decision Making → Browser Control      │
└────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│               OpenBrowser Agent Server (FastAPI)               │
│ REST API • WebSocket • Session Management • Tool Orchestration │
└────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────┐
│               Chrome Extension (Chrome DevTools)               │
│      Screenshots • JavaScript Execution • Tab Management       │
└────────────────────────────────────────────────────────────────┘
# Extension development build with watch
cd extension
npm run dev
# TypeScript type checking
npm run typecheck

.
├── server/              # FastAPI server and agent logic
│   ├── agent/           # Agent orchestration
│   ├── api/             # REST endpoints
│   ├── core/            # Core processing logic
│   └── websocket/       # WebSocket server
├── extension/           # Chrome extension (TypeScript)
│   ├── src/
│   │   ├── background/  # Background script with CDP
│   │   ├── commands/    # Browser automation commands
│   │   └── content/     # Content script for visual feedback
│   └── dist/            # Built extension
└── frontend/            # Web UI
LGPL-3.0
This project is built upon the OpenHands SDK, which provides the foundation for our agent architecture and tool integration. We gratefully acknowledge the OpenHands team's contributions to the open-source community.
Special thanks to:
- OpenHands Team - For the excellent SDK that powers our agent system
- Qwen Team (Alibaba) - For the powerful Qwen3.5-Plus multimodal model
