This directory contains mocked frontend websites designed to test browser-operating AI agents' capabilities.
- URL: `/gbr/` - Difficulty: Easy
- Purpose: Test AI Agent's page navigation, clicking, and information gathering capabilities
- Features:
- Multi-page news website
- Header navigation
- Search functionality
- Article cards with click tracking
- Subscribe/Sign-in buttons
- URL: `/techforum/` - Difficulty: Medium
- Purpose: Test AI Agent's ability to interact with Q&A forum websites
- Features:
- Question/Answer cards
- Like/Collect/Comment/Share buttons
- Comment modal with text input
- Topic navigation
- Sidebar navigation
- Search functionality
- URL: `/cloudstack/` (legacy: `/aliyun/`) - Difficulty: Hard
- Purpose: Test AI Agent's ability to handle complex enterprise consoles with distractions
- Features:
- Complex dashboard layout
- Instance management table
- Filter and search functionality
- Create instance modal with multi-step form
- Spam popups that appear at intervals (promotions, security alerts, notifications)
- Notification panel
- Multiple action buttons per row
- URL: `/dataflow/` - Difficulty: Medium
- Purpose: Test visual understanding through dashboard interactions
- Features:
- Settings panel with toggle switches
- Revenue chart with interactive elements
- Tab navigation (Revenue, Settings, Reports)
- Quarterly data visualization
- URL: `/finviz/` - Difficulty: Medium
- Purpose: Test complex filter interactions with financial data
- Features:
- 27 dropdown filter options
- Sortable data table (40 stocks)
- Pagination controls
- Multiple view modes (Overview, Valuation, Financial, etc.)
- Dark theme matching original finviz.com
- URL: `/bluebook/` - Difficulty: Hard
- Purpose: Test Xiaohongshu-style visual browsing, search, dense card layouts, modal note reading, and comment interactions
- Features:
- Dense masonry feed with 70+ mocked posts
- Search bar with separate clear/search icon buttons
- Floating "graphic only" and "reload" buttons
- Note detail modal with left media area and right comment panel
- Comment like / reply interactions with author-specific tracking
- Shared tracker integration plus site-specific events
All websites include comprehensive event tracking that records:
- Clicks (element, position, text)
- Scrolls (position, max scroll)
- Input (field, value length)
- Hovers (element, selector)
- Navigation (page changes)
- Form submissions
- Site-specific actions (upvote, comment, instance operations, etc.)
- Events are stored in browser localStorage
- Events are also sent to the server via the `/api/track` endpoint
- The server maintains an in-memory event store
API endpoints:

```bash
curl http://localhost:PORT/api/events
curl http://localhost:PORT/api/events/clear
curl http://localhost:PORT/api/sites
curl http://localhost:PORT/api/help
curl -X POST http://localhost:PORT/api/track \
  -H "Content-Type: application/json" \
  -d '{"eventType": "click", "site": "globalbusinessreview.com", ...}'
```

To start the server:

```bash
cd eval
python server.py
```

The server will:
- Automatically find an available port
- Start serving the websites
- Print URLs for all sites and API endpoints
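Once the server is running, a quick reachability check can be scripted against one of its endpoints. This is a minimal sketch using only the stdlib and the `/api/help` endpoint listed above; the port is an assumption — substitute whatever the server prints at startup.

```python
import urllib.request
import urllib.error

def server_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the eval server answers its /api/help endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/help", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

# Replace 11826 with the port the server actually printed.
if server_ready("http://localhost:11826"):
    print("server is up")
```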
Automated evaluation now requires a browser UUID capability token copied from the Chrome extension UUID page.
Quick start:
```bash
python eval/evaluate_browser_agent.py --test techforum --chrome-uuid YOUR_BROWSER_UUID --model-alias default
```

Recommended options:

```bash
export OPENBROWSER_CHROME_UUID=YOUR_BROWSER_UUID
python eval/evaluate_browser_agent.py --test techforum --model-alias default
python eval/evaluate_browser_agent.py --test techforum --model-alias plus
python eval/evaluate_browser_agent.py --model-alias plus --model-alias flash
python eval/evaluate_browser_agent.py --list
python eval/evaluate_browser_agent.py --manual --test techforum
```

Notes:
- `--chrome-uuid` is required for automated runs that call the OpenBrowser browser-control APIs.
- Automated evaluation also requires at least one `--model-alias`, which must match a configured LLM alias in the OpenBrowser web UI.
- `--manual` and `--list` do not require a browser UUID.
- `OPENBROWSER_CHROME_UUID` is the equivalent environment variable for scripting and CI-style usage.
After an AI agent interacts with the websites, you can:
- Export events: `GET /api/events` returns all tracked events in JSON format
- Analyze behavior: events include timestamps, element selectors, and action types
- Compare sessions: each session has a unique ID for comparison
- Clear and reset: use `/api/events/clear` to reset between tests
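The session-comparison step can be sketched in a few lines once the events JSON is in hand. The `sessionId` field name follows the example event record shown in this document; the sample data below is fabricated for illustration.

```python
from collections import defaultdict

def events_by_session(events):
    """Group tracked events by their sessionId for side-by-side comparison."""
    sessions = defaultdict(list)
    for e in events:
        sessions[e["sessionId"]].append(e)
    return sessions

def session_summary(events):
    """Per-session event counts, useful for comparing agent runs."""
    return {sid: len(evts) for sid, evts in events_by_session(events).items()}

sample = [
    {"sessionId": "session_a", "eventType": "click"},
    {"sessionId": "session_a", "eventType": "scroll"},
    {"sessionId": "session_b", "eventType": "click"},
]
print(session_summary(sample))  # {'session_a': 2, 'session_b': 1}
```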
Example tracked event:

```json
{
  "timestamp": 1710234567890,
  "sessionId": "session_1710234567890_abc123",
  "site": "techforum.com",
  "difficulty": "medium",
  "page": "/techforum/",
  "eventType": "click",
  "element": "BUTTON",
  "elementId": null,
  "elementClass": "action-btn upvote",
  "elementText": "👍 2,341",
  "selector": "button.action-btn.upvote",
  "x": 450,
  "y": 320
}
```

Directory structure:

```
eval/
├── server.py                    # Python server with tracking API
├── evaluate_browser_agent.py    # Evaluation runner
├── dataset/                     # YAML test case definitions
│   ├── gbr.yaml
│   ├── gbr_detailed.yaml
│   ├── techforum.yaml
│   ├── techforum_reply.yaml
│   ├── cloudstack.yaml
│   ├── cloudstack_interactive.yaml
│   ├── finviz_simple.yaml
│   ├── finviz_complex.yaml
│   └── dataflow.yaml
├── css/
│   ├── gbr.css                  # GBR styles
│   ├── techforum.css            # TechForum styles
│   ├── aliyun.css               # Aliyun styles
│   └── finviz.css               # Finviz styles
├── js/
│   ├── tracker.js               # Shared tracking library
│   ├── gbr.js                   # GBR interactions
│   ├── techforum.js             # TechForum interactions
│   ├── aliyun.js                # Aliyun interactions
│   └── finviz.js                # Finviz interactions
├── gbr/                         # News website
│   ├── index.html
│   └── articles/
├── techforum/                   # Q&A forum
│   └── index.html
├── cloudstack/                  # Enterprise console (aliyun clone)
│   └── *.html
├── dataflow/                    # Dashboard visualization
│   └── index.html
└── finviz/                      # Stock screener
    └── index.html
```
To manually test the websites:

- Start the server: `python server.py`
- Open a browser to the displayed URL (e.g., `http://localhost:11826/ws/`)
- Interact with the website (click, scroll, input)
- Check events: `curl http://localhost:11826/api/events`
After an AI agent interacts with a website, you can analyze the tracked events to evaluate its performance. Example evaluation criteria, grouped by site:

GlobalBusinessReview (`/gbr/`):
- Navigation: Did the agent navigate between pages (Home, World, Business, Markets, etc.)?
- Information gathering: Did the agent click on article links to read content?
- Search: Did the agent use the search functionality?
- Subscription: Did the agent attempt to subscribe or sign in?

TechForum (`/techforum/`):
- Button distinction: Did the agent correctly distinguish between like, collect, comment, and share buttons?
- Comment placement: Did the agent open the comment modal and submit a comment on the correct answer?
- Scrolling: Did the agent scroll through the feed to view more content?
- Navigation: Did the agent use the sidebar and header navigation?

CloudStack (`/cloudstack/`):
- Popup handling: Did the agent close spam popups (promotions, security alerts, etc.)?
- Complex UI interaction: Did the agent interact with the instance table, filters, and pagination?
- Multi-step process: Did the agent initiate and progress through the "Create Instance" modal?
- Action selection: Did the agent perform appropriate instance actions (start, restart, connect, etc.)?

DataFlow (`/dataflow/`):
- Settings interaction: Did the agent enable the weekly reports feature?
- Chart interaction: Did the agent click on the quarter with the highest revenue?
- Tab navigation: Did the agent navigate to the Revenue tab?

FinViz (`/finviz/`):
- Filter application: Did the agent apply the correct market cap filter?
- Multi-filter combination: Did the agent apply multiple filters correctly?
- Data interpretation: Did the agent understand the filter results?

General:
- Event completeness: Did the agent trigger expected event types (clicks, scrolls, inputs)?
- Session duration: How long did the agent spend on the task?
- Error rate: Did the agent trigger any error events or fail to complete key actions?
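Many of the click-based criteria above reduce to "did a click land on the right element", which the `selector` field makes straightforward to check. A minimal sketch — the `upvote` selector is taken from the example event record in this document; selectors for other sites will differ:

```python
def clicked_selector(events, selector_substring):
    """True if any click event's selector contains the given substring."""
    return any(
        e.get("eventType") == "click" and selector_substring in (e.get("selector") or "")
        for e in events
    )

events = [
    {"eventType": "click", "selector": "button.action-btn.upvote"},
    {"eventType": "scroll", "selector": None},
]
print(clicked_selector(events, "upvote"))  # True
```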
Use the `/api/events` endpoint to retrieve JSON data. You can write scripts to compute metrics such as:
- Total number of events per type
- Sequence of navigation events
- Time between key actions
- Completion of predefined task flows
Example analysis script:
```python
import requests

events = requests.get('http://localhost:PORT/api/events').json()
clicks = [e for e in events['events'] if e['eventType'] == 'click']
print(f"Total clicks: {len(clicks)}")
```
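The first two metrics above (events per type, time between actions) can also be computed offline against the events list directly. In this sketch, timestamps are assumed to be epoch milliseconds as in the example event record, and the sample data is fabricated for illustration:

```python
from collections import Counter

def event_type_counts(events):
    """Total number of events per type."""
    return Counter(e["eventType"] for e in events)

def gaps_between(events):
    """Seconds elapsed between consecutive events (timestamps are in ms)."""
    ts = sorted(e["timestamp"] for e in events)
    return [(b - a) / 1000.0 for a, b in zip(ts, ts[1:])]

sample = [
    {"eventType": "click", "timestamp": 1710234567890},
    {"eventType": "scroll", "timestamp": 1710234569890},
    {"eventType": "click", "timestamp": 1710234570890},
]
print(event_type_counts(sample))  # Counter({'click': 2, 'scroll': 1})
print(gaps_between(sample))       # [2.0, 1.0]
```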