Merged
20 commits
1b9bd2f feat(openbrowser): wire sdk tool image window (softpudding, Mar 28, 2026)
e6add55 feat(openbrowser): prefer token-driven condensation (softpudding, Mar 28, 2026)
b313728 feat(openbrowser): harden visual element identification (softpudding, Mar 28, 2026)
182665b Unify prompt contracts and stabilize default highlight observations (softpudding, Mar 29, 2026)
148b822 Pin agent-sdk flexible browser reasoning prompts (softpudding, Mar 29, 2026)
9c937f7 Tighten highlight label layout and spacing (softpudding, Mar 29, 2026)
39c577a Refine browser tool prompts and update sdk lock (softpudding, Mar 29, 2026)
f8ad76b Add geometry-first evals and pin SDK trust update (softpudding, Mar 29, 2026)
785c631 Add visible feedback and fit guide blocker to Northstar eval (softpudding, Mar 29, 2026)
455afb5 Clarify scroll_amount semantics (softpudding, Mar 29, 2026)
309aa80 Simplify Northstar evals and persist refresh FAB (softpudding, Mar 29, 2026)
6f8a070 Refine small-model browser prompts (softpudding, Mar 29, 2026)
ac43360 Restore clickable HTML for small-model observations (softpudding, Mar 29, 2026)
9ea5e9c Improve CloudStack DAS agent chat replies (softpudding, Mar 29, 2026)
8dcaf9b docs: update agents guide and evaluation report (softpudding, Mar 29, 2026)
887639e Unify highlight pagination metadata (softpudding, Mar 30, 2026)
d2a6577 update dependency, keep last n images in context (softpudding, Mar 30, 2026)
ba13c90 update evaluation result (softpudding, Mar 30, 2026)
4ffd450 Fix highlight label CI regressions (softpudding, Mar 30, 2026)
845b260 Speed up highlight pagination (softpudding, Mar 30, 2026)
35 changes: 27 additions & 8 deletions AGENTS.md
@@ -165,6 +165,13 @@ OpenBrowser uses Jinja2 templates for agent prompts, enabling dynamic content in
- **Clean output**: `trim_blocks=True` and `lstrip_blocks=True` remove extra whitespace
- **Caching**: Templates are cached after first load for performance

### Model Profile Differences
- Model profile is resolved from session metadata and exposed to prompt rendering as `model_profile` / `small_model`; see `server/agent/manager.py` and `server/agent/tools/prompt_context.py`
- Tool prompt variants are split by model profile under `server/agent/prompts/small_model/` and `server/agent/prompts/big_model/`
- Small-model browser guidance intentionally avoids `keywords` fallback and leans harder on same-mode highlight pagination when dense UI may be split across collision-aware pages
- Observation rendering also differs by model profile: large models keep clickable highlights compact (`... and N clickable elements`), while small models include clickable element HTML in the LLM-visible observation text for extra semantic grounding
- The small-model clickable-observation branch is implemented in `server/agent/tools/base.py`; the per-conversation `small_model` flag is attached in `server/agent/tools/browser_executor.py`
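
The observation split described above could look roughly like the following. This is only a minimal sketch: `render_clickable_observation`, the element dictionaries, and the `[id] html` line format are hypothetical names chosen for illustration, not the actual code in `server/agent/tools/base.py`.

```python
from typing import Dict, List


def render_clickable_observation(elements: List[Dict[str, str]], small_model: bool) -> str:
    """Return LLM-visible text for clickable highlights (hypothetical helper)."""
    summary = f"... and {len(elements)} clickable elements"
    if not small_model:
        # Large-model profile: keep the observation compact.
        return summary
    # Small-model profile: append each element's HTML for extra semantic grounding.
    details = [f"[{el['id']}] {el['html']}" for el in elements]
    return "\n".join([summary, *details])


# Illustrative elements; the ids and markup are made up for this sketch.
elements = [
    {"id": "a1", "html": '<button class="buy">Add to bag</button>'},
    {"id": "a2", "html": '<a href="/guide">Fit guide</a>'},
]

print(render_clickable_observation(elements, small_model=False))
print(render_clickable_observation(elements, small_model=True))
```

The trade-off is token cost against grounding: the compact form stays constant in size, while the small-model form grows with the number of clickable elements on the page.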

### Keyword Discipline
- Highlight pagination remains the default discovery flow for controls and dense UI
- After any significant page-state change, restart discovery with `highlight_elements(element_type="any")` before choosing the next element
@@ -206,10 +213,13 @@ Elements are paginated to ensure **no visual overlap** in each screenshot:
- Reason: OpenBrowser intentionally keeps automated tabs in the browser background, and Chrome may heavily throttle hidden-tab timers. A page-side `setTimeout` stability loop can therefore take far longer than its nominal budget and become the main cause of highlight timeouts.
- In practice, the main cause of unstable first-highlight screenshots is often **missing warmup**, not a bad readiness classifier. A background tab may answer lightweight `Runtime.evaluate` probes while still sitting in a partially painted / partially decoded state.
- A screenshot-style warmup is therefore the default precondition for `highlight_elements`. It helps force hidden-tab paint/compositor/image-decode work before interactive-element detection runs.
- All highlight warmup and highlight screenshot captures now reuse the same screenshot wake-up profile as `tab view` (`TAB_VIEW_SCREENSHOT_CAPTURE_OPTIONS`) instead of a weaker highlight-only profile. The goal is consistency: if a screenshot is needed to wake the page, the highlight path should not use a different, less effective capture mode.
- For navigation-driven default observations such as `tab init`, `tab open`, `tab switch`, `tab refresh`, `tab back`, and `tab forward`, the extension now performs an **internal raw screenshot prime** first, then runs the normal highlight warmup + detection + highlighted screenshot flow. That raw prime screenshot is only for waking the background page and is **not** returned to the agent.
- If `highlight_elements` keeps returning `not_ready` but `tab view` immediately makes the next highlight succeed, treat that as a warmup issue first.
- The extension samples viewport readiness signals once per attempt: document readiness, viewport text/media density, pending images, and loading placeholders such as skeleton/shimmer/spinner indicators.
- Readiness is graded as `ready`, `provisionally_ready`, or `not_ready`.
- If readiness is `not_ready`, the extension performs only a couple of short **background-side** retries before proceeding or returning the latest result.
- The screenshot-side wake-up itself also runs a bounded pre-capture warmup loop. It touches visible viewport media, samples readiness, and retries only a couple of times when the snapshot still looks `not_ready`.
- After screenshot capture, highlight still runs a **consistency check**. This is a drift detector, not a loading detector: it verifies whether sampled highlighted elements moved or disappeared between detection and screenshot.
- Design rule: prefer snapshot classification plus bounded retries; avoid depending on repeated timers inside the target page for highlight stability.
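
The "snapshot classification plus bounded retries" rule can be illustrated with a short sketch. The real logic lives in the browser extension; this Python version is only an analogy, and `wait_for_readiness`, `max_retries`, and `delay_s` are assumed names and values rather than anything taken from the codebase.

```python
import time
from typing import Callable

READY = "ready"
PROVISIONALLY_READY = "provisionally_ready"
NOT_READY = "not_ready"


def wait_for_readiness(sample: Callable[[], str], max_retries: int = 2, delay_s: float = 0.05) -> str:
    """Sample readiness once per attempt and retry only while it is not_ready."""
    grade = sample()
    for _ in range(max_retries):
        if grade != NOT_READY:
            break
        # The wait runs on the caller (background) side, so it does not rely
        # on timers inside the throttled target page.
        time.sleep(delay_s)
        grade = sample()
    # After the retry budget is spent, proceed with the latest grade.
    return grade


# A page that needs two retries before it looks usable.
grades = iter([NOT_READY, NOT_READY, PROVISIONALLY_READY])
result = wait_for_readiness(lambda: next(grades), delay_s=0.0)
print(result)  # provisionally_ready
```

Because the retry count is fixed, the worst-case wait is bounded by `max_retries * delay_s` plus sampling time, regardless of how aggressively the hidden tab is throttled.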

@@ -321,10 +331,10 @@ OpenBrowser has explicit screenshot control for maximum flexibility:

| Command | Auto-Screenshot | Notes |
|---------|------------------|-------|
-| `tab init` | Yes | Verify page load |
-| `tab open` | Yes | Verify new tab |
-| `tab switch` | Yes | Verify tab switch |
-| `tab refresh` | Yes | Verify refresh result |
+| `tab init` | Yes | Returns default `highlight any page 1`; first does an internal raw screenshot prime to wake the page |
+| `tab open` | Yes | Returns default `highlight any page 1`; first does an internal raw screenshot prime to wake the page |
+| `tab switch` | Yes | Returns default `highlight any page 1`; first does an internal raw screenshot prime to wake the page |
+| `tab refresh` | Yes | Returns default `highlight any page 1`; first does an internal raw screenshot prime to wake the page |
|---------|------------------|-------|
| `highlight_elements` | Yes | Visual overlay for element selection |
| `click_element` | Yes | Verify interaction result |
@@ -367,7 +377,9 @@ Automated testing framework for evaluating AI agent performance on browser autom
```
OpenBrowser/eval/
├── evaluate_browser_agent.py # Main evaluation entry point
-├── dataset/ # YAML test case definitions (9 tests)
+├── dataset/ # YAML test case definitions (12 tests)
│ ├── bluebook_simple.yaml # BlueBook search and like test
│ ├── bluebook_complex.yaml # BlueBook multi-image reply test
│ ├── gbr.yaml # GBR search test
│ ├── gbr_detailed.yaml # GBR detailed search test
│ ├── techforum.yaml # TechForum upvote test
@@ -376,10 +388,11 @@ OpenBrowser/eval/
│ ├── cloudstack_interactive.yaml # CloudStack DAS interactive test
│ ├── finviz_simple.yaml # Finviz simple screener test
│ ├── finviz_complex.yaml # Finviz multi-filter test
-│ └── dataflow.yaml # DataFlow visual challenge test
+│ ├── dataflow.yaml # DataFlow visual challenge test
+│ └── northstar_add_bag.yaml # Combined fit-guide and add-to-bag geometry test
├── output/ # Generated results and images
├── server.py # Mock websites server with tracking API
-└── (mock websites: gbr/, techforum/, cloudstack/, dataflow/, finviz/)
+└── (mock websites: gbr/, techforum/, cloudstack/, dataflow/, finviz/, bluebook/, northstar/)
```

### Key Features
@@ -550,17 +563,19 @@ Tests are defined in YAML format with:
| `gbr` | GBR Search Test | easy | 400s (~6.7min) | 0.8 RMB | Search for "fed" related news |
| `finviz_simple` | Finviz Simple Screener Test | easy | 300s (5min) | 0.8 RMB | Filter stocks by market cap over 10 billion |
| `techforum` | TechForum Upvote Test | medium | 300s (5min) | 0.5 RMB | Upvote the first AI-related post |
| `bluebook_simple` | BlueBook Search And Like Test | medium | 300s (5min) | 0.6 RMB | Search for the target note and like it |
| `gbr_detailed` | GBR Detailed Search & Read Test | medium | 600s (10min) | 1.5 RMB | Search for "fed", click into each article (3 articles), and summarize content |
| `finviz_complex` | Finviz Multi-Filter Screener Test | medium | 400s (~6.7min) | 1.0 RMB | Multi-filter stock screener: market cap, P/E, volume |
| `dataflow` | DataFlow Visual Challenge Test | medium | 300s (5min) | 0.5 RMB | Dashboard interactions: settings, reports, navigation |
| `northstar_add_bag` | Northstar Fit Guide + Add To Bag Test | medium | 540s (9min) | 1.2 RMB | Save the Care & Wash fit guide section, then choose size M and add the shell to bag |

#### Advanced Tests
| ID | Name | Difficulty | Time Limit | Cost Limit | Description |
|----|------|------------|------------|------------|-------------|
| `bluebook_complex` | BlueBook Multi-Image Reply Test | hard | 500s (~8.3min) | 1.2 RMB | Search for the OpenClaw note, view all images, and leave a quick comment |
| `cloudstack` | CloudStack DAS Agent Test | hard | 500s (~8.3min) | 1.2 RMB | Find DAS console and greet DAS agent |
| `techforum_reply` | TechForum Comment Reply Test | hard | 500s (~8.3min) | 1.0 RMB | Open comments, find "Graduate Student" comment, reply with paper name |
| `cloudstack_interactive` | CloudStack DAS Interactive Test | very hard | 700s (~11.7min) | 2.0 RMB | Multi-turn conversation with DAS agent: greeting, system status, storage check |

#### Event Matching Notes
- **Standard events**: `page_view`, `click`, `input`, `submit`, `hover`, `scroll`, `answer_action`
- **Special event types**:
@@ -586,6 +601,10 @@ Criteria match tracked events using flexible pattern matching:
- Page URLs, input values, custom fields
- Alternative conditions for flexible scoring

### Deferred Prompt And Observation Follow-Ups
- Observation design: add structured geometry hints such as `partly_visible`, `near_viewport_edge`, `occluded_by_sticky_ui`, explicit scroll-container identity, and structured stale-element causes before expanding prompt text again.
- Prompt compaction: after geometry-focused eval results stabilize, reduce duplicated rules between the SDK system prompt and tool prompts so tool templates keep only tool-local contracts and recovery guidance.
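
As a rough illustration of what two of the proposed geometry hints might compute, here is a minimal sketch. Only the hint names `partly_visible` and `near_viewport_edge` come from the list above; the `Rect` type, the `geometry_hints` function, and the 8-pixel `edge_margin` are assumptions for the sake of the example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    right: float
    bottom: float


def geometry_hints(element: Rect, viewport: Rect, edge_margin: float = 8.0) -> Dict[str, bool]:
    """Derive structured visibility hints from bounding boxes (hypothetical helper)."""
    intersects = (
        element.left < viewport.right
        and element.right > viewport.left
        and element.top < viewport.bottom
        and element.bottom > viewport.top
    )
    fully_inside = (
        element.left >= viewport.left
        and element.top >= viewport.top
        and element.right <= viewport.right
        and element.bottom <= viewport.bottom
    )
    # An element counts as near the edge when any side sits within the margin
    # of, or beyond, the matching viewport side.
    near_edge = intersects and (
        element.left - viewport.left < edge_margin
        or element.top - viewport.top < edge_margin
        or viewport.right - element.right < edge_margin
        or viewport.bottom - element.bottom < edge_margin
    )
    return {
        "partly_visible": intersects and not fully_inside,
        "near_viewport_edge": near_edge,
    }


viewport = Rect(0, 0, 1280, 720)
clipped_button = Rect(1200, 300, 1320, 340)  # overflows the right edge
centered_button = Rect(500, 300, 600, 340)

print(geometry_hints(clipped_button, viewport))
print(geometry_hints(centered_button, viewport))
```

Hints of this shape would let the observation say why an element is risky to click, instead of adding more prose rules to the prompts.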

## NOTES

- **Git dependencies:** `openhands-sdk` and `openhands-tools` from git subdirectories
71 changes: 70 additions & 1 deletion eval/cloudstack/js/das-agent.js
@@ -42,6 +42,75 @@ document.addEventListener('DOMContentLoaded', function() {
this.style.height = (this.scrollHeight) + 'px';
});
}

function normalizeMessage(message) {
    return message.toLowerCase().replace(/\s+/g, ' ').trim();
}

function containsAny(text, keywords) {
    return keywords.some(keyword => text.includes(keyword));
}

function buildAgentReply(message) {
    const normalizedMessage = normalizeMessage(message);

    const greetingKeywords = [
        'hello',
        'hi',
        'hey',
        'greetings',
        'good morning',
        'good afternoon',
        'good evening'
    ];
    const statusKeywords = [
        'status',
        'system',
        'health',
        'report',
        'running',
        'current state',
        'how are you'
    ];
    const storageKeywords = [
        'storage',
        'disk',
        'space',
        'capacity',
        'usage',
        'utilization',
        'volume'
    ];
    const cpuKeywords = ['cpu', 'load'];
    const memoryKeywords = ['memory', 'ram'];
    const alertKeywords = ['alert', 'warning', 'alarm', 'incident', 'issue'];

    if (containsAny(normalizedMessage, storageKeywords)) {
        return 'Storage usage check complete: primary cluster is at 68% used, log volume is at 42%, and free capacity is enough for current workload. No immediate storage risk detected.';
    }

    if (containsAny(normalizedMessage, statusKeywords)) {
        return 'Current system status is stable. Core database services are online, replication delay is within threshold, and there are no critical incidents at the moment.';
    }

    if (containsAny(normalizedMessage, cpuKeywords)) {
        return 'CPU load is moderate right now, averaging around 34% across the main database nodes. No hot node is currently flagged.';
    }

    if (containsAny(normalizedMessage, memoryKeywords)) {
        return 'Memory usage is healthy. Working set pressure is low and cache hit rate remains within the expected range.';
    }

    if (containsAny(normalizedMessage, alertKeywords)) {
        return 'There are no active P1 alerts. I only see a few low-priority optimization suggestions related to slow-query tuning and index review.';
    }

    if (containsAny(normalizedMessage, greetingKeywords)) {
        return 'Hello. I am DAS Agent. I can help with system status, storage usage, alerts, and database operations checks.';
    }

    return 'I can help with database operations. You can ask me for current system status, storage usage, performance health, or active alerts.';
}

// Send message function
function sendMessage() {
@@ -74,7 +143,7 @@ document.addEventListener('DOMContentLoaded', function() {

// Simulate agent response delay
setTimeout(function() {
-addAgentMessage('Hello, I am DAS Agent');
+addAgentMessage(buildAgentReply(message));
sendBtn.disabled = false;
sendBtn.textContent = 'Send Message';
}, 800);
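
One property of the substring-based `containsAny` check in this file is worth keeping in mind: short keywords such as `hi` also match inside longer words such as `this`. The sketch below mirrors the check in Python and contrasts it with a word-boundary variant. Both helper names are hypothetical, and the word-boundary version is an alternative shown for comparison, not something this change implements.

```python
import re
from typing import Iterable


def contains_any_substring(text: str, keywords: Iterable[str]) -> bool:
    """Mirror of the JS check: true if any keyword occurs anywhere in the text."""
    return any(keyword in text for keyword in keywords)


def contains_any_word(text: str, keywords: Iterable[str]) -> bool:
    """Stricter variant: a keyword must appear as a whole word or phrase."""
    return any(re.search(rf"\b{re.escape(keyword)}\b", text) for keyword in keywords)


greeting_keywords = ["hello", "hi", "hey"]
message = "is this thing on"

print(contains_any_substring(message, greeting_keywords))  # True ("hi" inside "this")
print(contains_any_word(message, greeting_keywords))       # False
print(contains_any_word("hi there", greeting_keywords))    # True
```

In the code above the greeting branch is evaluated last, so a false greeting match can only replace the generic fallback reply; for a mock eval site that is probably an acceptable trade-off.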
43 changes: 43 additions & 0 deletions eval/dataset/northstar_add_bag.yaml
@@ -0,0 +1,43 @@
id: northstar_add_bag
name: "Northstar Fit Guide + Add To Bag Test"
difficulty: medium
description: "Open the fit guide, save the Care & Wash section, then reposition the purchase rail, choose size M, and add the shell to bag."
start_url: "http://localhost:16605/northstar/"
instruction: "On the Northstar Outfitters Commuter Shell page, open the fit guide, scroll inside it until the Care & Wash section is centered, save the guide from that section, then choose size Medium and add the jacket to your bag."
time_limit: 540.0
cost_limit: 1.2

criteria:
  - type: open_fit_guide
    description: "Open the fit guide drawer"
    points: 0.5
    expected:
      event_type: fit_guide_open
      page: "/northstar/"
      drawer: "fit-guide"

  - type: save_fit_guide
    description: "Save the fit guide from the Care & Wash section"
    points: 2.0
    expected:
      event_type: fit_guide_save
      page: "/northstar/"
      section: "care-wash"

  - type: select_medium_size
    description: "Select size Medium"
    points: 1.5
    expected:
      event_type: product_size_select
      page: "/northstar/"
      productId: "commuter-shell"
      size: "M"

  - type: add_to_bag
    description: "Add the selected shell to bag"
    points: 2.0
    expected:
      event_type: product_add_to_bag
      page: "/northstar/"
      productId: "commuter-shell"
      size: "M"
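
To make the scoring implied by these criteria concrete, here is a minimal sketch of how tracked events could be matched against each `expected` block. It assumes exact field equality, which is stricter than the flexible pattern matching the evaluator is described as using. The `event_matches` and `score` helpers and the `size-chart` section value are hypothetical.

```python
from typing import Any, Dict, List

Criterion = Dict[str, Any]
Event = Dict[str, Any]


def event_matches(expected: Dict[str, Any], event: Event) -> bool:
    """An event satisfies a criterion when every expected field is equal."""
    return all(event.get(key) == value for key, value in expected.items())


def score(criteria: List[Criterion], events: List[Event]) -> float:
    """Sum the points of every criterion matched by at least one event."""
    return sum(
        criterion["points"]
        for criterion in criteria
        if any(event_matches(criterion["expected"], event) for event in events)
    )


criteria = [
    {"type": "open_fit_guide", "points": 0.5,
     "expected": {"event_type": "fit_guide_open", "page": "/northstar/", "drawer": "fit-guide"}},
    {"type": "save_fit_guide", "points": 2.0,
     "expected": {"event_type": "fit_guide_save", "page": "/northstar/", "section": "care-wash"}},
    {"type": "select_medium_size", "points": 1.5,
     "expected": {"event_type": "product_size_select", "page": "/northstar/",
                  "productId": "commuter-shell", "size": "M"}},
    {"type": "add_to_bag", "points": 2.0,
     "expected": {"event_type": "product_add_to_bag", "page": "/northstar/",
                  "productId": "commuter-shell", "size": "M"}},
]

# A run that saved the guide from the wrong section (made-up value) but did the rest.
events = [
    {"event_type": "fit_guide_open", "page": "/northstar/", "drawer": "fit-guide"},
    {"event_type": "fit_guide_save", "page": "/northstar/", "section": "size-chart"},
    {"event_type": "product_size_select", "page": "/northstar/",
     "productId": "commuter-shell", "size": "M"},
    {"event_type": "product_add_to_bag", "page": "/northstar/",
     "productId": "commuter-shell", "size": "M"},
]

max_points = sum(criterion["points"] for criterion in criteria)
print(score(criteria, events), "of", max_points)  # 4.0 of 6.0
```

Under this reading, the `section: "care-wash"` field is what turns the save step into a geometry check: saving the guide without first scrolling the drawer to the right section earns no points for that criterion.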