
Commit 457c9fd

fpagny and RoRoJ authored
feat(genapi): tutorial for building a web browser agent (#5888)
* feat(genapi): tutorial for building a web browser agent

  This tutorial covers the creation of a web browser agent using Generative APIs, Holo2, and Selenium, detailing the setup, tasks, and automation process.

* fix(genapi): wording

---------

Co-authored-by: Rowena Jones <36301604+RoRoJ@users.noreply.github.com>
1 parent 33305ef commit 457c9fd

1 file changed: +217 -0 lines changed
  • tutorials/build-web-browser-agent-generativeapis

@@ -0,0 +1,217 @@
---
title: Building a web browser agent with Generative APIs, Holo2 and Selenium
description: Follow our step by step guide to build a web browser agent with Holo2, Selenium and Scaleway Generative APIs
tags: inference API selenium browser holo2 agent vision ai
dates:
  validation: 2025-11-28
  posted: 2025-11-28
  validation_frequency: 12
products:
  - generative-apis
difficulty: beginner
usecase:
  - build-and-run-ai
ecosystem:
  - scaleway-only
---
import Requirements from '@macros/iam/requirements.mdx'

# Building a web browser agent with Generative APIs, Holo2 and Selenium

This tutorial guides you through creating a web agent that interacts with websites using LLM vision and browser automation. You will build on Scaleway's Generative APIs to create a system that can "see" web pages and take actions based on visual understanding. This approach is particularly useful for building more flexible pipelines compared to HTML DOM parsing, and requires less maintenance over time when a website's HTML code changes.

## Why the Holo2 model?

The [Holo2 model](https://huggingface.co/Hcompany/Holo2-30B-A3B) is a vision-language model optimized to understand Graphical User Interfaces (GUIs), such as web pages or mobile applications, and to perform actions on them. Compared to traditional HTML DOM parsing, a vision model lets you build more flexible pipelines that require less maintenance over time when a website's code and structure change.

## What you will learn

- How to take screenshots and perform actions using ***Selenium***
- How to analyze images and process the resulting actions using the ***Holo2 vision model***

<Requirements />

- A Scaleway account logged into the [console](https://console.scaleway.com)
- [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization
- A valid [API key](/iam/how-to/create-api-keys/)
- Access to the [Generative APIs service](/generative-apis/quickstart/)
- [Python 3.11+](https://www.python.org/downloads/) installed on your local computer
- [Chrome](https://www.google.com/intl/fr/chrome/gsem/download) or [Firefox](https://www.firefox.com/fr/) browser installed on your local computer

## Install required packages

Run the following command to install the required packages:
```bash
pip install openai selenium
```

- `openai` lets you query the Holo2 model through Generative APIs
- `selenium` lets you interact with browsers

## Import dependencies

Create a new `holo2-agent.py` file with the following content:

```python
#holo2-agent.py
from openai import OpenAI
from selenium import webdriver
from selenium.webdriver.common.actions.action_builder import ActionBuilder
import base64
import os
import json
import time
```

## Define tasks to perform and website to browse

Define the tasks you want your agent to perform, and the website to perform them on:

```python
#holo2-agent.py
TASKS = ["Accept cookies", "Click on changelog", "Select newly added feature"]
WEBSITE_URL = "https://www.scaleway.com/en/docs/"
```
In this example, the agent goes to the Scaleway Documentation website and looks for recently added features.

## Create the model output structure

By default, Holo2 outputs text data. However, it can be guided to output coordinates in a structured way using `JSON`:

```python
#holo2-agent.py
output_structure = {
    'x': {
        'description': 'The x coordinate, normalized between 0 and 1000.',
        'type': 'integer'
    },
    'y': {
        'description': 'The y coordinate, normalized between 0 and 1000.',
        'type': 'integer'
    }
}
```
93+
94+
This structure ensures coordinates are valid within the expected range (0-1000), normalized and independant of the exact browser window size).
95+
96+
## Connect to Scaleway's Generative APIs

Set up the OpenAI client to connect to Scaleway's API:

```python
#holo2-agent.py
client = OpenAI(
    base_url="https://api.scaleway.ai/v1",
    api_key=os.getenv("SCW_SECRET_KEY") # Store your IAM API KEY in this environment variable or replace directly with your IAM API key
)
```

## Identify where to click on an image

Create a function that sends a task and a screenshot of the interface to the AI model, and outputs (x,y) coordinates on the screen. The coordinates correspond to the location the agent needs to click to perform the task:

```python
#holo2-agent.py
def get_next_action(task):
    # Read and encode the current screenshot
    with open("current_screen.png", "rb") as file:
        image_content = file.read()
    base64_img = base64.b64encode(image_content).decode("utf-8")

    # Send the image and task to the vision model
    response = client.chat.completions.create(
        model="holo2-30b-a3b",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{base64_img}",
                        },
                    },
                    {
                        "type": "text",
                        "text": f"""Localize an element on the GUI image according to the provided target and output a click position.
* You must output a valid JSON following the format: {json.dumps(output_structure)}
Your target is: {task}"""
                    }
                ]
            }
        ],
        max_tokens=10000,
        temperature=0.8,
        top_p=0.95
    )

    # Parse the response
    next_action = json.loads(response.choices[0].message.content)
    return next_action
```

## Automate browser actions

Use Selenium to:
- Take a screenshot of the current page
- Click on the right location to perform the task

Click coordinates are retrieved from Holo2 using the `get_next_action` function. These coordinates are adjusted to exact browser coordinates using `window.innerWidth` and `window.innerHeight` browser properties.

```python
#holo2-agent.py
driver = webdriver.Chrome() # Use webdriver.Firefox() for Firefox

try:
    # Navigate to the target website
    driver.get(WEBSITE_URL)
    time.sleep(3) # Wait for the page to load

    # Process each task in sequence
    for task in TASKS:
        # Take a screenshot of the current page
        driver.save_screenshot('current_screen.png')

        # Get the current page dimensions
        page_width = driver.execute_script("return window.innerWidth;")
        page_height = driver.execute_script("return window.innerHeight;")

        # Get click position from the AI model
        next_action = get_next_action(task)

        # Convert normalized coordinates to actual screen coordinates
        click_x = (next_action['x'] / 1000) * page_width
        click_y = (next_action['y'] / 1000) * page_height

        # Create and perform the click action
        action = ActionBuilder(driver)
        action.pointer_action.move_to_location(click_x, click_y)
        action.pointer_action.click()
        action.perform()

        print(f"Performing task: {task}. Clicked at coordinates: X={click_x}, Y={click_y}")
        time.sleep(3) # Wait to see the result

finally:
    # Always close the browser
    driver.quit()
```

## Run the agent

Execute the agent with:
```bash
SCW_SECRET_KEY="your_scaleway_secret_key" \
python holo2-agent.py
```

You should see your browser open and perform the given actions until the most recently added features are displayed.

Congratulations! You have built a web browser agent that navigates a website based only on tasks written as text.

## Going further

- Add different tasks and see the agent adapt to them
- Add other action types, such as scrolling through the page or typing text
- Use [Pydantic](https://docs.pydantic.dev/latest/) to define the output structure and check its validity
- Extract information from the browsed page
- Use the agent with "thick" client applications, such as desktop or mobile applications (without Selenium)
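
As one possible starting point for supporting other action types, the sketch below extends the output structure with an `action` field and routes each model answer to a matching handler. Everything in it is hypothetical: the `action` and `text` fields, the `dispatch` function and the handlers are illustrative names, and whether Holo2 follows such an extended structure reliably has not been verified here. The handlers only return a description so the example runs without a browser. In the agent, they would call Selenium instead.

```python
# Hypothetical extension of output_structure with an action type and optional text
extended_output_structure = {
    "action": {
        "description": "The action to perform.",
        "type": "string",
        "enum": ["click", "scroll", "type"],
    },
    "x": {"description": "The x coordinate, normalized between 0 and 1000.", "type": "integer"},
    "y": {"description": "The y coordinate, normalized between 0 and 1000.", "type": "integer"},
    "text": {"description": "Text to type, only for the 'type' action.", "type": "string"},
}


def dispatch(action, handlers):
    """Call the handler registered for action['action'] and return its result."""
    kind = action.get("action")
    if kind not in handlers:
        raise ValueError(f"Unsupported action: {kind!r}")
    return handlers[kind](action)


# Stand-in handlers that only describe what they would do
handlers = {
    "click": lambda a: f"click at ({a['x']}, {a['y']})",
    "scroll": lambda a: f"scroll at ({a['x']}, {a['y']})",
    "type": lambda a: f"type {a['text']!r}",
}

print(dispatch({"action": "click", "x": 500, "y": 120}, handlers))  # click at (500, 120)
print(dispatch({"action": "type", "text": "holo2"}, handlers))      # type 'holo2'
```
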
