---
title: Building a web browser agent with Generative APIs, Holo2 and Selenium
description: Follow our step-by-step guide to build a web browser agent with Holo2, Selenium and Scaleway Generative APIs
tags: inference API selenium browser holo2 agent vision ai
dates:
  validation: 2025-11-28
  posted: 2025-11-28
  validation_frequency: 12
products:
  - generative-apis
difficulty: beginner
usecase:
  - build-and-run-ai
ecosystem:
  - scaleway-only
---
import Requirements from '@macros/iam/requirements.mdx'

# Building a web browser agent with Generative APIs, Holo2 and Selenium

This tutorial guides you through creating a web agent that can interact with websites using LLM vision and browser automation. You will build on Scaleway's Generative APIs to create a system that can "see" web pages and take actions based on visual understanding.

## Why the Holo2 model?

The [Holo2 model](https://huggingface.co/Hcompany/Holo2-30B-A3B) is a vision-language model optimized to understand Graphical User Interfaces (GUIs), such as web pages or mobile applications, and to perform actions on them. Compared to traditional HTML DOM parsing, a vision model lets you build more flexible pipelines that require less maintenance when a website's code and structure change.

## What you will learn
- How to take screenshots and perform actions using ***Selenium***
- How to analyze images and process action outputs using the ***Holo2 vision model***

<Requirements />

- A Scaleway account logged into the [console](https://console.scaleway.com)
- [Owner](/iam/concepts/#owner) status or [IAM permissions](/iam/concepts/#permission) allowing you to perform actions in the intended Organization
- A valid [API key](/iam/how-to/create-api-keys/)
- Access to the [Generative APIs service](/generative-apis/quickstart/)
- [Python 3.11+](https://www.python.org/downloads/) installed on your local computer
- [Chrome](https://www.google.com/intl/fr/chrome/gsem/download) or [Firefox](https://www.firefox.com/fr/) browser installed on your local computer

## Install required packages

Run the following command to install the required packages:
```bash
pip install openai selenium
```

- `openai` lets you query the Holo2 model through Generative APIs
- `selenium` lets you automate and interact with a browser

## Import dependencies

Create a new `holo2-agent.py` file with the following content:

```python
#holo2-agent.py
from openai import OpenAI
from selenium import webdriver
from selenium.webdriver.common.actions.action_builder import ActionBuilder
import base64
import os
import json
import time
```

## Define tasks to perform and website to browse

Define what tasks you want your agent to perform, and on which website:

```python
#holo2-agent.py
TASKS = ["Accept cookies", "Click on changelog", "Select newly added feature"]
WEBSITE_URL = "https://www.scaleway.com/en/docs/"
```
For this example, we will make the agent go to the Scaleway Documentation website and look for recently added features.

## Create the model output structure

By default, Holo2 outputs text data. However, it can be guided to output coordinates in a structured way using `JSON`:

```python
#holo2-agent.py
output_structure = {
    'x': {
        'description': 'The x coordinate, normalized between 0 and 1000.',
        'type': 'integer'
    },
    'y': {
        'description': 'The y coordinate, normalized between 0 and 1000.',
        'type': 'integer'
    }
}
```

This structure ensures coordinates are valid within the expected range (0-1000): normalized, and independent of the exact browser window size.
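On top of the schema, you may also want a client-side check that the values the model returns really fall within the 0-1000 range before acting on them. A minimal sketch (the `validate_click` helper is our own addition, not part of the final script):

```python
import json

def validate_click(raw: str) -> dict:
    """Parse the model's JSON reply and check that both coordinates are in 0-1000."""
    action = json.loads(raw)
    for axis in ("x", "y"):
        value = action[axis]
        if not isinstance(value, int) or not 0 <= value <= 1000:
            raise ValueError(f"{axis} coordinate out of range: {value!r}")
    return action

print(validate_click('{"x": 512, "y": 300}'))  # {'x': 512, 'y': 300}
```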

## Connect to Scaleway's Generative APIs

Set up the OpenAI client to connect to Scaleway's API:

```python
#holo2-agent.py
client = OpenAI(
    base_url="https://api.scaleway.ai/v1",
    api_key=os.getenv("SCW_SECRET_KEY") # Store your IAM API key in this environment variable, or replace this with your IAM API key directly
)
```
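Note that `os.getenv` returns `None` when the variable is unset, so a missing key only surfaces later as an authentication error on the first API call. An optional fail-fast guard (the `require_api_key` helper is our own suggestion, not part of the tutorial's script):

```python
import os

def require_api_key(var: str = "SCW_SECRET_KEY") -> str:
    """Return the API key from the environment, failing fast if it is missing."""
    key = os.getenv(var)
    if not key:
        raise RuntimeError(f"Set the {var} environment variable to your IAM API key")
    return key
```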

## Identify where to click on an image

Create a function that sends a task and an interface screenshot to the AI model, and outputs (x, y) coordinates on the screen. The coordinates correspond to the location the agent needs to click to perform the task:

```python
#holo2-agent.py
def get_next_action(task):
    # Read and encode the current screenshot
    with open("current_screen.png", "rb") as file:
        image_content = file.read()
    base64_img = base64.b64encode(image_content).decode("utf-8")

    # Send the image and task to the vision model
    response = client.chat.completions.create(
        model="holo2-30b-a3b",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{base64_img}",
                        },
                    },
                    {
                        "type": "text",
                        "text": f"""Localize an element on the GUI image according to the provided target and output a click position.
                        * You must output a valid JSON following the format: {json.dumps(output_structure)}
                        Your target is: {task}"""
                    }
                ]
            }
        ],
        max_tokens=10000,
        temperature=0.8,
        top_p=0.95
    )

    # Parse the response
    next_action = json.loads(response.choices[0].message.content)
    return next_action
```
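`json.loads` raises if the model ever wraps its reply in a Markdown code fence, which chat models occasionally do. If you want to harden the parsing step, here is a sketch of a more tolerant parser (the fence-stripping heuristic is our own addition, not part of the tutorial's script):

```python
import json

def parse_model_json(content: str) -> dict:
    """Parse model output, tolerating an optional Markdown code fence wrapper."""
    fence = chr(96) * 3  # a literal triple-backtick marker
    text = content.strip()
    if text.startswith(fence):
        # Drop the opening fence line (with its optional language tag) and the closing fence
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(fence, 1)[0]
    return json.loads(text)

print(parse_model_json('{"x": 120, "y": 450}'))  # {'x': 120, 'y': 450}
```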

## Automate browser actions

Use Selenium to:
- Take a screenshot of the current page
- Click on the right location to perform the task

Click coordinates are retrieved from Holo2 using the `get_next_action` function. These normalized coordinates are then converted to exact browser coordinates using the `window.innerWidth` and `window.innerHeight` browser properties.
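This normalized-to-pixel conversion is plain arithmetic and can be sanity-checked in isolation. A small sketch (the `to_pixels` helper is ours; the main script performs the same computation inline):

```python
def to_pixels(norm_x: int, norm_y: int, width: int, height: int) -> tuple[float, float]:
    """Convert Holo2's 0-1000 normalized coordinates to browser pixel coordinates."""
    return (norm_x / 1000) * width, (norm_y / 1000) * height

# A point in the middle of a 1280x720 viewport:
print(to_pixels(500, 250, 1280, 720))  # (640.0, 180.0)
```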

```python
#holo2-agent.py
driver = webdriver.Chrome() # Use webdriver.Firefox() for Firefox

try:
    # Navigate to the target website
    driver.get(WEBSITE_URL)
    time.sleep(3) # Wait for the page to load

    # Process each task in sequence
    for task in TASKS:
        # Take a screenshot of the current page
        driver.save_screenshot('current_screen.png')

        # Get the current page dimensions
        page_width = driver.execute_script("return window.innerWidth;")
        page_height = driver.execute_script("return window.innerHeight;")

        # Get click position from the AI model
        next_action = get_next_action(task)

        # Convert normalized coordinates to actual screen coordinates
        click_x = (next_action['x'] / 1000) * page_width
        click_y = (next_action['y'] / 1000) * page_height

        # Create and perform the click action
        action = ActionBuilder(driver)
        action.pointer_action.move_to_location(click_x, click_y)
        action.pointer_action.click()
        action.perform()

        print(f"Performing task: {task}. Clicked at coordinates: X={click_x}, Y={click_y}")
        time.sleep(3) # Wait to see the result

finally:
    # Always close the browser
    driver.quit()
```

## Run the agent

Execute the agent with:
```bash
SCW_SECRET_KEY="your_scaleway_secret_key" \
python holo2-agent.py
```

You should see your browser open and perform the given actions until the most recently added features are displayed.

Congratulations! You have built a web browser agent that navigates a website based only on written task descriptions.

## Going further

- Add different tasks and see the agent adapt to them
- Add other action types, such as scrolling through the page or typing text
- Use [Pydantic](https://docs.pydantic.dev/latest/) to define the output structure and check its validity
- Extract information from the browsed page
- Use the agent with "thick" client applications, such as desktop or mobile applications (without Selenium)