Hi,
Thanks for the great details in the Qwen3VL reference implementation at the function below:
def get_qwen3vl_prompt_msg(image, instruction, screen_width, screen_height):
    return [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": """You are a helpful assistant. The user will give you an instruction, and you MUST left click on the corresponding UI element via tool call. If you are not sure about where to click, guess a most likely one.\n\n# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse to interact with a computer.\n* The screen's resolution is 1000x1000.\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. \n* You can only use the left_click action to interact with the computer.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\n* `left_click`: Click the left mouse button with coordinate (x, y).", "enum": ["left_click"], "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=left_click`.", "type": "array"}, "required": ["action"], "type": "object"}}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>""".replace(r"{screen_width}", str(screen_width)).replace(r"{screen_height}", str(screen_height))
                }
            ]
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    # "min_pixels": 3136,
                    # "max_pixels": 12845056,
                    "image_url": {
                        "url": "data:image/png;base64," + convert_pil_image_to_base64(image)
                    }
                },
                {
                    "type": "text",
                    "text": instruction
                }
            ]
        }
    ]
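For reference, here is a minimal sketch of how a reply in the format requested above could be decoded into a click position. It assumes the model emits coordinates in the 1000x1000 space stated in the tool description and that they are scaled linearly to the real screenshot size. The helper name `parse_click` and the `grid` parameter are hypothetical and are not taken from the repository, whose actual post-processing may differ.

```python
import json
import re

# Assumption: the reply contains a <tool_call>...</tool_call> block whose JSON
# payload follows the format requested in the system prompt.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_click(response, screen_width, screen_height, grid=1000):
    """Return the predicted click as (x, y) in screenshot pixels, or None.

    Hypothetical helper: `grid` is the side length of the normalized
    coordinate space named in the tool description (1000x1000).
    """
    match = TOOL_CALL_RE.search(response)
    if match is None:
        return None  # no tool call found in the reply
    try:
        call = json.loads(match.group(1))
        x, y = call["arguments"]["coordinate"]
        # Linear rescaling from the normalized grid to real pixels.
        return x / grid * screen_width, y / grid * screen_height
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None  # malformed JSON or missing/invalid coordinate
```

Under these assumptions, a reply whose coordinate is `[500, 250]` maps to `(1920.0, 540.0)` on a 3840x2160 screenshot, and a reply without a parseable tool call yields `None`.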
By strictly following this prompt and using a sufficiently large max_pixels, I successfully reproduced the performance of Qwen3-VL-8B-Instruct. However, for Qwen3-VL-2B-Instruct, I only achieved a score of 43.33, whereas the technical report shows 48.5. I also tried the prompt shown in the report, but obtained even lower performance.
Is there anything I might be missing that could help me reproduce the correct result for this smaller model?
Source of the function above: ScreenSpot-Pro-GUI-Grounding/models/qwen3vl.py, lines 20 to 61 in 573b40e