Inquiry about the reproducibility of Qwen3-VL-2B-Instruct #23

@panlinchao

Hi,
Thanks for the detailed Qwen3-VL reference implementation in the function below:

def get_qwen3vl_prompt_msg(image, instruction, screen_width, screen_height):
    return [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text":
                        """You are a helpful assistant. The user will give you an instruction, and you MUST left click on the corresponding UI element via tool call. If you are not sure about where to click, guess a most likely one.\n\n# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse to interact with a computer.\n* The screen's resolution is 1000x1000.\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. \n* You can only use the left_click action to interact with the computer.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\n* `left_click`: Click the left mouse button with coordinate (x, y).", "enum": ["left_click"], "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=left_click`.", "type": "array"}, "required": ["action"], "type": "object"}}}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>""".replace(r"{screen_width}", str(screen_width)).replace(r"{screen_height}", str(screen_height))
                }
            ]
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    # "min_pixels": 3136,
                    # "max_pixels": 12845056,
                    "image_url": {
                        "url": "data:image/png;base64," + convert_pil_image_to_base64(image)
                    }
                },
                {
                    "type": "text",
                    "text": instruction
                }
            ]
        }
    ]
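
The function above calls `convert_pil_image_to_base64`, whose body is not shown here. A minimal sketch of what such a helper might look like, assuming `image` is a PIL image (or any object exposing `save(fp, format=...)`) and that PNG encoding is intended, as the `data:image/png;base64,` prefix suggests. The `image_format` parameter is hypothetical, and the actual helper in the reference implementation may differ:

```python
import base64
import io


def convert_pil_image_to_base64(image, image_format="PNG"):
    """Encode an image as a base64 string (assumed behavior, not the original helper)."""
    buffer = io.BytesIO()
    # PIL's Image.save accepts a file-like object plus an explicit format.
    image.save(buffer, format=image_format)
    # b64encode returns bytes; decode so the result can be concatenated into a data URL.
    return base64.b64encode(buffer.getvalue()).decode("ascii")
```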

By strictly following this prompt and using a sufficiently large max_pixels, I successfully reproduced the performance of Qwen3-VL-8B-Instruct. However, for Qwen3-VL-2B-Instruct, I only achieved a score of 43.33, whereas the technical report shows 48.5. I also tried the prompt shown in the report, but obtained even lower performance.
Is there anything I might be missing that could help me reproduce the correct result for this smaller model?
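
Since scoring depends on how the predicted click is extracted and scaled, here is a minimal sketch of the post-processing this prompt implies. It assumes the model replies in the `<tool_call>` format above and that its coordinates lie on the 1000x1000 grid declared in the prompt. The names `parse_click` and `to_screen_pixels` are hypothetical and are not taken from the reference implementation:

```python
import json
import re

# Capture whatever sits between the first pair of <tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_click(response_text):
    """Return (x, y) from the first tool call, or None if it cannot be parsed."""
    match = TOOL_CALL_RE.search(response_text)
    if match is None:
        return None
    try:
        call = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict):
        return None
    args = call.get("arguments")
    if not isinstance(args, dict):
        return None
    coord = args.get("coordinate")
    if not (isinstance(coord, list) and len(coord) == 2):
        return None
    if not all(isinstance(v, (int, float)) for v in coord):
        return None
    return float(coord[0]), float(coord[1])


def to_screen_pixels(x, y, screen_width, screen_height, grid=1000):
    """Map a point on the assumed grid x grid canvas to real screen pixels."""
    return x * screen_width / grid, y * screen_height / grid
```

For example, a reply containing `"coordinate": [500, 300]` on a 1920x1080 screenshot would map to the pixel position (960.0, 324.0). A mismatch at this step (for instance, treating the output as absolute pixels) would lower the score regardless of model size, so it may be worth ruling out when comparing against the reported numbers.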
