Inquiry about the reproducibility of Qwen3-VL-2B-Instruct · Issue #23 · likaixin2000/ScreenSpot-Pro-GUI-Grounding

Hi,
Thanks for the great details in the Qwen3VL reference implementation at the function below:

ScreenSpot-Pro-GUI-Grounding/models/qwen3vl.py

Lines 20 to 61 in 573b40e

    
           def get_qwen3vl_prompt_msg(image, instruction, screen_width, screen_height): 
        
               return [ 
        
                   { 
        
                       "role": "system", 
        
                       "content": [ 
        
                           { 
        
                               "type": "text", 
        
                               "text":  
        
                               """You are a helpful assistant. The user will give you an instruction, and you MUST left click on the corresponding UI element via tool call. If you are not sure about where to click, guess a most likely one.\n\n# Tools 
        
           You may call one or more functions to assist with the user query. 
        
           You are provided with function signatures within <tools></tools> XML tags: 
        
           <tools> 
        
           {"type": "function", "function": {"name": "computer_use", "description": "Use a mouse to interact with a computer.\n* The screen's resolution is 1000x1000.\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. \n* You can only use the left_click action to interact with the computer.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\n* `left_click`: Click the left mouse button with coordinate (x, y).", "enum": ["left_click"], "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=left_click`.", "type": "array"}, "required": ["action"], "type": "object"}}} 
        
           </tools> 
        
           For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags: 
        
           <tool_call> 
        
           {"name": <function-name>, "arguments": <args-json-object>} 
        
           </tool_call>""".replace(r"{screen_width}", str(screen_width)).replace(r"{screen_height}", str(screen_height)) 
        
                           } 
        
                       ] 
        
                   }, 
        
                   { 
        
                       "role": "user", 
        
                       "content": [ 
        
                           { 
        
                               "type": "image_url", 
        
                               # "min_pixels": 3136, 
        
                               # "max_pixels": 12845056, 
        
                               "image_url": { 
        
                                   "url": "data:image/png;base64," + convert_pil_image_to_base64(image) 
        
                               } 
        
                           }, 
        
                           { 
        
                               "type": "text", 
        
                               "text": instruction 
        
                           } 
        
                       ] 
        
                   } 
        
               ]

By strictly following this prompt and using a sufficiently large max_pixels, I successfully reproduced the performance of Qwen3-VL-8B-Instruct. However, for Qwen3-VL-2B-Instruct, I only achieved a score of 43.33, whereas the technical report shows 48.5. I also tried the prompt shown in the report, but obtained even lower performance.
Is there anything I might be missing that could help me reproduce the correct result for this smaller model?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inquiry about the reproducibility of Qwen3-VL-2B-Instruct #23

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

	def get_qwen3vl_prompt_msg(image, instruction, screen_width, screen_height):
	return [
	{
	"role": "system",
	"content": [
	{
	"type": "text",
	"text":
	"""You are a helpful assistant. The user will give you an instruction, and you MUST left click on the corresponding UI element via tool call. If you are not sure about where to click, guess a most likely one.\n\n# Tools

	You may call one or more functions to assist with the user query.

	You are provided with function signatures within <tools></tools> XML tags:
	<tools>
	{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse to interact with a computer.\n* The screen's resolution is 1000x1000.\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. \n* You can only use the left_click action to interact with the computer.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\n* `left_click`: Click the left mouse button with coordinate (x, y).", "enum": ["left_click"], "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=left_click`.", "type": "array"}, "required": ["action"], "type": "object"}}}
	</tools>

	For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
	<tool_call>
	{"name": <function-name>, "arguments": <args-json-object>}
	</tool_call>""".replace(r"{screen_width}", str(screen_width)).replace(r"{screen_height}", str(screen_height))
	}
	]
	},
	{
	"role": "user",
	"content": [
	{
	"type": "image_url",
	# "min_pixels": 3136,
	# "max_pixels": 12845056,
	"image_url": {
	"url": "data:image/png;base64," + convert_pil_image_to_base64(image)
	}
	},
	{
	"type": "text",
	"text": instruction
	}
	]
	}
	]

Inquiry about the reproducibility of Qwen3-VL-2B-Instruct #23

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions