Performance of Qwen3.5 #26

Hello, thanks for sharing the metric.

I evaluated Qwen3.5 on ScreenSpot-Pro, and the score I reproduced is higher than the reported one. The paper states 68.5, but I get 71 both with and without thinking mode, so I am unsure which number should be considered correct.

The official performance is listed here:
https://huggingface.co/Qwen/Qwen3.5-397B-A17B-FP8#instruct-or-non-thinking-mode

For reproduction, I used this code:
https://github.com/likaixin2000/ScreenSpot-Pro-GUI-Grounding/blob/main/models/qwen3_5.py

Could you clarify where this discrepancy comes from?
