Hello, thanks for sharing the metric.
I evaluated Qwen3.5 on ScreenSpot Pro, and my reproduced score is higher than the reported one: the paper states 68.5, but I get 71, both with and without thinking mode. I am not sure which number should be considered correct.
The official performance is listed here:
https://huggingface.co/Qwen/Qwen3.5-397B-A17B-FP8#instruct-or-non-thinking-mode
For reproduction, I used this code:
https://github.com/likaixin2000/ScreenSpot-Pro-GUI-Grounding/blob/main/models/qwen3_5.py
Could you clarify why there is this discrepancy?