Hello Size (@wusize), thank you for your excellent work!
While reproducing your results, I ran into an interesting issue with the OV-COCO experiment using the ViT-B16 model. I noticed that the F-ViT model was trained at a resolution of 640 instead of 1024. Although 640 is not the resolution used during backbone distillation, the F-ViT model achieved a higher novel AP at 640 than at 1024.
Could you explain why this discrepancy occurs and what the reasoning behind this design choice was? Was the decision based on experimental results?
I hope you can help clear this up. Thank you!