Any plans or experiments on multi-image (multi-view) reasoning?

Hi authors,

Congrats on the awesome project! Really cool work.
I was curious whether you’ve done any experiments on multi-image reasoning and grounding. Since Qwen2.5-VL already supports multi-image input, it seems like an interesting direction to explore.

Thanks!