First of all, I would like to sincerely thank the OpenUni team for fully open-sourcing this unified multimodal model for both understanding and generation. This is truly valuable for the community.
I have tested the internvl3_2b_sana_1_6b_512_hf_blip3o60k checkpoint and achieved generation results that are close to those reported in the paper.
However, I am not sure how to run image-text understanding inference directly with OpenUni, e.g., for benchmarks such as MMBench, MMMU, or MMStar. A sketch of what I have in mind follows.
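To make the question concrete, below is roughly what I would expect understanding-only inference to look like if one queries the InternVL3-2B backbone through its standard `chat` remote-code interface. Everything here is my assumption rather than OpenUni's documented API: the model path, the prompt format, and the simplified single 448x448 tile preprocessing (instead of InternVL's dynamic tiling).

```python
# Minimal sketch, assuming the bare InternVL3-2B backbone and its
# `model.chat` remote-code interface; NOT a confirmed OpenUni entry point.
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL3-2B"  # assumption: base LMM, not the OpenUni checkpoint
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Simplified preprocessing: one 448x448 tile with ImageNet normalization,
# in place of InternVL's dynamic tiling, purely for illustration.
transform = T.Compose([
    T.Resize((448, 448), interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
pixel_values = transform(Image.open("example.jpg").convert("RGB"))
pixel_values = pixel_values.unsqueeze(0).to(torch.bfloat16).cuda()

question = "<image>\nDescribe this image."
response = model.chat(tokenizer, pixel_values, question,
                      dict(max_new_tokens=512, do_sample=False))
print(response)
```

Is querying the released OpenUni checkpoint this way supported, or is a different entry point intended? If the recommended route is instead an evaluation toolkit such as VLMEvalKit, a pointer to the expected model wrapper would also be very helpful.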