First of all, I would like to sincerely thank the OpenUni team for fully open-sourcing this unified multimodal model for both understanding and generation. This is truly valuable for the community.
I have tested the internvl3_2b_sana_1_6b_512_hf_blip3o60k checkpoint and achieved generation results that are close to those reported in the paper.
However, I am not sure how to run image-text understanding inference directly with OpenUni, e.g., for benchmarks such as MMBench, MMMU, or MMStar. A sketch of what I have in mind follows.
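To make the question concrete, below is roughly what I would expect understanding-only inference to look like if one queries the InternVL3-2B backbone through its standard `chat` remote-code interface. Everything here is my assumption rather than OpenUni's documented API: the model path, the prompt format, and the simplified single 448x448 tile preprocessing (instead of InternVL's dynamic tiling).

```python
# Minimal sketch, assuming the bare InternVL3-2B backbone and its
# `model.chat` remote-code interface; NOT a confirmed OpenUni entry point.
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL3-2B"  # assumption: base LMM, not the OpenUni checkpoint
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Simplified preprocessing: one 448x448 tile with ImageNet normalization,
# in place of InternVL's dynamic tiling, purely for illustration.
transform = T.Compose([
    T.Resize((448, 448), interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
pixel_values = transform(Image.open("example.jpg").convert("RGB"))
pixel_values = pixel_values.unsqueeze(0).to(torch.bfloat16).cuda()

question = "<image>\nDescribe this image."
response = model.chat(tokenizer, pixel_values, question,
                      dict(max_new_tokens=512, do_sample=False))
print(response)
```

Is querying the released OpenUni checkpoint this way supported, or is a different entry point intended? If the recommended route is instead an evaluation toolkit such as VLMEvalKit, a pointer to the expected model wrapper would also be very helpful.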