Question about time efficiency in the paper

Thanks for your great work.

I noticed that you use various techniques to accelerate inference. According to your paper, DriveVLM uses MA-LMM to encode video input. This confuses me, because MA-LMM is not time-efficient when encoding a large number of frames: for example, it takes about 500 seconds to encode 40 frames, as reported in the MA-LMM paper. That seems hard to reconcile with real-time processing. Am I missing something, or could you provide further insight?

![image](https://github.com/user-attachments/assets/d3ae26ca-ce95-4917-b22f-77574dd91b06)
