Thanks for your great work.
I noticed that you use various techniques to accelerate inference. According to your paper, DriveVLM uses MA-LMM to encode video input, which confuses me because MA-LMM is not time-efficient when encoding a large number of frames. For example, the MA-LMM paper reports that encoding 40 frames takes about 500 seconds, which works out to roughly 12.5 seconds per frame. That seems hard to reconcile with real-time processing. Am I missing something, or could you share more details on how this is handled?
