Thank you for your excellent work, but I have some questions.
- From Tab. 1 of the paper, it seems that the method applies a single linear layer with a GLU activation, but I can find no mention of the GLU activation function in the code. Have I missed something or misunderstood? (A sketch of what I understand the layer to be is below this list.)
- Have you tried other language models such as T5 as the text encoder? Were GTE-en-large-v1.5 and NV-Embed-v2 chosen because they output a CLS token that can be aligned with the image's CLS token?
- During the Alignment Tuning stage, are only the pre-encoded image and text CLS tokens loaded onto the GPU, rather than the raw inputs or the frozen encoders? (See the second sketch below for what I have in mind.)
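
To make the first question concrete, here is a minimal sketch of what I understood "a single linear layer with GLU activation" from Tab. 1 to mean. This is only my interpretation, not code from the repository; the class name `GLUProjection` and the dimensions are placeholders.

```python
import torch
import torch.nn as nn


class GLUProjection(nn.Module):
    """Hypothetical sketch: one linear layer whose output is split by a GLU
    into a value half and a gate half, i.e. value * sigmoid(gate)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        # Project to 2 * out_dim so the GLU can halve the last dimension.
        self.proj = nn.Linear(in_dim, 2 * out_dim)
        self.glu = nn.GLU(dim=-1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.glu(self.proj(x))
```

Is this roughly the projection described in the paper, or does the released code intentionally use a plain linear layer without the GLU?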
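
And for the third question, this is the kind of pipeline I am imagining for Alignment Tuning: the images and texts are encoded once offline, and training then only moves the small CLS embedding tensors to the GPU. Again, this is purely an assumption on my part; the file names, dataset class, and batch size are made up for illustration.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class PrecomputedCLSDataset(Dataset):
    """Hypothetical sketch: serve pre-encoded (image CLS, text CLS) pairs,
    so only these embedding tensors ever need to reach the GPU."""

    def __init__(self, image_cls_path: str, text_cls_path: str):
        # Assumed file layout: one (N, dim) tensor of CLS embeddings per modality.
        self.image_cls = torch.load(image_cls_path, map_location="cpu")
        self.text_cls = torch.load(text_cls_path, map_location="cpu")
        assert len(self.image_cls) == len(self.text_cls)

    def __len__(self):
        return len(self.image_cls)

    def __getitem__(self, idx):
        return self.image_cls[idx], self.text_cls[idx]


# Example usage with placeholder paths.
loader = DataLoader(
    PrecomputedCLSDataset("image_cls.pt", "text_cls.pt"),
    batch_size=256,
    shuffle=True,
)
```

Is this close to how the released training code handles the pre-encoded features?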