Thanks for sharing the source code.
I have a few questions regarding the input embeddings.
From your code, all text and images are pre-encoded with BERT and ResNet. I therefore assume both models are not trainable, and that you use these embeddings only as inputs to your cross-modality model. (Please correct me if I'm wrong.)
- Could you kindly advise how you plotted the attention maps in Section 4.6, given that all inputs are embeddings rather than actual text or images?
- Have you experimented with unfreezing BERT and ResNet and training end-to-end?
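For reference, here is a minimal sketch of what I mean by "not trainable" — this is my assumption about your setup, not your actual code, with `encoder` and `head` as hypothetical stand-ins for the pretrained encoder and the cross-modality model:

```python
import torch
import torch.nn as nn

# Assumed setup: the pretrained encoder (BERT/ResNet stand-in) is frozen
# by disabling gradients, so only the cross-modality head is updated.
encoder = nn.Linear(16, 8)   # stand-in for a pretrained encoder
for p in encoder.parameters():
    p.requires_grad_(False)  # frozen: no gradient updates

head = nn.Linear(8, 2)       # stand-in for the trainable cross-modality model

x = torch.randn(4, 16)
loss = head(encoder(x)).sum()
loss.backward()

print(encoder.weight.grad)            # None — the frozen encoder gets no gradient
print(head.weight.grad is not None)   # True — the head still trains
```

In this sketch the embeddings could equivalently be pre-computed offline and fed in directly, which is what I understood your code to do.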
Thanks in advance.