Hi,
Thanks for your fantastic work! I am trying to apply it to videos by combining the VC features with I3D features, and I have run into a problem. For each frame of a video, I get VC features of size Nx1024, where N is the number of bounding boxes detected in that frame. This doesn't match the size of the I3D features. So I have been adding the features of all N bounding boxes elementwise to get a single feature vector of shape 1024 per frame.
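To make the setup concrete, here is a minimal sketch of what I am doing, using random arrays in place of the real features. The helper name `pool_vc_features`, the `mean` option, the 1024-dimensional I3D vector, and the final concatenation are assumptions for illustration only; they are not taken from your code.

```python
import numpy as np


def pool_vc_features(vc_feats: np.ndarray, mode: str = "sum") -> np.ndarray:
    """Collapse (N, D) per-box VC features into a single (D,) frame vector.

    Hypothetical helper: "sum" is what I do now, "mean" is the same vector
    divided by N, so its scale does not grow with the number of boxes.
    """
    if vc_feats.ndim != 2:
        raise ValueError("expected an array of shape (N, D)")
    if vc_feats.shape[0] == 0:
        # No boxes detected in this frame: fall back to a zero vector.
        return np.zeros(vc_feats.shape[1], dtype=vc_feats.dtype)
    if mode == "sum":
        return vc_feats.sum(axis=0)
    if mode == "mean":
        return vc_feats.mean(axis=0)
    raise ValueError(f"unknown mode: {mode!r}")


# Stand-in data: 5 detected boxes with 1024-d VC features for one frame.
rng = np.random.default_rng(0)
vc_feats = rng.standard_normal((5, 1024)).astype(np.float32)

# Assumed I3D feature size; the real size depends on the I3D layer used.
i3d_feat = rng.standard_normal(1024).astype(np.float32)

frame_vc = pool_vc_features(vc_feats, mode="sum")   # shape (1024,)
combined = np.concatenate([frame_vc, i3d_feat])     # one way to combine them
print(frame_vc.shape, combined.shape)
```

With the sum, the magnitude of the pooled vector grows with N, so frames with many boxes end up on a different scale from frames with few. That is part of why I am unsure about it.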
Do you think this is a good idea? Will the information in the per-box features be preserved if I sum them like this? If not, do you have a better suggestion for how to combine them with the I3D features?
Thanks!