First of all, thank you for this innovative paper.
The paper states that you "mix the two image pairs" before the mixed pair goes to the encoder (Swin Transformer).
However, in the released code, only a single image appears in both the processing step and the loss computation.
So the code does not seem to match the paper.
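For reference, what I understood from the paper is a mixup-style convex combination of the two images before encoding, something like the following sketch (function and variable names here are my own guesses, not your actual code):

```python
import numpy as np

def mix_image_pairs(img_a, img_b, lam=0.5):
    # My reading of the paper's "mix the two image pairs" step:
    # a pixel-wise convex combination, lam * A + (1 - lam) * B.
    return lam * img_a + (1.0 - lam) * img_b

# Two toy "images" of shape (H, W, C)
a = np.zeros((4, 4, 3))
b = np.ones((4, 4, 3))
mixed = mix_image_pairs(a, b, lam=0.25)  # every pixel becomes 0.75
```

Is this roughly what the paper intends, and if so, where does it happen in the released code?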
Also, I trained the model with 8 GPUs and a batch size of 128, with accum_iteration set to 2, but I could not reproduce the performance reported in the paper.
Is the released code not the latest version?