I think I found a possible misstatement in the paper, in Figure 10 of Section 4.1. The caption states that the left image was generated with context and the right one without, but the figure appears to show the opposite. I'm not sure whether this is a typo.
Additionally, I cannot tell which image has better consistency; the right one seems like it might be better. How did you evaluate consistency when context was used as a variable? I did not see any quantitative analysis of this ablation experiment in the paper.