Each annotation requires listening to a pair of tracks that are played in independent players. As a result, switching from A to B means stopping player A and then starting player B.
This could be optimized to reduce the number of clicks by automatically stopping one player when the other starts playing.
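
A minimal sketch of how this might work, assuming the two players are standard HTML audio elements (the tool's actual player implementation is not described here, so this is only an illustration). The names `PlayerLike` and `pauseOthers` are hypothetical and introduced just for this example.

```typescript
// Hypothetical minimal shape of a player. A browser HTMLAudioElement
// satisfies it, but the real tool's players may differ.
interface PlayerLike {
  readonly paused: boolean;
  pause(): void;
}

// Pause every player in the group except the one that just started.
function pauseOthers(started: PlayerLike, group: PlayerLike[]): void {
  for (const player of group) {
    if (player !== started && !player.paused) {
      player.pause();
    }
  }
}

// Possible browser wiring, assuming playerA and playerB are HTMLAudioElement
// instances (assumption, not taken from the tool's code):
//
//   const players = [playerA, playerB];
//   players.forEach((p) =>
//     p.addEventListener("play", () => pauseOthers(p, players))
//   );
```

With wiring like this, pressing play on B would pause A automatically, so switching between the two tracks takes one click instead of two.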