Benchmark table is inconsistent between the paper and the website (am I wrong?)

I read the paper and compared it with the leaderboard on this website: https://www.indobenchmark.com/leaderboard.html. The sequence labelling benchmark results do not seem to match. I also ran my own fine-tuning, and my results are closer to those in the paper than to those on the website. Is there an explanation for this discrepancy?