About the evaluation benchmarks

In the paper, you test the method on several datasets, such as NQ, TriviaQA, PopQA, HotpotQA, 2Wiki, MuSiQue, and Bamboogle. How do you evaluate these downstream tasks? Do you use any existing benchmarks?