Evaluation Script for QA Datasets

Thanks for your nice work! I have a question about the evaluation of the QA datasets. I computed the exact match (EM) metric myself and got scores much lower than the numbers reported in your paper. Could you share the evaluation script you used for the QA datasets? Thanks!
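
For context, EM scores can shift a lot depending on how answers are normalized before comparison, so that may be where the gap comes from. Below is a minimal sketch of a SQuAD-style EM computation. The function names (`normalize_answer`, `exact_match`, `em_score`) are hypothetical, and it is only an assumption that the paper uses this kind of normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace) and counts a match against any of several gold answers.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """SQuAD-style normalization (assumed, not confirmed by the paper)."""
    text = text.lower()
    # Drop ASCII punctuation characters.
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # Remove English articles as whole words.
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def em_score(predictions: list[str], references: list[list[str]]) -> float:
    """Percentage of predictions that exactly match a gold answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, golds) for p, golds in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

With this sketch, a prediction such as "The Eiffel Tower." counts as a match for the gold answer "eiffel tower", while "Paris, France" does not match "Paris". Skipping the normalization step, or comparing against only the first gold answer, would lower the score, which is why the exact script matters for reproducing the reported numbers.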