Hello,
The functions in the "data extraction" folder seem to be dedicated to evaluation, but they produce very strange results (e.g. a "Yes" generated answer getting full points when the gold label was "No"). Am I missing something? Google reported around 70 with a 4B model in their paper, and I assume they didn't do simple "==" matching.
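For context, here is a minimal sketch of the kind of normalized exact-match scoring I would have expected instead of raw "==" comparison. This is purely illustrative (the function names are my own, not the repo's):

```python
import re
import string


def normalize(answer: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    answer = answer.lower().strip()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", answer)


def exact_match(prediction: str, gold: str) -> bool:
    """Score a hit only when the normalized strings agree."""
    return normalize(prediction) == normalize(gold)
```

With something like this, `exact_match("Yes.", "yes")` is a hit, but `exact_match("Yes", "No")` is not, which is what I'd expect from the paper's numbers.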