Hi! I'm currently debugging and updating lm_eval's implementation of minerva_math. Right now we basically use:
```python
gold = parse(normalize_final_answer(remove_boxed(last_boxed_only_string(doc["solution"]))))
answer = parse(raw_model_output)
verify(gold, answer)
```
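For context, here is a minimal, self-contained sketch of that pipeline, assuming the `math_verify` package (`pip install math-verify`) for `parse`/`verify`. The helpers below are simplified stand-ins for lm_eval's minerva_math utils, just enough to reproduce the flow on one example; the real `normalize_final_answer` does substantially more LaTeX rewriting than shown here:

```python
from math_verify import parse, verify

def last_boxed_only_string(s: str) -> str:
    """Return the last \\boxed{...} span, matching braces by depth."""
    start = s.rfind("\\boxed{")
    if start == -1:
        return ""
    depth = 0
    for i in range(start + len("\\boxed"), len(s)):
        if s[i] == "{":
            depth += 1
        elif s[i] == "}":
            depth -= 1
            if depth == 0:
                return s[start : i + 1]
    return s[start:]

def remove_boxed(s: str) -> str:
    """Strip the surrounding \\boxed{...} wrapper."""
    return s[len("\\boxed{") : -1]

def normalize_final_answer(s: str) -> str:
    """Stand-in only; the real helper normalizes units, spacing, etc."""
    return s.strip()

doc = {"solution": "Thus the answer is $\\boxed{\\dfrac{9}{7}}$."}
raw_model_output = "The answer is $\\boxed{\\frac{9}{7}}$."

gold = parse(normalize_final_answer(remove_boxed(last_boxed_only_string(doc["solution"]))))
answer = parse(raw_model_output)
# Prints False: the gold reduces to parse("\\dfrac{9}{7}") -> [],
# which is exactly the edge case described below.
print(verify(gold, answer))
```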
However, this has some edge cases. For example, for a gold answer (after removing the \boxed{}), parsing fails:
parse("\\dfrac{9}{7}")
>>> []
but with the \boxed{} wrapper it works correctly:
parse("\\boxed{\\dfrac{9}{7}}")
>>> [9/7, '\\frac{9}{7}']
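One workaround that seems to follow from this: since `parse` handles the \boxed{} wrapper fine, feed it the still-boxed string (i.e., skip `remove_boxed`, or re-wrap before parsing). A sketch under that assumption; `wrap_boxed` here is a hypothetical helper, not part of lm_eval, lighteval, or math_verify:

```python
from math_verify import parse, verify

def wrap_boxed(expr: str) -> str:
    """Hypothetical helper: re-wrap an extracted answer in \\boxed{...}."""
    return "\\boxed{" + expr + "}"

# parse("\\dfrac{9}{7}") returns [] (as shown above), but the
# re-wrapped form parses correctly.
gold = parse(wrap_boxed("\\dfrac{9}{7}"))   # [9/7, '\\frac{9}{7}']
answer = parse("The answer is $\\boxed{\\frac{9}{7}}$.")
print(verify(gold, answer))                 # True
```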
I was looking at how lighteval does it. Would this approach generally work on most MATH tasks, or do y'all handle subtasks differently? Should we also normalize before parsing?
Would appreciate any thoughts!