GeoVerify (created to spit in Meta's face) seems to be a strong mathematical-equivalence evaluator, and it would be great to use math-verify's parsing alongside GeoVerify's equivalence engine.
Its reported performance is impressive, though it's not exactly an apples-to-apples comparison: the original benchmark is closed.
| Method | Parameters | Agreement (%) | Precision (%) | Recall (%) | F1 (%) |
|---|---|---|---|---|---|
| math-verify (rule-based) | 0 | 5.95 | 5.38 | 6.67 | 5.96 |
| general-verifier | 1.5B | 82.74 | 83.13 | 93.24 | 87.90 |
| CompassVerifier | 32B | 91.66 | 94.20 | 86.67 | 90.28 |
| Qwen3-4B (prompted) | 4B | 92.26 | 89.74 | 93.33 | 91.50 |
| Qwen3-14B (prompted) | 14B | 93.45 | 92.21 | 94.67 | 93.42 |
| o3 (prompted) | undisclosed | 94.05 | 93.33 | 93.33 | 93.33 |
| GPT-OSS-20B (prompted) | 3.6B (active) | 94.64 | 95.83 | 92.00 | 93.88 |
| GPT-OSS-120B (prompted) | 5.1B (active) | 95.24 | 97.18 | 92.00 | 94.52 |
| GeoVerify | 0 | 95.88 | 94.81 | 96.05 | 95.42 |
It was made by Richard Aragon and is MIT-licensed.