I noticed that the HealthBench paper mentions:
Evaluating LLMs on clinical tasks is challenging and expensive. HealthBench, while leveraging GPT-4.1 as a judge to reduce human effort, still incurs API costs.
Could you share approximately how much a single evaluation costs when using GPT-4.1 as the judge?
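
For context, this is the back-of-the-envelope arithmetic I would use once the real numbers are known. It is only a sketch: the function name `estimate_judge_cost`, its parameters, and every number in the usage line are my own placeholders, not figures from the paper or from any price list.

```python
def estimate_judge_cost(
    n_judge_calls: int,
    input_tokens_per_call: float,
    output_tokens_per_call: float,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Rough API cost in USD for one evaluation run with an LLM judge.

    All arguments are assumptions supplied by the caller:
    - n_judge_calls: how many grader requests one run makes
    - *_tokens_per_call: average prompt / completion size per request
    - usd_per_million_*: the judge model's per-token prices
    """
    # Total tokens sent to and received from the judge across the run.
    total_input = n_judge_calls * input_tokens_per_call
    total_output = n_judge_calls * output_tokens_per_call

    # Prices are quoted per million tokens, so scale accordingly.
    input_cost = total_input / 1_000_000 * usd_per_million_input
    output_cost = total_output / 1_000_000 * usd_per_million_output
    return input_cost + output_cost


if __name__ == "__main__":
    # Placeholder values only, chosen to make the arithmetic easy to follow.
    cost = estimate_judge_cost(1_000, 2_000, 100, 2.0, 8.0)
    print(f"Estimated cost: ${cost:.2f}")
```

If someone can share the typical number of judge calls and tokens per call for a full run, I can plug them in myself.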