Cost of a Single Evaluation Using GPT-4.1 in HealthBench

I noticed that the HealthBench paper states:
`Evaluating LLMs on clinical tasks is challenging and expensive. HealthBench, while leveraging GPT-4.1 as a judge to reduce human effort, still incurs API costs.`
Approximately how much does it cost to perform a single evaluation using GPT-4.1 as a judge?
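
For context, here is the back-of-the-envelope arithmetic I would use once the real numbers are known. It is only a sketch: the function name `judge_cost_usd`, the token counts, the per-million-token prices, and the number of judge calls are all placeholder assumptions of mine, not figures from the paper or from any official price list.

```python
def judge_cost_usd(input_tokens, output_tokens,
                   input_price_per_m, output_price_per_m, calls=1):
    """Estimate the API cost in USD of LLM-as-judge grading.

    input_tokens / output_tokens: average tokens per judge call (assumed).
    input_price_per_m / output_price_per_m: USD per 1M tokens (assumed;
        check the provider's current pricing page).
    calls: number of judge calls, e.g. one per rubric criterion (assumed).
    """
    per_call = (input_tokens * input_price_per_m
                + output_tokens * output_price_per_m) / 1_000_000
    return per_call * calls


if __name__ == "__main__":
    # All values below are hypothetical placeholders, not measured data.
    estimate = judge_cost_usd(
        input_tokens=1500,       # prompt + response + rubric item (assumed)
        output_tokens=200,       # judge's explanation and verdict (assumed)
        input_price_per_m=2.0,   # assumed USD per 1M input tokens
        output_price_per_m=8.0,  # assumed USD per 1M output tokens
        calls=10,                # assumed judge calls per graded response
    )
    print(f"Estimated cost: ${estimate:.4f}")
```

If someone can share typical token counts per judge call and the number of judge calls per example, plugging them in here would give a concrete per-evaluation figure.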