Thank you for all your hard work. I love the idea, but I'm having trouble trusting the model score graph.
Say I click on a model, like GPT 5.4, and look at the measurement timestamps. The intervals between tests range from 1 minute to 4 hours (e.g., one pair of points at April 15th 1:00am and 1:01am), but the x-axis isn't to scale, so equally spaced points can represent wildly different time gaps. That makes the chart feel misleading and the trend harder to interpret. I'd like to see more consistency in when the models get tested.
Anyway, good work and I'm glad someone is trying to measure this because it seems like everyone is in the dark about random performance degradation.