e.g. monitor the eval as it runs: once the number of already-failed problems makes it impossible to beat the top score, stop the eval and cancel the remaining problems
do the same for screeners, stopping once the threshold can no longer be met
See: #289
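
A minimal sketch of the stopping rule, assuming the score is simply the number of passed problems (the scoring is not specified here, so this is an assumption). The function names are hypothetical and not taken from the codebase:

```python
def best_possible_score(total: int, failed: int) -> int:
    """Upper bound on the final score if every unfinished problem passes.

    Assumes one point per passed problem (an assumption for illustration).
    """
    return total - failed


def should_stop_eval(total: int, failed: int, top_score: int) -> bool:
    """True once the run can no longer strictly beat the current top score."""
    return best_possible_score(total, failed) <= top_score


def should_stop_screener(total: int, failed: int, threshold: int) -> bool:
    """True once the run can no longer reach the screener threshold."""
    return best_possible_score(total, failed) < threshold
```

The eval check uses `<=` because a tie does not beat the top score, while the screener check uses `<` on the assumption that meeting the threshold exactly counts as a pass.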
it would also be useful to reduce the concurrency: with fewer problems in flight when a run is stopped, less inference is spent on work that gets cancelled
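
One possible shape for this, sketched with `asyncio` under the assumption that each problem is solved by an async callable returning pass/fail. `run_with_early_stop`, `solve`, `target` and `concurrency` are hypothetical names for illustration, not the project's actual API:

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def run_with_early_stop(
    problems: Sequence[str],
    solve: Callable[[str], Awaitable[bool]],  # hypothetical per-problem runner
    target: int,  # minimum number of passes needed (assumed scoring)
    concurrency: int = 4,
) -> tuple[int, int, bool]:
    """Run problems with bounded concurrency and stop once `target` is out of reach.

    Returns (passed, failed, stopped_early).
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(problem: str) -> bool:
        # At most `concurrency` problems are being solved at any time.
        async with semaphore:
            return await solve(problem)

    tasks = [asyncio.create_task(guarded(p)) for p in problems]
    passed = failed = 0
    stopped_early = False
    try:
        for finished in asyncio.as_completed(tasks):
            if await finished:
                passed += 1
            else:
                failed += 1
            # Even if everything still pending passes, the target is unreachable.
            if len(problems) - failed < target:
                stopped_early = True
                break
    finally:
        # Cancel whatever is still queued or in flight; no-op for finished tasks.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
    return passed, failed, stopped_early
```

With a lower `concurrency`, at most that many problems are mid-inference when the stop condition fires, so less work is thrown away on cancellation. The trade-off is a slower run when no early stop happens.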