Thread utilization does not reach its maximum because the SAD calculation has to wait for the previous frame to finish. As a result, many threads spend a considerable amount of time waiting.

Also, no measurable time is spent in the SAD function itself (computeTextureSAD). We should try moving this function into the main thread: it is cheap enough to run there quickly, so the worker threads can keep working instead of waiting.
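
A rough sketch of the idea, not the actual implementation. It assumes the SAD only needs the input textures of the current and previous frame, which may not hold in our code. `Frame`, `processFrame` and `processAll` are hypothetical names, and the body of `computeTextureSAD` is a simplified stand-in for the real function. The main thread computes each SAD sequentially and hands the value to the worker, so no worker blocks on the previous frame:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <future>
#include <vector>

// Hypothetical frame type: a flat buffer of 8-bit texture samples.
using Frame = std::vector<std::uint8_t>;

// Simplified stand-in for computeTextureSAD: sum of absolute differences
// between two frames (assumption: only the input textures are needed).
long computeTextureSAD(const Frame& cur, const Frame& prev) {
    long sad = 0;
    for (std::size_t i = 0; i < cur.size() && i < prev.size(); ++i)
        sad += std::abs(static_cast<int>(cur[i]) - static_cast<int>(prev[i]));
    return sad;
}

// Hypothetical placeholder for the expensive per-frame work done by a worker.
long processFrame(const Frame& frame, long sad) {
    long acc = sad;
    for (std::uint8_t v : frame) acc += v;
    return acc;
}

std::vector<long> processAll(const std::vector<Frame>& frames) {
    std::vector<std::future<long>> jobs;
    jobs.reserve(frames.size());
    for (std::size_t i = 0; i < frames.size(); ++i) {
        // The cheap SAD runs here, in the main thread, before dispatching.
        // The first frame has no predecessor, so its SAD is taken as 0.
        const long sad = (i == 0) ? 0 : computeTextureSAD(frames[i], frames[i - 1]);
        // The worker receives the finished SAD and never waits for frame i-1.
        jobs.push_back(std::async(std::launch::async, [&frames, i, sad] {
            return processFrame(frames[i], sad);
        }));
    }
    std::vector<long> results;
    results.reserve(jobs.size());
    for (auto& job : jobs) results.push_back(job.get());
    return results;
}
```

If the SAD actually depends on the output of the previous frame's processing rather than on its input, this reordering does not remove the dependency, and the waiting would have to be addressed differently.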