Currently, masking on these is applied after the fact, so a lot of compute may be wasted on values that won't survive the mask. A better approach would be to pass the mask into the compiled code, iterate over it in linalg.generic as a third input alongside the two existing operands, and apply the mask inside the loop body so that masked-out elements are skipped rather than computed and discarded.
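
The difference between the two strategies can be sketched in plain Python. This is only a model of the loop structure, not MLIR and not the project's actual code: `expensive`, `mask_after`, `mask_inside`, the fill value of 0, and the call counter are all hypothetical names and choices introduced for illustration. It assumes an elementwise operation over two inputs, with the mask acting as the third iterated input.

```python
def expensive(a, b, counter):
    """Stand-in for the per-element work; counts how often it runs."""
    counter[0] += 1
    return a * b + a


def mask_after(xs, ys, mask):
    """Current behaviour: compute every element, then mask the result."""
    calls = [0]
    full = [expensive(x, y, calls) for x, y in zip(xs, ys)]
    out = [v if m else 0 for v, m in zip(full, mask)]
    return out, calls[0]


def mask_inside(xs, ys, mask):
    """Proposed behaviour: iterate over the mask as a third input and
    skip the work for masked-out elements inside the loop."""
    calls = [0]
    out = [expensive(x, y, calls) if m else 0
           for x, y, m in zip(xs, ys, mask)]
    return out, calls[0]


xs = [1, 2, 3, 4]
ys = [10, 20, 30, 40]
mask = [True, False, False, True]

print(mask_after(xs, ys, mask))   # same values, work done for all 4 elements
print(mask_inside(xs, ys, mask))  # same values, work done for 2 elements
```

Both functions return the same masked output, but the second only does the per-element work where the mask is set. Whether the compiled loop actually saves that work depends on how the mask is applied in the loop body: a conditional that guards the computation would skip it, while a plain select on an already computed value would not.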