Make masking more efficient for `mxm`, `mxv`, `vxm`

Currently, the mask for these operations is applied after the full result has been computed, so compute is wasted on output entries that the mask then discards. A better approach would be to pass the mask into the compiled code, iterate over it in `linalg.generic` as a third input, and apply it inside the loop, so that work for masked-out entries can be skipped.
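
To make the potential saving concrete, here is a minimal pure-Python sketch of the two strategies for `mxm`. It is only an illustration: it uses dense nested lists rather than sparse tensors, it is not the MLIR that would actually be generated, and the function names `mxm_mask_after` and `mxm_mask_inside` are hypothetical. Both functions return the result along with a count of scalar multiplications performed.

```python
def mxm_mask_after(A, B, mask):
    """Compute the full product, then zero out entries outside the mask."""
    n, k, m = len(A), len(B), len(B[0])
    multiplies = 0
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
                multiplies += 1
    # Mask applied after the fact: the work above for these entries is wasted.
    for i in range(n):
        for j in range(m):
            if not mask[i][j]:
                C[i][j] = 0
    return C, multiplies


def mxm_mask_inside(A, B, mask):
    """Treat the mask as a third input and skip masked-out output entries."""
    n, k, m = len(A), len(B), len(B[0])
    multiplies = 0
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if not mask[i][j]:
                continue  # never compute an entry the mask would discard
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
                multiplies += 1
    return C, multiplies


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
mask = [[True, False], [False, True]]

result_after, cost_after = mxm_mask_after(A, B, mask)
result_inside, cost_inside = mxm_mask_inside(A, B, mask)
print(result_after == result_inside, cost_after, cost_inside)
```

Both versions produce the same masked result, but the second does work proportional to the number of entries the mask keeps. The same reasoning applies to `mxv` and `vxm`, where the mask is a vector over the output.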