Keep in mind for the future: the rewrite that the current kernel needs could also apply the batched matmul speed-up kernel.