Your port was quite helpful for my own version of llm.java, especially its file and endianness handling.
Like you, I used Java Stream parallelization. As a second step, I used TornadoVM to run the methods that implement the layers as CUDA kernels (blog).
Regards
Jürgen