Thank you to the authors for the foundational work on language-based 3D perception.
We've applied multi-token prediction to SceneScript and achieved a 5x acceleration without sacrificing accuracy.
This work has been accepted to CVPR 2026: https://arxiv.org/pdf/2512.05597