Description
Hi,
Thank you very much for providing such a valuable open-source project.
After reviewing the LingBot-World technical report and code, I have a question about the implementation details.
The technical report describes the method as an autoregressive video diffusion model, but after checking the code, it appears to be based on Wan2.2-Animate-14B and to implement a bidirectional video diffusion model. Since I do not yet have deep expertise in this area, I would like to ask whether it is indeed an autoregressive model, and if so, where the relevant implementation is located in the code.
More specifically, the code appears to initialize noise latents for all frames and run a single denoising pass with bidirectional attention (i.e., generate everything at once). In contrast, the technical report seems to describe chunk-wise generation with causal attention, but I have not been able to confirm from the implementation that either chunked generation or causal attention is actually present. Likewise, I do not see frame-level causal flags or a FlashAttention-compatible attention mask anywhere in the code.
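To make the distinction I am asking about concrete: the two attention patterns can be expressed as masks over frame tokens. This is only an illustrative sketch (not code from this repository, and `chunk_causal_mask` is a hypothetical helper name), assuming frames are grouped into fixed-size chunks that attend bidirectionally within a chunk and causally across chunks:

```python
import numpy as np

def chunk_causal_mask(num_frames: int, chunk_size: int) -> np.ndarray:
    """True where frame i may attend to frame j: bidirectional inside a
    chunk, causal across chunks (a chunk never sees later chunks)."""
    chunk_ids = np.arange(num_frames) // chunk_size
    # Allowed iff the query's chunk index is >= the key's chunk index.
    return chunk_ids[:, None] >= chunk_ids[None, :]

# Fully bidirectional attention (what the current code seems to do)
# corresponds to the all-True mask:
bidirectional = np.ones((8, 8), dtype=bool)

mask = chunk_causal_mask(num_frames=8, chunk_size=4)
# Frames 0-3 attend to each other; frames 4-7 also see 0-3,
# but frames 0-3 never see 4-7.
```

My question is essentially whether a mask of the second kind (or an equivalent blockwise-causal kernel) exists anywhere in the released code.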
From what I can see, the current implementation looks more like image-to-video generation (image animation based on Wan-Animate) with camera-control conditioning via Plücker coordinates.
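For reference, by "camera control via Plücker coordinates" I mean the common scheme where each pixel's viewing ray is encoded as the 6-D vector (d, o × d) from the camera origin o and ray direction d. A minimal sketch of that encoding, under my own assumptions and not taken from this repository:

```python
import numpy as np

def plucker_embedding(origin: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """origin: (3,) camera center; directions: (H, W, 3) per-pixel ray directions.
    Returns (H, W, 6) Plücker coordinates (d, o x d) per pixel."""
    # Normalize ray directions so the embedding is scale-invariant.
    d = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
    # Moment vector o x d identifies the line independent of the ray origin
    # chosen along it.
    moment = np.cross(np.broadcast_to(origin, d.shape), d)
    return np.concatenate([d, moment], axis=-1)
```

If this is indeed the conditioning pathway, that would be consistent with my reading that camera control, rather than autoregressive rollout, is what the code currently implements.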
Thank you.