Description
Hi,
Thank you very much for providing such a valuable open-source project.
After reviewing the LingBot-World technical report and code, I have a question about the implementation details.
The technical report describes the method as an autoregressive video diffusion model, but after checking the code, it appears to be based on Wan2.2-Animate-14B and to implement a bidirectional video diffusion model. Since I do not yet have deep expertise in this area, I would like to ask whether it is indeed an autoregressive model, and if so, where the relevant implementation is located in the code.
More specifically, the code appears to initialize noise latents for all frames and run a single denoising pass with bidirectional attention (i.e., generate everything at once). In contrast, the technical report seems to describe chunk-wise generation with causal attention, but I have not been able to confirm from the implementation that either chunked generation or causal attention is actually present. Likewise, I do not see frame-level causal flags or a FlashAttention-compatible attention mask anywhere in the code.
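To make the distinction I am asking about concrete: the two attention patterns can be expressed as masks over frame tokens. This is only an illustrative sketch (not code from this repository, and `chunk_causal_mask` is a hypothetical helper name), assuming frames are grouped into fixed-size chunks that attend bidirectionally within a chunk and causally across chunks:

```python
import numpy as np

def chunk_causal_mask(num_frames: int, chunk_size: int) -> np.ndarray:
    """True where frame i may attend to frame j: bidirectional inside a
    chunk, causal across chunks (a chunk never sees later chunks)."""
    chunk_ids = np.arange(num_frames) // chunk_size
    # Allowed iff the query's chunk index is >= the key's chunk index.
    return chunk_ids[:, None] >= chunk_ids[None, :]

# Fully bidirectional attention (what the current code seems to do)
# corresponds to the all-True mask:
bidirectional = np.ones((8, 8), dtype=bool)

mask = chunk_causal_mask(num_frames=8, chunk_size=4)
# Frames 0-3 attend to each other; frames 4-7 also see 0-3,
# but frames 0-3 never see 4-7.
```

My question is essentially whether a mask of the second kind (or an equivalent blockwise-causal kernel) exists anywhere in the released code.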
From what I can see, the current implementation looks more like image-to-video generation (image animation based on Wan-Animate) with camera-control conditioning via Plücker coordinates.
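For reference, by "camera control via Plücker coordinates" I mean the common scheme where each pixel's viewing ray is encoded as the 6-D vector (d, o × d) from the camera origin o and ray direction d. A minimal sketch of that encoding, under my own assumptions and not taken from this repository:

```python
import numpy as np

def plucker_embedding(origin: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """origin: (3,) camera center; directions: (H, W, 3) per-pixel ray directions.
    Returns (H, W, 6) Plücker coordinates (d, o x d) per pixel."""
    # Normalize ray directions so the embedding is scale-invariant.
    d = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
    # Moment vector o x d identifies the line independent of the ray origin
    # chosen along it.
    moment = np.cross(np.broadcast_to(origin, d.shape), d)
    return np.concatenate([d, moment], axis=-1)
```

If this is indeed the conditioning pathway, that would be consistent with my reading that camera control, rather than autoregressive rollout, is what the code currently implements.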
Thank you.