
Autoregressive model #18

@HyunsooCha

Description


Hi,

Thank you very much for providing such a valuable open-source project.
After reviewing the LingBot-World technical report and code, I have a question about the implementation details.

The technical report describes the method as an autoregressive video diffusion model, but after checking the code, it appears to be based on Wan2.2-Animate-14B and looks like a bidirectional video diffusion model. Since I do not have deep expertise in this area yet, I would like to ask whether it is indeed an autoregressive model, and if so, where the relevant implementation/code is located.

More specifically, this code appears to initialize noise latents for all frames and run a single forward pass with bidirectional attention (i.e., generate everything at once). In contrast, the technical report seems to describe chunk-wise generation with causal attention, but I have not been able to confirm (from the implementation) that either chunk generation or causal attention is actually implemented. Likewise, it does not appear to implement frame-by-frame causal flags or an attention mask compatible with FlashAttention.
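To make the question concrete, here is a minimal sketch (my own illustration, not code from this repository) of the kind of block-causal attention mask I would expect a chunk-wise autoregressive model to use: frames inside a chunk attend to each other bidirectionally, while across chunks attention only flows from later chunks back to earlier ones. The `block_causal_mask` helper and `chunk_size` parameter are hypothetical names for illustration:

```python
import numpy as np

def block_causal_mask(num_frames: int, chunk_size: int) -> np.ndarray:
    """Block-causal attention mask over frames (True = attention allowed).

    Frames within the same chunk attend bidirectionally; a frame may also
    attend to any frame in an earlier chunk, but never to a later chunk.
    This is a sketch of the pattern, not the repository's implementation.
    """
    chunk_ids = np.arange(num_frames) // chunk_size
    # query's chunk id >= key's chunk id  ->  attention allowed
    return chunk_ids[:, None] >= chunk_ids[None, :]

# 6 frames in chunks of 2: frames 0-1, 2-3, 4-5
mask = block_causal_mask(num_frames=6, chunk_size=2)
```

I could not find a mask of this shape (or an equivalent per-chunk generation loop) in the code, which is what prompted the question above.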

From what I can see, the current implementation looks more like image-to-video (image animation based on wan animate) with camera control conditioning via Plücker coordinates.
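For clarity on what I mean by Plücker-coordinate conditioning: each pixel's camera ray with origin o and unit direction d is encoded as the 6-vector (d, o × d). A minimal sketch under standard pinhole-camera assumptions (the function name, intrinsics `K`, and camera-to-world pose `c2w` are my own illustration, not identifiers from this codebase):

```python
import numpy as np

def plucker_rays(c2w: np.ndarray, K: np.ndarray, h: int, w: int) -> np.ndarray:
    """Per-pixel Plücker ray coordinates (d, o x d) for a pinhole camera.

    c2w: 4x4 camera-to-world pose; K: 3x3 intrinsics.
    Returns an (h, w, 6) array: unit direction then moment.
    Illustrative sketch only, not the repository's implementation.
    """
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # homogeneous pixel centers -> camera-space ray directions
    pix = np.stack([xs + 0.5, ys + 0.5, np.ones_like(xs)], axis=-1)
    dirs = (pix @ np.linalg.inv(K).T) @ c2w[:3, :3].T   # rotate into world space
    dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)
    origin = c2w[:3, 3]                                 # camera center in world
    moment = np.cross(np.broadcast_to(origin, dirs.shape), dirs)
    return np.concatenate([dirs, moment], axis=-1)
```

If the Plücker maps here are consumed only as a per-frame conditioning signal (as in image-to-video camera control), that would be consistent with my reading that generation itself is not autoregressive.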

Thank you.
