
Clarify how fly.mp4 was generated (single image vs. video conditioning) #25

@kaihekaihe

Description

Hi team, thanks for releasing LingBot-World. I have a question about the demo fly.mp4 and the “interactivity / controllable trajectory” claim.

From the README and code, the inference pipeline seems to be image-to-video with optional camera control signals (poses.npy / intrinsics.npy). This suggests the generator only conditions on a single image + prompt + camera trajectory.
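For context, here is roughly how I am reading the conditioning inputs (a minimal sketch; the array layouts and the commented-out pipeline call are my assumptions from the README, not the repo's actual API):

```python
import numpy as np

# Camera-control signals mentioned in the README; the layouts below
# are my assumption (per-frame extrinsics + per-frame intrinsics).
poses = np.load("poses.npy")            # assumed (N, 4, 4) camera-to-world matrices
intrinsics = np.load("intrinsics.npy")  # assumed (N, 3, 3) intrinsic matrices

print(poses.shape, intrinsics.shape)

# My understanding is the generator is then called with a single image
# + prompt + these signals, e.g. (hypothetical entry point):
#   video = pipeline(image="frame0.png", prompt=..., poses=poses,
#                    intrinsics=intrinsics)
```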

However, fly.mp4 appears to show a long trajectory in which the scene changes significantly. If only a single image is used, I would expect content consistency to drift once the camera moves far from the starting view. That makes me wonder:

1. Was fly.mp4 generated using only a single image + camera control signals?
2. If so, which frame was used as the input image (the first frame of the source video, a manually selected frame, etc.)?
3. Was any video conditioning used in addition to camera control (e.g., multi-frame conditioning, V2V, keyframes)?
4. Is there a recommended workflow for long trajectories (e.g., segmenting them into multiple keyframe-anchored chunks, as sketched below)?
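To make question 4 concrete, this is the kind of segmentation I have in mind (a sketch under the assumption that each chunk's last generated frame is reused as the next chunk's conditioning image; `run_chunk` is a hypothetical stand-in for the actual inference call):

```python
import numpy as np

def segment_trajectory(poses: np.ndarray, chunk_len: int = 49, overlap: int = 1):
    """Split a long (N, 4, 4) pose trajectory into overlapping chunks.

    Overlapping by one frame lets each chunk be re-anchored on the
    previous chunk's final frame (a keyframe hand-off).
    """
    chunks = []
    start = 0
    while start < len(poses) - 1:
        end = min(start + chunk_len, len(poses))
        chunks.append(poses[start:end])
        start = end - overlap  # reuse the last pose/frame as the next anchor
    return chunks

# Usage (hypothetical): generate each chunk from the previous chunk's
# last frame instead of always from the original single image.
# image = first_frame
# for chunk in segment_trajectory(np.load("poses.npy")):
#     frames = run_chunk(image=image, poses=chunk)  # run_chunk is hypothetical
#     image = frames[-1]
```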
Any clarification or repro steps would be greatly appreciated. Thanks!
