
Clarify how fly.mp4 was generated (single image vs. video conditioning) #25

@kaihekaihe

Description

Hi team, thanks for releasing LingBot-World. I have a question about the demo fly.mp4 and the “interactivity / controllable trajectory” claim.

From the README and code, the inference pipeline seems to be image-to-video with optional camera control signals (poses.npy / intrinsics.npy). This suggests the generator only conditions on a single image + prompt + camera trajectory.
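For context, here is roughly how I am reading the conditioning inputs (a minimal sketch; the array layouts and the commented-out pipeline call are my assumptions from the README, not the repo's actual API):

```python
import numpy as np

# Camera-control signals mentioned in the README; the layouts below
# are my assumption (per-frame extrinsics + per-frame intrinsics).
poses = np.load("poses.npy")            # assumed (N, 4, 4) camera-to-world matrices
intrinsics = np.load("intrinsics.npy")  # assumed (N, 3, 3) intrinsic matrices

print(poses.shape, intrinsics.shape)

# My understanding is the generator is then called with a single image
# + prompt + these signals, e.g. (hypothetical entry point):
#   video = pipeline(image="frame0.png", prompt=..., poses=poses,
#                    intrinsics=intrinsics)
```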

However, fly.mp4 appears to show a long trajectory in which the scene changes significantly. If only a single image is used, I would expect content consistency to drift once the camera moves far from the starting view. That makes me wonder:

1. Was fly.mp4 generated using only a single image + camera control signals?
2. If so, which frame was used as the input image (the first frame of the source video, a manually selected frame, etc.)?
3. Was any video conditioning used in addition to camera control (e.g., multi-frame conditioning, V2V, keyframes)?
4. Is there a recommended workflow for long trajectories (e.g., segmenting them into multiple keyframe-anchored chunks, as sketched below)?
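To make question 4 concrete, this is the kind of segmentation I have in mind (a sketch under the assumption that each chunk's last generated frame is reused as the next chunk's conditioning image; `run_chunk` is a hypothetical stand-in for the actual inference call):

```python
import numpy as np

def segment_trajectory(poses: np.ndarray, chunk_len: int = 49, overlap: int = 1):
    """Split a long (N, 4, 4) pose trajectory into overlapping chunks.

    Overlapping by one frame lets each chunk be re-anchored on the
    previous chunk's final frame (a keyframe hand-off).
    """
    chunks = []
    start = 0
    while start < len(poses) - 1:
        end = min(start + chunk_len, len(poses))
        chunks.append(poses[start:end])
        start = end - overlap  # reuse the last pose/frame as the next anchor
    return chunks

# Usage (hypothetical): generate each chunk from the previous chunk's
# last frame instead of always from the original single image.
# image = first_frame
# for chunk in segment_trajectory(np.load("poses.npy")):
#     frames = run_chunk(image=image, poses=chunk)  # run_chunk is hypothetical
#     image = frames[-1]
```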
Any clarification or repro steps would be greatly appreciated. Thanks!
