Hi team, thanks for releasing LingBot-World. I have a question about the demo fly.mp4 and the “interactivity / controllable trajectory” claim.
From the README and code, the inference pipeline seems to be image-to-video with optional camera control signals (poses.npy / intrinsics.npy). This suggests the generator only conditions on a single image + prompt + camera trajectory.
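For concreteness, here is the format I assumed for those two files while reading the code; everything below (shapes, per-frame layout, values) is my guess from the file names, not from the repo, so please correct me if the actual format differs:

```python
import numpy as np

num_frames = 81  # assumed clip length, just for illustration

# poses.npy: assumed one camera-to-world extrinsic per frame, shape (N, 4, 4).
poses = np.tile(np.eye(4), (num_frames, 1, 1))
poses[:, 2, 3] = np.linspace(0.0, 5.0, num_frames)  # fly forward along +z

# intrinsics.npy: assumed (fx, fy, cx, cy) per frame, shape (N, 4).
intrinsics = np.tile(np.array([500.0, 500.0, 320.0, 240.0]), (num_frames, 1))

np.save("poses.npy", poses)
np.save("intrinsics.npy", intrinsics)
```

If the expected layout is different (e.g. world-to-camera matrices, or a full 3x3 intrinsic matrix per frame), a note in the README would help a lot.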
However, fly.mp4 appears to show a long trajectory in which the scene changes significantly. If the generator only ever sees one input image, newly revealed content could drift or become inconsistent over long distances. That makes me wonder:
1. Was fly.mp4 generated using only a single image + camera control signals?
2. If so, which frame was used as the input image (first frame of the source video, a manually selected frame, etc.)?
3. Was any video conditioning used in addition to camera control (e.g., multi-frame conditioning, V2V, keyframes)?
4. Is there any recommended workflow for long trajectories (e.g., segmenting into multiple keyframes)?
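To make the keyframe-segmentation question concrete, here is the kind of workflow I had in mind: split the long trajectory into overlapping clips and use the last generated frame of each clip as the input image for the next. This is only a sketch of my idea, not anything from the LingBot-World code:

```python
def split_trajectory(num_frames, clip_len, overlap):
    """Split a long camera trajectory into overlapping clips.

    The last frame of each clip would be re-used as the conditioning
    image (keyframe) for the next clip. Illustrative only -- function
    name and parameters are mine, not from the repo.
    """
    step = clip_len - overlap
    starts = range(0, max(num_frames - overlap, 1), step)
    # Each entry is a (start, end) frame range, end exclusive.
    return [(s, min(s + clip_len, num_frames)) for s in starts]

# e.g. a 200-frame trajectory with 81-frame clips, 1 shared frame between clips
clips = split_trajectory(200, 81, 1)
print(clips)
```

Is something along these lines what you did for fly.mp4, or is there a better-supported path?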
Any clarification or repro steps would be greatly appreciated. Thanks!