Hi, thanks for the great work!
I have a question regarding the real-time performance claim.
From the paper, I saw the following:
- The Generator and Refiner each take ~700 ms, and the VAE requires ~180 ms.
- The dual 17B DiT backbones run at ~0.35 s per chunk (1-NFE) on a single GPU.
Given these timings, does the full pipeline (Generator + Refiner + VAE) achieve real-time streaming on a single GPU in practice, or does it rely on multiple GPUs?
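To make the arithmetic behind this question explicit, here is a rough sketch using the ~700 ms and ~180 ms figures. The two scheduling models are my own assumptions and are not stated in the paper: in the first, the three stages run back to back on one GPU; in the second, each stage runs on its own device in a pipeline.

```python
# Approximate per-chunk latencies quoted above, in milliseconds.
GENERATOR_MS = 700
REFINER_MS = 700
VAE_MS = 180
CHUNK_DURATION_MS = 1000  # one chunk covers 1 second of video


def sequential_latency_ms(stages):
    """Time per chunk if every stage runs back to back on one device (assumption)."""
    return sum(stages)


def pipelined_period_ms(stages):
    """Steady-state time per chunk if each stage has its own device (assumption).

    Throughput is then limited by the slowest stage.
    """
    return max(stages)


def is_real_time(ms_per_chunk, chunk_ms=CHUNK_DURATION_MS):
    """A chunk must be produced at least as fast as it is played back."""
    return ms_per_chunk <= chunk_ms


stages = [GENERATOR_MS, REFINER_MS, VAE_MS]

seq = sequential_latency_ms(stages)
pipe = pipelined_period_ms(stages)
print(f"sequential: {seq} ms per chunk, real-time: {is_real_time(seq)}")
print(f"pipelined:  {pipe} ms per chunk, real-time: {is_real_time(pipe)}")
```

Under these assumptions the sequential case needs about 1580 ms per 1-second chunk, which would not keep up with playback, while the pipelined case is bounded by the slowest stage at about 700 ms. That gap is why I am asking whether the reported result uses one GPU or several.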
Also, the paper says that autoregressive generation is discretized into fixed 1-second chunks at 24 fps. Does one chunk correspond to a latent length of 7 (i.e., roughly 25 pixel frames), or am I misunderstanding this mapping?
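For reference, this is the mapping I had in mind when I arrived at 25 frames. It assumes a causal video VAE with 4x temporal compression, where the first latent decodes to a single frame and every later latent decodes to 4 frames. Both the stride and that first-frame convention are my guesses, not something I found stated in the paper, and `pixel_frames` is just an illustrative helper.

```python
def pixel_frames(latent_len, temporal_stride=4):
    """Pixel frames decoded from `latent_len` latents.

    Assumed convention (not confirmed by the paper): the first latent maps
    to 1 frame, and each subsequent latent maps to `temporal_stride` frames.
    """
    if latent_len < 1:
        raise ValueError("latent_len must be at least 1")
    return 1 + temporal_stride * (latent_len - 1)


FPS = 24
CHUNK_SECONDS = 1
frames_per_chunk = FPS * CHUNK_SECONDS  # frames needed for one chunk of playback

print(frames_per_chunk)   # 24
print(pixel_frames(7))    # 25
print(pixel_frames(6))    # 21
```

With this convention, 7 latents give 25 frames, which is one more than the 24 frames in a 1-second chunk at 24 fps, so I am not sure how the chunk boundary is handled.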
Thanks!