Questions on real-time performance and chunk definition

Hi, thanks for the great work!

I have two questions: one about the real-time performance claim and one about the chunk definition.

From the paper, I saw the following:

- The Generator and Refiner each take ~700 ms, and the VAE requires ~180 ms.
- The dual 17B DiT backbones run at ~0.35 s per chunk (1-NFE) on a single GPU.

Given these timings, does the full pipeline (Generator + Refiner + VAE) achieve real-time streaming on a single GPU in practice, or does it rely on multiple GPUs?
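To make the question concrete, here is a rough back-of-envelope check of the numbers quoted above. It is only a sketch under my own assumptions: that the quoted latencies are per 1-second chunk, that "single GPU" means the three stages run one after another, and that "multiple GPUs" means the stages are pipelined so that throughput is bounded by the slowest stage. The function names are hypothetical and not taken from the paper or the repo.

```python
# Per-chunk stage latencies in milliseconds, as quoted from the paper above.
STAGE_MS = {"generator": 700, "refiner": 700, "vae": 180}

# Assumption: one chunk covers 1 second of video, so this is the real-time budget.
CHUNK_MS = 1000


def sequential_latency_ms(stages):
    """Time per chunk if all stages run back to back on one device."""
    return sum(stages.values())


def pipelined_bottleneck_ms(stages):
    """Time per chunk in steady state if each stage has its own device."""
    return max(stages.values())


def is_real_time(per_chunk_ms, chunk_ms=CHUNK_MS):
    """A stream keeps up only if each chunk is produced within its duration."""
    return per_chunk_ms <= chunk_ms


seq = sequential_latency_ms(STAGE_MS)
pipe = pipelined_bottleneck_ms(STAGE_MS)
print(f"sequential: {seq} ms per chunk, real-time: {is_real_time(seq)}")
print(f"pipelined:  {pipe} ms per chunk, real-time: {is_real_time(pipe)}")
```

Under these assumptions the sequential case needs 1580 ms per 1-second chunk, which would not keep up, while the pipelined case is bounded by 700 ms. If the ~0.35 s figure is the per-backbone cost instead, the sequential total would be lower, which is part of what I am trying to understand.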

Also, the paper mentions that autoregressive generation is discretized into fixed 1-second chunks (24 fps). Does one chunk correspond to a latent length of 7 (i.e., roughly equivalent to 25 pixel frames), or am I misunderstanding this mapping?
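For clarity, this is the mapping I have in mind. It is a sketch that assumes a causal video VAE with 4x temporal compression, where the first latent encodes a single frame and each later latent encodes four frames. Both the stride of 4 and the helper name `pixel_frames` are my assumptions, not something I found stated in the paper.

```python
def pixel_frames(latent_len, temporal_stride=4):
    """Pixel frames decoded from `latent_len` latents.

    Assumes a causal VAE: the first latent maps to 1 frame and every
    following latent maps to `temporal_stride` frames.
    """
    if latent_len < 1:
        raise ValueError("latent_len must be at least 1")
    return 1 + (latent_len - 1) * temporal_stride


frames = pixel_frames(7)
print(f"latent length 7 -> {frames} pixel frames")
print(f"duration at 24 fps: {frames / 24:.3f} s")
```

With these assumptions a latent length of 7 gives 25 frames, which is slightly more than the 24 frames in one second at 24 fps, hence the "roughly" in my question.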

Thanks!
