Algorithm

Kiwhan Song edited this page Feb 11, 2025 · 1 revision

Diffusion Forcing Transformer

algorithms/dfot/dfot_video.py contains the main implementation of the Diffusion Forcing Transformer (DFoT) algorithm for video data, including our proposed training objective and general sampling procedure; please refer to the docstrings in the file for more details. algorithms/dfot/dfot_video_pose.py and algorithms/dfot/dfot_robot.py are specialized versions of DFoT for pose-conditioned video generation and robot imitation learning, respectively. Likewise, you can add your own specialized version of DFoT by creating a new file in the algorithms/dfot directory that inherits from DFoTVideo.
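The subclassing pattern above can be sketched as follows. This is a minimal, self-contained illustration: the `DFoTVideo` class here is a stand-in, and the `_preprocess_batch` hook is a hypothetical method name, not the actual API of algorithms/dfot/dfot_video.py.

```python
class DFoTVideo:
    """Stand-in for the base DFoT algorithm class (illustrative only)."""

    def training_step(self, batch):
        # Shared pipeline: preprocess, then compute the diffusion loss.
        xs = self._preprocess_batch(batch)
        return sum(xs)  # placeholder for the actual training objective

    def _preprocess_batch(self, batch):
        return batch


class DFoTVideoMyTask(DFoTVideo):
    """A specialized variant: override only the hooks your task needs."""

    def _preprocess_batch(self, batch):
        # e.g. inject task-specific conditioning before the shared pipeline
        return [x * 2 for x in batch]


loss = DFoTVideoMyTask().training_step([1, 2, 3])  # 12
```

The key point is that specialized versions reuse the base training and sampling logic and override only task-specific pieces.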

Backbones

We provide three plug-and-play backbones for the Diffusion Forcing Transformer:

  • U-ViT – Recommended for high-resolution pixel-space diffusion models.
  • DiT – Recommended for latent diffusion models.
  • U-Net – Recommended for low-resolution models in data-scarce environments.

VAEs

We provide two types of VAEs for compressing videos into latent space:

  • ImageVAE (algorithms/vae/image_vae) – An image-wise VAE based on the Stable Diffusion VAE.
  • Chunk-wise VideoVAE (algorithms/vae/video_vae) – Processes videos chunk-by-chunk, similar to CausalVideoVAE but without the causal structure; this avoids compressing the entire video at once.
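The chunk-wise idea can be sketched as below. This is an assumption-level illustration, not the repo's implementation: frames are split into fixed-size chunks, each chunk is encoded independently (no causal context carried across chunks), and the per-chunk latents are concatenated along time.

```python
def encode_chunkwise(frames, encode_chunk, chunk_size=8):
    """Encode a long video chunk-by-chunk.

    frames: a sequence of frames (here, plain numbers for illustration).
    encode_chunk: an encoder applied to one chunk at a time, so peak
    memory scales with chunk_size rather than the full video length.
    """
    latents = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        latents.extend(encode_chunk(chunk))  # no cross-chunk context
    return latents


# Toy usage: a fake "encoder" that halves each frame value.
video = list(range(20))
zs = encode_chunkwise(video, lambda chunk: [f / 2 for f in chunk], chunk_size=8)
```

Because each chunk is encoded independently, the same function works for videos of arbitrary length.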

Training and Preprocessing

  • Training VAEs: algorithm={image,video}_vae experiment=video_latent_learning
  • Preprocessing videos into latents with ImageVAE: algorithm=image_vae_preprocessor experiment=video_latent_preprocessing
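The overrides above follow Hydra-style `key=value` syntax. As a sketch, a full invocation might look like the following; note that the entry-point script name (`main.py`) is an assumption, only the `algorithm=...` and `experiment=...` overrides come from this page:

```shell
# Train an image-wise VAE (entry-point name is assumed, not confirmed).
python main.py algorithm=image_vae experiment=video_latent_learning

# Preprocess videos into latents with the trained ImageVAE.
python main.py algorithm=image_vae_preprocessor experiment=video_latent_preprocessing
```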
