Algorithm

Kiwhan Song edited this page Feb 11, 2025 · 1 revision

Diffusion Forcing Transformer

algorithms/dfot/dfot_video.py contains the main implementation of the Diffusion Forcing Transformer (DFoT) algorithm for video data, including our proposed training objective and general sampling procedure; please refer to the docstrings in the file for more details. algorithms/dfot/dfot_video_pose.py and algorithms/dfot/dfot_robot.py are specialized versions of DFoT for pose-conditioned video generation and robot imitation learning, respectively. Likewise, you can add your own specialized version of DFoT by creating a new file in the algorithms/dfot directory that inherits from DFoTVideo.
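The subclassing pattern above can be sketched as follows. This is a minimal, self-contained illustration: the `DFoTVideo` class here is a stand-in, and the `_preprocess_batch` hook is a hypothetical method name, not the actual API of algorithms/dfot/dfot_video.py.

```python
class DFoTVideo:
    """Stand-in for the base DFoT algorithm class (illustrative only)."""

    def training_step(self, batch):
        # Shared pipeline: preprocess, then compute the diffusion loss.
        xs = self._preprocess_batch(batch)
        return sum(xs)  # placeholder for the actual training objective

    def _preprocess_batch(self, batch):
        return batch


class DFoTVideoMyTask(DFoTVideo):
    """A specialized variant: override only the hooks your task needs."""

    def _preprocess_batch(self, batch):
        # e.g. inject task-specific conditioning before the shared pipeline
        return [x * 2 for x in batch]


loss = DFoTVideoMyTask().training_step([1, 2, 3])  # 12
```

The key point is that specialized versions reuse the base training and sampling logic and override only task-specific pieces.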

Backbones

We provide three plug-and-play backbones for the Diffusion Forcing Transformer:

  • U-ViT – Recommended for high-resolution pixel-space diffusion models.
  • DiT – Recommended for latent diffusion models.
  • U-Net – Recommended for low-resolution models in data-scarce environments.

VAEs

We provide two types of VAEs for compressing videos into latent space:

  • ImageVAE (algorithms/vae/image_vae) – An image-wise VAE based on the Stable Diffusion VAE.
  • Chunk-wise VideoVAE (algorithms/vae/video_vae) – Processes videos chunk-by-chunk, similar to CausalVideoVAE but without the causal structure; this avoids compressing the entire video at once.
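The chunk-wise idea can be sketched as below. This is an assumption-level illustration, not the repo's implementation: frames are split into fixed-size chunks, each chunk is encoded independently (no causal context carried across chunks), and the per-chunk latents are concatenated along time.

```python
def encode_chunkwise(frames, encode_chunk, chunk_size=8):
    """Encode a long video chunk-by-chunk.

    frames: a sequence of frames (here, plain numbers for illustration).
    encode_chunk: an encoder applied to one chunk at a time, so peak
    memory scales with chunk_size rather than the full video length.
    """
    latents = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        latents.extend(encode_chunk(chunk))  # no cross-chunk context
    return latents


# Toy usage: a fake "encoder" that halves each frame value.
video = list(range(20))
zs = encode_chunkwise(video, lambda chunk: [f / 2 for f in chunk], chunk_size=8)
```

Because each chunk is encoded independently, the same function works for videos of arbitrary length.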

Training and Preprocessing

  • Training VAEs: algorithm={image,video}_vae experiment=video_latent_learning
  • Preprocessing videos into latents with ImageVAE: algorithm=image_vae_preprocessor experiment=video_latent_preprocessing
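The overrides above follow Hydra-style `key=value` syntax. As a sketch, a full invocation might look like the following; note that the entry-point script name (`main.py`) is an assumption, only the `algorithm=...` and `experiment=...` overrides come from this page:

```shell
# Train an image-wise VAE (entry-point name is assumed, not confirmed).
python main.py algorithm=image_vae experiment=video_latent_learning

# Preprocess videos into latents with the trained ImageVAE.
python main.py algorithm=image_vae_preprocessor experiment=video_latent_preprocessing
```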
