Algorithm
`algorithms/dfot/dfot_video.py` contains the main implementation of the Diffusion Forcing Transformer (DFoT) algorithm for video data, including our proposed training objective and the general sampling procedure; please refer to the docstrings in the file for more details. `algorithms/dfot/dfot_video_pose.py` and `algorithms/dfot/dfot_robot.py` are specialized versions of DFoT for pose-conditioned video generation and robot imitation learning, respectively. Likewise, you can add your own specialized versions of DFoT by creating new files in the `algorithms/dfot` directory that inherit from `DFoTVideo`.
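The specialization pattern above can be sketched as follows. The stub base class and the `prepare_conditioning` hook are illustrative assumptions for this sketch, not the exact interface of `DFoTVideo` — see the docstrings in `algorithms/dfot/dfot_video.py` for the real API.

```python
class DFoTVideo:
    """Stub standing in for the real DFoTVideo base class."""

    def prepare_conditioning(self, batch: dict):
        # Base behavior: no extra conditioning beyond the video frames.
        return None


class DFoTVideoPose(DFoTVideo):
    """Specialized variant conditioned on camera poses (cf. dfot_video_pose.py)."""

    def prepare_conditioning(self, batch: dict):
        # Inject pose information alongside the video frames;
        # the "poses" key is a hypothetical batch field for illustration.
        return batch.get("poses")
```

A new variant only needs to override the hooks whose behavior differs; everything else (training objective, sampling) is inherited.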
We provide three plug-and-play backbones for the Diffusion Forcing Transformer:
- U-ViT – Recommended for high-resolution pixel-space diffusion models.
- DiT – Recommended for latent diffusion models.
- U-Net – Recommended for low-resolution models in data-scarce environments.
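The recommendations above amount to a simple decision rule, sketched here as a helper function (the function and its string labels are illustrative, not part of the repo's config system):

```python
def recommend_backbone(latent_space: bool, data_scarce: bool) -> str:
    """Pick a DFoT backbone per the guidance above (illustrative only)."""
    if latent_space:
        # Latent diffusion models pair best with DiT.
        return "DiT"
    if data_scarce:
        # Low-resolution, data-scarce settings favor U-Net.
        return "U-Net"
    # High-resolution pixel-space diffusion favors U-ViT.
    return "U-ViT"
```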
We provide two types of VAEs for compressing videos into latent space:
- ImageVAE (`algorithms/vae/image_vae`) – An image-wise VAE based on the Stable Diffusion VAE.
- Chunk-wise VideoVAE (`algorithms/vae/video_vae`) – Processes videos chunk-by-chunk, similar to CausalVideoVAE but without the causal structure and without compressing the entire video at once.
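The chunk-wise idea can be sketched in a few lines: split the video along the time axis into fixed-size chunks and encode each independently, so the whole video never has to fit in memory at once. `encode_chunk` and `chunk_size` below are illustrative assumptions, not the repo's actual VAE API.

```python
from typing import Callable, Sequence


def encode_video_chunkwise(frames: Sequence, chunk_size: int,
                           encode_chunk: Callable[[Sequence], list]) -> list:
    """Encode a video chunk-by-chunk along the time axis (illustrative sketch)."""
    latents = []
    for start in range(0, len(frames), chunk_size):
        # Each chunk is encoded independently (no causal structure across chunks).
        latents.extend(encode_chunk(frames[start:start + chunk_size]))
    return latents


# Toy "encoder" standing in for a real VAE: doubles each frame value.
frames = list(range(10))
latents = encode_video_chunkwise(frames, chunk_size=4,
                                 encode_chunk=lambda chunk: [f * 2 for f in chunk])
```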
- Training VAEs: `algorithm={image,video}_vae experiment=video_latent_learning`
- Preprocessing Videos into Latents with ImageVAEs: `algorithm=image_vae_preprocessor experiment=video_latent_preprocessing`
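These are Hydra-style overrides; a minimal sketch of how they might be passed to the launcher is shown below. The `python -m main` entrypoint is an assumption — check the repo's top-level README for the exact launch command.

```shell
# Train the image-wise VAE (assumed entrypoint; swap image_vae for video_vae
# to train the chunk-wise VideoVAE instead).
python -m main algorithm=image_vae experiment=video_latent_learning

# Preprocess videos into latents with a trained ImageVAE.
python -m main algorithm=image_vae_preprocessor experiment=video_latent_preprocessing
```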