❗ This is Luchao's implementation of Over++.
- [2026.02] Release the inference code
- Recommended environment: Python 3.10, CUDA 12.6, torch 2.9.1, diffusers 0.36.0.
- Please check requirements.txt for the dependencies (a setup sketch follows this list).
- [Optional] Install SAM2 by following its official installation instructions.
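A minimal setup sketch, assuming a fresh conda environment; the environment name `overpp` is just a placeholder:

```bash
# Create and activate a fresh environment (Python 3.10 as recommended above)
conda create -n overpp python=3.10 -y
conda activate overpp

# Install the pinned dependencies
pip install -r requirements.txt
```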
Please download the following models. For CogVideoX-Fun-V1.5-5b-InP, you can also use the commands below to download it:

```bash
mkdir -p models/Diffusion_Transformer
cd models/Diffusion_Transformer
git lfs install
git clone https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.5-5b-InP
```

| Model | Download link | Description |
|---|---|---|
| CogVideoX-Fun-V1.5-5b-InP | huggingface | Pre-trained inpainting model containing the VAE, text encoder, transformer, etc. Please follow the instructions in videox_fun and save it to models/Diffusion_Transformer/CogVideoX-Fun-V1.5-5b-InP. |
| Over++ | huggingface | A transformer module fine-tuned from VideoX-Fun's released inpainting model. This is not a full pipeline; it only replaces the transformer weights of the CogVideoX-Fun-V1.5-5b-InP pipeline above. Please download the transformer weights and save them to /PATH/TO/CHECKPOINTS (see the download sketch below). |
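A download sketch using `huggingface-cli`, in case you prefer it over git lfs. The base-model repo id matches the clone URL above; `<OVERPP_HF_REPO>` is a placeholder, replace it with the repository behind the Over++ download link in the table:

```bash
# Base inpainting pipeline (same repo as the git clone above)
huggingface-cli download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP \
    --local-dir models/Diffusion_Transformer/CogVideoX-Fun-V1.5-5b-InP

# Over++ transformer weights (placeholder repo id)
huggingface-cli download <OVERPP_HF_REPO> \
    --local-dir /PATH/TO/CHECKPOINTS
```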
Prepare your input data in the following structure:
```
examples/
├── your-sequence-name/
│   ├── input_video.mp4   # Input video without effects
│   ├── trimask_00.mp4    # Mask with white for effect regions, black for unchanged regions, and gray for unknown regions (can be a fully gray mask video if no specific regions are to be changed)
│   ├── trimask_01.mp4    # Additional mask for inference (optional)
│   └── prompt.json       # Text prompt: {"bg": "A kid in rain boots runs through puddles, sending turbulent water splashing in all directions, with sprays shooting high into the air."}
```
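A minimal sketch for preparing one sequence folder, assuming your source video and trimask already exist; the paths are placeholders and the file names mirror the structure above:

```bash
SEQ=examples/your-sequence-name
mkdir -p "$SEQ"

# Copy your source video and trimask into place
cp /path/to/source.mp4  "$SEQ/input_video.mp4"
cp /path/to/trimask.mp4 "$SEQ/trimask_00.mp4"

# Write the text prompt under the "bg" key
cat > "$SEQ/prompt.json" << 'EOF'
{"bg": "A kid in rain boots runs through puddles, sending turbulent water splashing in all directions, with sprays shooting high into the air."}
EOF
```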
You can run inference on a single GPU with the following example command:

```bash
python inference/cogvideox_fun/predict_v2v.py \
--config.experiment.save_foreground=True \
--config.experiment.save_path="output_temp" \
--config.data.data_rootdir="examples" \
--config.experiment.run_seqs="boy-water,pexles_car_drift" \
--config.experiment.skip_if_exists=False \
--config.data.dilate_width=0 \
--config.video_model.guidance_scale=6 \
    --config.video_model.transformer_path="PATH/TO/CHECKPOINTS/diffusion_pytorch_model.safetensors"
```

Note: `guidance_scale` is the CFG parameter that controls the trade-off between effect generation and content preservation. A higher value results in stronger effect generation but may also alter the original color tone. We recommend using a value between 2 and 20 (default: 6). A more advanced CFG formulation, such as the one used in PickStyle (Video-to-Video Style Transfer), may address this issue in the future.
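For reference, `guidance_scale` is the scale $s$ in the standard classifier-free guidance combination (a generic formulation, not specific to this repo):

$$\hat{\epsilon} = \epsilon_{\text{uncond}} + s\,(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$$

Larger $s$ pushes the prediction further toward the prompt-conditioned branch, which is why stronger effects can come with shifts in the original color tone.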
Due to torch version compatibility issues, you may consider setting `export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` or installing `flash-attn` if you hit OOM errors during inference.
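For example, both mitigations applied before launching inference (each is optional and depends on your setup):

```bash
# Reduce fragmentation-related OOMs in PyTorch's CUDA caching allocator
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Optionally install FlashAttention to lower attention memory usage
pip install flash-attn --no-build-isolation
```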
You can run inference on multiple GPUs with the following example command:

```bash
python inference/helper_inference_multi_gpu.py \
--input_dir 'examples' \
--output_dir 'output/your_output_dir' \
--n_gpus 8 \
--transformer_path 'PATH/TO/Over++/transformer.safetensors' \
    --prompt_guidance_scale 6
```

We construct a diverse training dataset combining paired and unpaired videos to enable effective effect generation while preserving the base model's text-to-video capabilities.
Please refer to ./dataset for more details.
We follow VideoX-Fun's training procedure to fine-tune the CogVideoX-5B inpainting model on 8 NVIDIA A6000 GPUs:

```bash
accelerate launch --use_deepspeed --deepspeed_config_file config/zero_stage2_config.json --deepspeed_multinode_launcher standard scripts/cogvideox_fun/train.py \
--pretrained_model_name_or_path="PATH/TO/models/Diffusion_Transformer/CogVideoX-Fun-V1.5-5b-InP" \
--train_data_meta="PATH/TO/datasets/train-casper/casper.json" \
--image_sample_size=512 \
--video_sample_size=256 \
--token_sample_size=512 \
--video_sample_stride=1 \
--video_sample_n_frames=85 \
--train_batch_size=1 \
--video_repeat=1 \
--gradient_accumulation_steps=1 \
--num_train_epochs=5 \
--checkpointing_steps=1000 \
--learning_rate=2e-05 \
--lr_scheduler="constant_with_warmup" \
--lr_warmup_steps=100 \
--seed=42 \
--gradient_checkpointing \
--mixed_precision="bf16" \
--adam_weight_decay=3e-2 \
--adam_epsilon=1e-10 \
--vae_mini_batch=1 \
--max_grad_norm=0.05 \
--random_hw_adapt \
--training_with_video_token_length \
--random_frame_crop \
--enable_bucket \
--use_came \
--use_deepspeed \
--train_mode="casper" \
--dataloader_num_workers=0 \
--report_to="wandb" \
--trainable_modules "." \
--binarize_mask \
--output_dir="training/PATH/TO/OUTPUT_DIR"We thank the authors of VideoX-Fun, SAM2, and gen-omnimatte for their shared codes and models.
We also appreciate the results from Omnimatte, Omnimatte3D, OmnimatteRF, and OmnimatteZero, as well as the videos on Pexels [1,2,3,4,5,6,7,8,9,10,11,12,13], which were used for fine-tuning Over++.
The views and conclusions contained herein are those of the authors and do not represent the official policies or endorsements of these institutions.
If you find our repo useful for your research, please consider citing our paper:
```bibtex
@misc{qi2025overgenerativevideocompositing,
      title={Over++: Generative Video Compositing for Layer Interaction Effects},
      author={Luchao Qi and Jiaye Wu and Jun Myeong Choi and Cary Phillips and Roni Sengupta and Dan B Goldman},
      year={2025},
      eprint={2512.19661},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.19661},
}
```