A ComfyUI custom node for generating synchronized audio for videos using the HunyuanVideo-Foley model.
- Generate realistic sound effects synchronized with video content
- Support for both video file input and frame batch input from other ComfyUI nodes
- Flexible model selection through UI dropdowns
- Audio output for further processing in ComfyUI workflows
- Optional saving of audio and merged video files
Clone this repository into your ComfyUI custom_nodes folder:

```bash
cd ComfyUI/custom_nodes
git clone https://github.com/railep/ComfyUI-HunyuanVideo-Foley
cd ComfyUI-HunyuanVideo-Foley
```

Install required dependencies:

```bash
pip install -r requirements.txt
```

Download from: https://huggingface.co/tencent/HunyuanVideo-Foley
- Place `hunyuanvideo_foley.pth` in `ComfyUI/models/diffusion_models/`
- Place `foley_vae_128d_48k.pth` in `ComfyUI/models/vae/`
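If you prefer scripting the download, here is a minimal sketch using the `huggingface_hub` package (`pip install huggingface_hub`). The filenames inside the repository are assumed to match the names above — check the repo's file listing if a download fails:

```python
from huggingface_hub import hf_hub_download

# Assumed filenames -- verify them against the repository's file listing.
hf_hub_download(
    repo_id="tencent/HunyuanVideo-Foley",
    filename="hunyuanvideo_foley.pth",
    local_dir="ComfyUI/models/diffusion_models",
)
hf_hub_download(
    repo_id="tencent/HunyuanVideo-Foley",
    filename="foley_vae_128d_48k.pth",
    local_dir="ComfyUI/models/vae",
)
```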
Download from: https://huggingface.co/google/siglip-base-patch16-512
- Create the folder `ComfyUI/models/clip_vision/siglip2-base-patch16-512/`
- Download all files (`config.json`, `model.safetensors`, etc.) into this folder (a scripted download covering both encoders is sketched after the CLAP step below)
Download from: https://huggingface.co/laion/clap-htsat-unfused
- Create the folder `ComfyUI/models/clap/` if it doesn't exist
- Create the subfolder `ComfyUI/models/clap/clap-htsat-unfused/`
- Download all model files into this folder
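Both encoder repositories can also be fetched in one go with `snapshot_download` — a sketch, assuming the target folders described above:

```python
from huggingface_hub import snapshot_download

# SigLIP vision encoder
snapshot_download(
    repo_id="google/siglip-base-patch16-512",
    local_dir="ComfyUI/models/clip_vision/siglip2-base-patch16-512",
)

# CLAP audio-text encoder
snapshot_download(
    repo_id="laion/clap-htsat-unfused",
    local_dir="ComfyUI/models/clap/clap-htsat-unfused",
)
```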
Download from the HunyuanVideo-Foley repository:
- Place `synchformer_state_dict.pth` in `ComfyUI/syncforner/`
Note: Ensure the folder name matches your local setup.
After setup, your directory structure should look like:
```
ComfyUI/
├── models/
│   ├── diffusion_models/
│   │   └── hunyuanvideo_foley.pth
│   ├── vae/
│   │   └── foley_vae_128d_48k.pth
│   ├── clip_vision/
│   │   └── siglip2-base-patch16-512/
│   │       ├── config.json
│   │       └── model.safetensors
│   └── clap/
│       └── clap-htsat-unfused/
│           ├── config.json
│           └── pytorch_model.bin
├── syncforner/
│   └── synchformer_state_dict.pth
└── custom_nodes/
    └── ComfyUI-HunyuanVideo-Foley/
```
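A small sanity check (a sketch; adjust the base path to your installation) that every expected file is in place before starting ComfyUI:

```python
import os

BASE = "ComfyUI"  # adjust to your ComfyUI installation path
EXPECTED = [
    "models/diffusion_models/hunyuanvideo_foley.pth",
    "models/vae/foley_vae_128d_48k.pth",
    "models/clip_vision/siglip2-base-patch16-512/config.json",
    "models/clap/clap-htsat-unfused/config.json",
    "syncforner/synchformer_state_dict.pth",
]

for rel in EXPECTED:
    path = os.path.join(BASE, rel)
    print("OK     " if os.path.exists(path) else "MISSING", path)
```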
Node: Hunyuan Foley: Generate Audio
Inputs:
- `prompt`: Text description for audio generation
- `config_name`: Configuration file selection
- `diffusion_model`: Select diffusion model from dropdown
- `vae_model`: Select VAE model from dropdown
- `clip_vision_model`: Select CLIP vision model from dropdown
- `clap_model`: Select CLAP model from dropdown
- `guidance_scale`: Controls generation quality (default: 4.5)
- `num_inference_steps`: Number of denoising steps (default: 50)
- `save_video`: Save merged video with audio
- `save_audio`: Save generated audio file
- `video_path` (optional): Direct path to a video file
- `video` (optional): Frame batch input from other nodes
- `video_fps`: Frames per second for frame batch input
- `output_dir`: Output directory for saved files
Outputs:
- `audio`: Audio tensor for further processing
- `sample_rate`: Audio sample rate (48000 Hz)
- `audio_wav_path`: Path to the saved audio file (if saved)
- `merged_video_path`: Path to the merged video file (if saved)
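As an example of downstream processing, the audio output could be written to disk from a custom node — a sketch assuming the tensor is laid out as (channels, samples), which is what `torchaudio.save` expects:

```python
import torchaudio

def save_foley_audio(audio, sample_rate, path="foley_out.wav"):
    """Write the generated audio to a WAV file.

    `audio` is assumed to be a float tensor shaped (channels, samples);
    a leading batch dimension, if present, is dropped first.
    """
    if audio.dim() == 3:  # (batch, channels, samples) -> keep the first item
        audio = audio[0]
    torchaudio.save(path, audio.cpu(), sample_rate)
```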
1. Load a video using a Video Loader node.
2. Connect the frame output to the `video` input of the Hunyuan Foley node.
3. Set your audio generation prompt.
4. Configure save options as needed.
5. Run the generation.
Files are saved with automatic numbering as `hunyuan_foley_00001.wav` and `hunyuan_foley_00001.mp4`.
- CUDA-capable GPU recommended (8GB+ VRAM)
- Python 3.8+
- PyTorch 2.0+
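A quick way to confirm the PyTorch and GPU requirements from a Python shell:

```python
import torch

print(torch.__version__)          # expect 2.0 or newer
print(torch.cuda.is_available())  # True if a CUDA-capable GPU is usable
```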
If you encounter permission errors, ensure the CLIP vision folder contains all necessary files and has proper read permissions.
The node requires FFmpeg for video processing. Install it if it is not present:
- Windows: download from https://ffmpeg.org
- Linux: `sudo apt install ffmpeg`
- Mac: `brew install ffmpeg`
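A quick check that FFmpeg is on the PATH before running the node (plain Python, no extra dependencies):

```python
import shutil
import subprocess

ffmpeg = shutil.which("ffmpeg")
if ffmpeg is None:
    print("FFmpeg not found on PATH -- install it as described above.")
else:
    # Print the first line of `ffmpeg -version` to confirm it runs.
    out = subprocess.run([ffmpeg, "-version"], capture_output=True, text=True)
    print(out.stdout.splitlines()[0])
```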
Based on the HunyuanVideo-Foley model by Tencent.
This project follows the licensing terms of the original HunyuanVideo-Foley model. Please refer to the original repository for detailed license information.