PranavSitaraman/Compositional-Video-Gen


Compositional Video Generation

A video generation framework of compositional diffusion models conditioned, via cross-attention, on LLM activations to animate robot trajectories. It aims to transfer language compositionality to a ~0.3B-parameter video model through self-supervised learning, for synthetic data generation and generalizable reasoning.

Configurations

All configurations are compute-matched for fair comparison.

Baselines

  • Config 0: Legacy T5 Encoder (RoboDreamer-style)
    • Text: T5 Encoder (monolithic).
    • Approach: Standard diffusion conditioning.
    • Purpose: Baseline for comparison with traditional approaches.
  • Config 1: LLM Hidden Activation (Improved Baseline)
    • Text: Llama-2-7B (last 3 layers averaged).
    • Approach: Enhanced LLM conditioning via hidden states.
    • Purpose: State-of-the-art non-compositional baseline.
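Config 1's conditioning step can be sketched in a few lines — a minimal illustration with NumPy stand-ins for real Llama-2-7B hidden states (shapes, names, and the random data are assumptions for illustration, not this repo's API):

```python
import numpy as np

def average_last_layers(hidden_states, k=3):
    """Average the last k per-layer hidden states into a single
    conditioning tensor of shape (batch, seq_len, d_model).

    hidden_states: list of (batch, seq_len, d_model) arrays, one per
    transformer layer (33 entries for Llama-2-7B, counting the
    embedding output).
    """
    return np.mean(np.stack(hidden_states[-k:], axis=0), axis=0)

# Toy stand-in for Llama-2-7B activations: 33 layers, d_model = 4096
rng = np.random.default_rng(0)
hidden = [rng.standard_normal((1, 8, 4096)) for _ in range(33)]
cond = average_last_layers(hidden)  # shape (1, 8, 4096)
```

In practice the per-layer activations would come from the LLM itself (e.g. a forward pass that returns all hidden states), with the averaged tensor fed to the diffusion model's cross-attention.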

Compositional Approaches

  • Config 2: Hierarchical Layer-wise Diffusion
    • Text: Llama-2-7B (3 horizontal slices: layers 0-10, 11-21, 22-31).
    • Approach: Components conditioned on different LLM depths (Syntax → Semantic → Reasoning).
    • Purpose: Test if hierarchical language features improve video composition.
  • Config 3: Aspect-wise Diffusion
    • Text: Llama-2-7B (3 layers, shared features).
    • Approach: Each component factorizes a specific video aspect (Spatial + Temporal + Appearance).
    • Purpose: Test if explicit aspect factorization improves OOD generalization.
  • Config 6: Ensemble-based Diffusion
    • Text: Llama-2-7B (3 layers, shared features).
    • Approach: 3 independent cross-attention networks learn diverse specializations with soft fusion.
    • Purpose: Test if weight diversity alone enables better composition.
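The layer slicing behind Config 2 amounts to pooling each horizontal band of the LLM into its own conditioning stream, one per diffusion component. A hedged sketch — the layer ranges come from the list above, but the pooling choice (mean over layers) and all shapes are assumptions:

```python
import numpy as np

# Layer bands from Config 2: syntax / semantic / reasoning
BANDS = {"syntax": (0, 10), "semantic": (11, 21), "reasoning": (22, 31)}

def slice_bands(hidden_states):
    """Mean-pool each band of per-layer activations into one
    conditioning tensor per diffusion component."""
    return {
        name: np.mean(np.stack(hidden_states[lo:hi + 1], axis=0), axis=0)
        for name, (lo, hi) in BANDS.items()
    }

# Toy stand-in for 32 decoder-layer outputs, d_model = 4096
rng = np.random.default_rng(0)
hidden = [rng.standard_normal((1, 8, 4096)) for _ in range(32)]
streams = slice_bands(hidden)  # three (1, 8, 4096) conditioning streams
```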
ID | Approach              | Key Mechanism
---|-----------------------|--------------------------------------
2  | Hierarchical          | LLM layer specialization
3  | Aspect-wise           | Semantic aspect decomposition
6  | Ensemble-based        | Independent multi-perspective learning
1  | LLM Hidden Activation | Enhanced standard conditioning
0  | Legacy                | Baseline
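The "soft fusion" of Config 6 can be pictured as a softmax gate over the three components' noise predictions. This is a sketch under assumptions — the gate values, tensor shapes, and function names are illustrative, not taken from the repo:

```python
import numpy as np

def soft_fusion(eps_components, gate_logits):
    """Softly fuse noise (epsilon) predictions from independent
    components: a softmax gate weights each component's estimate
    and sums them into a single prediction."""
    w = np.exp(gate_logits - gate_logits.max())  # stable softmax
    w = w / w.sum()
    return sum(wi * eps for wi, eps in zip(w, eps_components))

# Toy epsilon predictions from 3 components on a (1, 3, 16, 16) latent
rng = np.random.default_rng(0)
eps = [rng.standard_normal((1, 3, 16, 16)) for _ in range(3)]
fused = soft_fusion(eps, np.array([0.2, 0.5, 0.3]))
```

With equal gate logits this reduces to a plain average; in training the gate would typically be a small learned network so the components can specialize.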

Setup

# Cluster environment
module load python/3.12.5-fasrc01 cuda/11.8.0-fasrc01 cudnn/8.9.2.26_cuda11-fasrc01

# Environment setup
conda create -n rtx python=3.9
conda activate rtx

# Core dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
conda install -c conda-forge spacy diffusers

# Additional tools
pip install git+https://github.com/hassony2/torch_videovision --no-deps
huggingface-cli login

Usage

Training

Use train_rtx.py with the desired configuration ID:

python train_rtx.py --save_id 2 --batch_size 4 --lr 1e-4
sbatch train_run.sh

Adjust --save_id in train_run.sh to switch configurations.

Inference

Use test_rtx.py to generate videos from prompts:

python test_rtx.py \
    --model_path ./results_2/model-latest.pt \
    --text_prompt "robot arm picks up red cube" \
    --save_id 2
sbatch test_run.sh

Repository Structure

  • train_rtx.py: Main entry point for training.
  • test_rtx.py: Main entry point for inference.
  • config.py: Configuration management for all approaches.
  • integrate_compositional.py: Factory and wrapper for compositional models.
  • hierarchical_diffusion.py: Implementation of hierarchical (Config 2).
  • aspect_wise_diffusion.py: Implementation of aspect-wise (Config 3).
  • ensemble_diffusion.py: Implementation of ensemble-based (Config 6).
  • goal_diffusion_rtx.py: Base diffusion implementation (Configs 0-1).
  • datasets_rtx.py: Dataset loading and preprocessing.

Acknowledgments

Extended from RoboDreamer; uses Open X-Embodiment dataset.
