A video-generation framework of compositional diffusion models that cross-attend to LLM activations to animate robot trajectories. The goal is to transfer language compositionality to a ~0.3B-parameter video model, trained self-supervised, for synthetic data generation and generalizable reasoning.
All configurations are compute-matched for fair comparison.
- Config 0: Legacy T5 Encoder (RoboDreamer-style)
  - Text: T5 encoder (monolithic).
  - Approach: Standard diffusion conditioning.
  - Purpose: Baseline for comparison with traditional approaches.
- Config 1: LLM Hidden Activation (Improved Baseline)
  - Text: Llama-2-7B (last 3 layers, averaged).
  - Approach: Enhanced LLM conditioning via hidden states.
  - Purpose: State-of-the-art non-compositional baseline.
- Config 2: Hierarchical Layer-wise Diffusion
  - Text: Llama-2-7B (3 horizontal slices: layers 0-10, 11-21, 22-31).
  - Approach: Components conditioned on different LLM depths (syntax → semantics → reasoning).
  - Purpose: Test whether hierarchical language features improve video composition.
- Config 3: Aspect-wise Diffusion
  - Text: Llama-2-7B (3 layers, shared features).
  - Approach: Each component factorizes a specific video aspect (spatial + temporal + appearance).
  - Purpose: Test whether explicit aspect factorization improves OOD generalization.
- Config 6: Ensemble-based Diffusion
  - Text: Llama-2-7B (3 layers, shared features).
  - Approach: Three independent cross-attention networks learn diverse specializations with soft fusion.
  - Purpose: Test whether weight diversity alone enables better composition.
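The layer-slicing idea behind Config 2 can be sketched as follows. This is an illustrative mock, not the repository's code: random arrays stand in for Llama-2-7B hidden states, and the slice boundaries follow the 0-10 / 11-21 / 22-31 split above.

```python
import numpy as np

# Hypothetical stand-in for Llama-2-7B hidden states:
# 32 transformer layers, each producing (seq_len, hidden_dim) activations.
num_layers, seq_len, hidden_dim = 32, 16, 4096
hidden_states = [np.random.randn(seq_len, hidden_dim) for _ in range(num_layers)]

# Config 2: three horizontal slices, mean-pooled over layers; each slice
# would condition the cross-attention of a different diffusion component.
slices = {
    "syntax":    hidden_states[0:11],    # layers 0-10
    "semantic":  hidden_states[11:22],   # layers 11-21
    "reasoning": hidden_states[22:32],   # layers 22-31
}
conditioning = {name: np.mean(np.stack(layers), axis=0)
                for name, layers in slices.items()}

for name, feat in conditioning.items():
    print(name, feat.shape)  # each pooled slice keeps shape (16, 4096)
```

Mean-pooling over layers is just one plausible aggregation; the actual slice fusion lives in `hierarchical_diffusion.py`.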
| ID | Approach | Key Mechanism |
|---|---|---|
| 2 | Hierarchical | LLM layer specialization |
| 3 | Aspect-wise | Semantic aspect decomposition |
| 6 | Ensemble-based | Independent multi-perspective learning |
| 1 | Optimal LLM | Enhanced standard conditioning |
| 0 | Legacy | Baseline |
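The mapping above can be expressed as a small `--save_id` factory. The sketch below is hypothetical (the names do not match the actual classes in `integrate_compositional.py`); it only illustrates the ID-to-approach dispatch:

```python
# Hypothetical factory mapping configuration IDs to conditioning strategies;
# mirrors the table above, not the real integrate_compositional.py API.
CONFIGS = {
    0: "legacy_t5",      # T5 encoder baseline
    1: "llm_hidden",     # Llama-2 last-3-layer average
    2: "hierarchical",   # layer-wise LLM slices
    3: "aspect_wise",    # spatial/temporal/appearance factorization
    6: "ensemble",       # three independent cross-attention nets
}

def build_model(save_id: int) -> str:
    """Return the conditioning strategy name for a given --save_id."""
    try:
        return CONFIGS[save_id]
    except KeyError:
        raise ValueError(f"Unknown save_id {save_id}; valid: {sorted(CONFIGS)}")

print(build_model(2))  # → hierarchical
```

Note that IDs 4-5 are absent by design, which is why the dispatch rejects them explicitly.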
# Cluster environment
module load python/3.12.5-fasrc01 cuda/11.8.0-fasrc01 cudnn/8.9.2.26_cuda11-fasrc01
# Environment setup
conda create -n rtx python=3.9
conda activate rtx
# Core dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
conda install -c conda-forge spacy diffusers
# Additional tools
pip install git+https://github.com/hassony2/torch_videovision --no-deps
huggingface-cli login

Use train_rtx.py with the desired configuration ID:

python train_rtx.py --save_id 2 --batch_size 4 --lr 1e-4

Or submit via SLURM:

sbatch train_run.sh

Adjust --save_id in train_run.sh to switch configurations.
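A hypothetical `train_run.sh` might look like the following. The SLURM resource values are placeholders, not the repository's actual script; only the module names and the training command come from this README.

```shell
#!/bin/bash
#SBATCH --job-name=rtx_train        # placeholder job name
#SBATCH --gres=gpu:1                # placeholder resource request
#SBATCH --time=24:00:00
#SBATCH --mem=64G

module load python/3.12.5-fasrc01 cuda/11.8.0-fasrc01 cudnn/8.9.2.26_cuda11-fasrc01
conda activate rtx

# Change --save_id here to switch configurations (0, 1, 2, 3, or 6).
python train_rtx.py --save_id 2 --batch_size 4 --lr 1e-4
```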
Use test_rtx.py to generate videos from prompts:
python test_rtx.py \
--model_path ./results_2/model-latest.pt \
--text_prompt "robot arm picks up red cube" \
    --save_id 2

Or submit via SLURM:

sbatch test_run.sh

Key files:

- train_rtx.py: Main entry point for training.
- test_rtx.py: Main entry point for inference.
- config.py: Configuration management for all approaches.
- integrate_compositional.py: Factory and wrapper for compositional models.
- hierarchical_diffusion.py: Implementation of hierarchical diffusion (Config 2).
- aspect_wise_diffusion.py: Implementation of aspect-wise diffusion (Config 3).
- ensemble_diffusion.py: Implementation of ensemble-based diffusion (Config 6).
- goal_diffusion_rtx.py: Base diffusion implementation (Configs 0-1).
- datasets_rtx.py: Dataset loading and preprocessing.
Extended from RoboDreamer; trained on the Open X-Embodiment dataset.