Self-supervised visual representation learning from video. Part of the Zen LM ecosystem.
VJEPA2 implements the Video Joint-Embedding Predictive Architecture (V-JEPA), which learns visual representations from unlabeled video without relying on hand-crafted augmentations.
- Self-supervised learning from video
- No hand-crafted augmentations required
- Pre-trained visual encoder for downstream tasks
- Efficient training with masking strategies
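
To make the masking idea concrete, here is a minimal sketch in plain Python. It is not the VJEPA2 API: the function names `sample_tube_mask` and `split_tokens` are hypothetical, and uniform random "tube" masking is an assumed simplification, since this README does not say which masking strategy the project uses. The sketch hides the same spatial patches in every frame and splits the patch tokens into a visible set (what an encoder would see) and a hidden set (what a predictor would target in representation space).

```python
import random


def sample_tube_mask(grid_h, grid_w, num_frames, mask_ratio=0.75, seed=None):
    """Return a [num_frames][grid_h][grid_w] boolean mask; True marks a hidden patch.

    A "tube" mask hides the same spatial patches in every frame, so a hidden
    patch cannot be recovered by copying it from a neighbouring frame.
    (Assumed strategy for illustration only.)
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(num_patches * mask_ratio)
    # Pick which spatial positions to hide, once, for the whole clip.
    masked = set(rng.sample(range(num_patches), num_masked))
    frame_mask = [
        [(r * grid_w + c) in masked for c in range(grid_w)]
        for r in range(grid_h)
    ]
    # Repeat the same spatial mask for every frame (independent copies).
    return [[row[:] for row in frame_mask] for _ in range(num_frames)]


def split_tokens(mask):
    """Split (frame, row, col) patch indices into visible and hidden lists."""
    visible, hidden = [], []
    for t, frame in enumerate(mask):
        for r, row in enumerate(frame):
            for c, is_hidden in enumerate(row):
                (hidden if is_hidden else visible).append((t, r, c))
    return visible, hidden


mask = sample_tube_mask(grid_h=4, grid_w=4, num_frames=2, mask_ratio=0.75, seed=0)
visible, hidden = split_tokens(mask)
print(len(visible), len(hidden))  # prints "8 24"
```

Only the visible tokens would be passed through the encoder, which is where the efficiency comes from: with a 0.75 mask ratio the encoder processes a quarter of the patches per clip.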
See the LICENSE file.