Xingyilang Yin*1,2, Chengzhengxu Li*3, Jiahao Chang4, Chi-Man Pun1,✉, Xiaodong Cun2,✉
ArXiv | PDF | Model | Dataset
TL;DR: MLLM-4D achieves advanced vision-based spatial-temporal intelligence. Our method focuses on understanding and reasoning about the time-evolving relationships between objects and the camera in 3D space.
```shell
git clone https://github.com/GVCLab/MLLM-4D.git
cd MLLM-4D
```

MLLM-4D is tested with CUDA 12.1/12.8 on H100.

```shell
conda create -n mllm4d python=3.10
conda activate mllm4d
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```

Download the pretrained weights:

```shell
python scripts/download_ckpt_hf.py
```

Run inference:

```shell
# for MLLM-4D-SFT
python scripts/inference.py --model_type "MLLM-4D-SFT" --model_path PATH-to-MLLM-4D-SFT

# for MLLM-4D-RFT
python scripts/inference.py --model_type "MLLM-4D-RFT" --model_path PATH-to-MLLM-4D-RFT
```

- We have completed the code and data cleanup. Release coming soon!
- RFT Stage: Release the `MLLM4D-R1-30k` dataset and `Reinforcement Fine-Tuning` code!
- Cold-Start Phase: Release the `Cold-Start Data` and `Cold-Start Fine-Tuning` code!
- SFT Stage: Release the `MLLM4D-2M` dataset and `Supervised Fine-Tuning` code!
- [2026.02.28] 🔥 Release the `arXiv paper`, `inference demo`, and `pretrained weights`!
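For reference, the inference commands above can be sketched as a small argparse CLI. Only the flag names (`--model_type`, `--model_path`) come from the commands shown here; the `choices` list, help strings, and everything else are assumptions about what `scripts/inference.py` might accept, not the repo's actual implementation.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical sketch of the CLI exposed by scripts/inference.py.

    The flag names mirror the README commands; choices and help text
    are assumptions for illustration only.
    """
    parser = argparse.ArgumentParser(description="MLLM-4D inference (sketch)")
    parser.add_argument(
        "--model_type",
        required=True,
        choices=["MLLM-4D-SFT", "MLLM-4D-RFT"],
        help="Which fine-tuned variant to load.",
    )
    parser.add_argument(
        "--model_path",
        required=True,
        help="Local path to the downloaded checkpoint.",
    )
    return parser


if __name__ == "__main__":
    # Parse a sample command line instead of sys.argv for demonstration.
    args = build_parser().parse_args(
        ["--model_type", "MLLM-4D-SFT", "--model_path", "./ckpts/MLLM-4D-SFT"]
    )
    print(args.model_type, args.model_path)  # MLLM-4D-SFT ./ckpts/MLLM-4D-SFT
```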
If you find MLLM-4D useful, please ⭐ this repo; stars really help open-source projects. Thanks!
Our work is built upon Qwen3-VL, thanks to their invaluable contributions!
If you find the work useful, please consider citing:
