Xingyilang Yin*1,2, Chengzhengxu Li*3, Jiahao Chang4, Chi-Man Pun1,✉, Xiaodong Cun2,✉
ArXiv | PDF | Model | Dataset
TL;DR: MLLM-4D achieves advanced vision-based spatial-temporal intelligence. Our method focuses on understanding and reasoning about the time-evolving relationships between objects and the camera in 3D space.
```shell
git clone https://github.com/GVCLab/MLLM-4D.git
cd MLLM-4D
```

MLLM-4D is tested with CUDA 12.1/12.8 on H100.

```shell
conda create -n mllm4d python=3.10
conda activate mllm4d
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```

Download the pretrained weights:

```shell
python scripts/download_ckpt_hf.py
```

Run inference:

```shell
# for MLLM-4D-SFT
python scripts/inference.py --model_type "MLLM-4D-SFT" --model_path PATH-to-MLLM-4D-SFT

# for MLLM-4D-RFT
python scripts/inference.py --model_type "MLLM-4D-RFT" --model_path PATH-to-MLLM-4D-RFT
```

- We have completed the code and data cleanup. Release coming soon!
- RFT Stage: Release the `MLLM4D-R1-30k` dataset and `Reinforcement Fine-Tuning` code!
- Cold-Start Phase: Release the `Cold-Start Data` and `Cold-Start Fine-Tuning` code!
- SFT Stage: Release the `MLLM4D-2M` dataset and `Supervised Fine-Tuning` code!
- [2026.02.28] 🔥 Release the `arXiv paper`, `inference demo`, and `pretrained weights`!
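For reference, the inference commands above can be sketched as a small argparse CLI. Only the flag names (`--model_type`, `--model_path`) come from the commands shown here; the `choices` list, help strings, and everything else are assumptions about what `scripts/inference.py` might accept, not the repo's actual implementation.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical sketch of the CLI exposed by scripts/inference.py.

    The flag names mirror the README commands; choices and help text
    are assumptions for illustration only.
    """
    parser = argparse.ArgumentParser(description="MLLM-4D inference (sketch)")
    parser.add_argument(
        "--model_type",
        required=True,
        choices=["MLLM-4D-SFT", "MLLM-4D-RFT"],
        help="Which fine-tuned variant to load.",
    )
    parser.add_argument(
        "--model_path",
        required=True,
        help="Local path to the downloaded checkpoint.",
    )
    return parser


if __name__ == "__main__":
    # Parse a sample command line instead of sys.argv for demonstration.
    args = build_parser().parse_args(
        ["--model_type", "MLLM-4D-SFT", "--model_path", "./ckpts/MLLM-4D-SFT"]
    )
    print(args.model_type, args.model_path)  # MLLM-4D-SFT ./ckpts/MLLM-4D-SFT
```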
If you find MLLM-4D useful, please ⭐ this repo; stars really help open-source projects. Thanks!
Our work is built upon Qwen3-VL, thanks to their invaluable contributions!
If you find the work useful, please consider citing:
