
MLLM-4D: Towards Visual-based Spatial-Temporal Intelligence


ArXiv | PDF | Model | Dataset

1 University of Macau, 2 GVC Lab, Great Bay University, 3 Xi’an Jiaotong University, 4 CUHKSZ

TL;DR: MLLM-4D targets visual-based spatial-temporal intelligence: understanding and reasoning about the time-evolving relationships between objects and the camera in 3D space.

Teaser

βš™οΈ Setup

1. Clone MLLM-4D

git clone https://github.com/GVCLab/MLLM-4D.git
cd MLLM-4D

2. Setup environments

MLLM-4D has been tested with CUDA 12.1/12.8 on an H100 GPU.

conda create -n mllm4d python=3.10
conda activate mllm4d 
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
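The flash-attn wheel above is prebuilt for a specific Python (cp310), torch (2.8), and CUDA (cu12) combination, and a mismatched wheel typically fails only at import time. As a quick sanity check, a small stdlib-only sketch can pull the compatibility tags out of the wheel filename (the layout is inferred from the release asset above, not from a flash-attn spec):

```python
import re

def parse_flash_attn_wheel(wheel_name: str) -> dict:
    """Split a flash-attn wheel filename into its compatibility tags.

    Assumed layout (inferred from the release asset above):
    flash_attn-<ver>+cu<cuda>torch<torch>cxx11abi<abi>-<py>-<py>-<platform>.whl
    """
    m = re.match(
        r"flash_attn-(?P<ver>[\d.]+)"
        r"\+cu(?P<cuda>\d+)torch(?P<torch>[\d.]+)cxx11abi(?P<abi>TRUE|FALSE)"
        r"-(?P<py>cp\d+)-cp\d+-(?P<platform>.+)\.whl",
        wheel_name,
    )
    if m is None:
        raise ValueError(f"unrecognized wheel name: {wheel_name}")
    return m.groupdict()

tags = parse_flash_attn_wheel(
    "flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl"
)
# The conda env above pins Python 3.10 and torch 2.8.0, so this wheel's
# cp310 / torch2.8 / cu12 tags line up with that environment.
```

If your environment diverges from the one above (e.g. a different Python minor version), pick the matching wheel from the flash-attention release page rather than this exact file.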

3. Download pretrained models

python scripts/download_ckpt_hf.py

πŸ’« Inference

1. Inference Demo

# for MLLM-4D-SFT
python scripts/inference.py --model_type "MLLM-4D-SFT" --model_path PATH-to-MLLM-4D-SFT
# for MLLM-4D-RFT
python scripts/inference.py --model_type "MLLM-4D-RFT" --model_path PATH-to-MLLM-4D-RFT
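To run both variants back to back, the two commands above can be wrapped in a small driver sketch. The checkpoint directories below are placeholders, not paths the repo guarantees; substitute wherever scripts/download_ckpt_hf.py placed the weights:

```python
import subprocess
from pathlib import Path

# Hypothetical local checkpoint paths -- substitute the directories
# actually created by scripts/download_ckpt_hf.py.
CHECKPOINTS = {
    "MLLM-4D-SFT": "checkpoints/MLLM-4D-SFT",
    "MLLM-4D-RFT": "checkpoints/MLLM-4D-RFT",
}

def build_command(model_type: str) -> list[str]:
    """Assemble the inference command for one model variant,
    mirroring the CLI flags shown above."""
    return [
        "python", "scripts/inference.py",
        "--model_type", model_type,
        "--model_path", CHECKPOINTS[model_type],
    ]

for model_type, path in CHECKPOINTS.items():
    # Only launch the demo when the checkpoint directory is present.
    if Path(path).is_dir():
        subprocess.run(build_command(model_type), check=True)
```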

πŸ“‹ TODO

  • We have completed the code and data cleanup. Release coming soon!
  • RFT Stage: Release the MLLM4D-R1-30k dataset and Reinforcement Fine-Tuning code!
  • Cold-Start Phase: Release the Cold-Start Data and Cold-Start Fine-Tuning code!
  • SFT Stage: Release the MLLM4D-2M dataset and Supervised Fine-Tuning code!
  • [2026.02.28] πŸ”₯ Release the arXiv paper, inference demo, and pretrained weights!

πŸ€— Acknowledgement

If you find MLLM-4D useful, please ⭐ star this repo; stars are important to open-source projects. Thanks!

Our work is built upon Qwen3-VL; thanks to its authors for their invaluable contributions!

πŸ“œ Citation

If you find the work useful, please consider citing:

