Ming Dai, Jian Li, Jiedong Zhuang, Xian Zhang, Wankou Yang*
Southeast University, Tencent, Zhejiang University
- 2025.05.28: The code and model are released.
- 2024.10.10: Our work has been accepted by AAAI 2025.
Multi-task visual grounding involves the simultaneous execution of localization and segmentation in images based on textual expressions. Most advanced methods focus predominantly on transformer-based multimodal fusion, aiming to extract robust multimodal representations. However, ambiguity between referring expression comprehension (REC) and referring image segmentation (RIS) is error-prone, leading to inconsistencies between multi-task predictions. Besides, insufficient multimodal understanding directly contributes to biased target perception. To overcome these challenges, we propose a Coarse-to-fine Consistency Constraints Visual Grounding architecture (C3VG), which integrates implicit and explicit modeling approaches within a two-stage framework. Initially, query and pixel decoders are employed to generate preliminary detection and segmentation outputs, a process referred to as the Rough Semantic Perception (RSP) stage. These coarse predictions are subsequently refined through the proposed Mask-guided Interaction Module (MIM) and a novel explicit bidirectional consistency constraint loss to ensure consistent representations across tasks, which we term the Refined Consistency Interaction (RCI) stage. Furthermore, to address the challenge of insufficient multimodal understanding, we leverage pre-trained models based on visual-linguistic fusion representations. Empirical evaluations on the RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate the efficacy and soundness of C3VG, which outperforms state-of-the-art REC and RIS methods by a substantial margin.
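The bidirectional consistency constraint couples the REC box and the RIS mask so that both branches commit to the same target. As a rough illustration only (this is not the exact loss used in C3VG; the function names, threshold, and normalization below are simplifications of our own), a box-to-mask direction can penalize mask probability that falls outside the predicted box, while a mask-to-box direction can compare the predicted box with the tight box of the thresholded mask:

```python
import torch

def box_to_mask_consistency(mask_prob, box_xyxy):
    # Fraction of predicted mask probability mass that falls outside the
    # predicted box; pushes the segmentation branch toward the detected box.
    h, w = mask_prob.shape
    ys = torch.arange(h, device=mask_prob.device).view(-1, 1).expand(h, w)
    xs = torch.arange(w, device=mask_prob.device).view(1, -1).expand(h, w)
    x1, y1, x2, y2 = box_xyxy
    inside = (xs >= x1) & (xs <= x2) & (ys >= y1) & (ys <= y2)
    return (mask_prob * (~inside).float()).sum() / (mask_prob.sum() + 1e-6)

def mask_to_box_consistency(mask_prob, box_xyxy, thr=0.5):
    # L1 gap between the predicted box and the tight box of the thresholded
    # mask (normalized); only the box receives gradient in this simple form.
    h, w = mask_prob.shape
    fg = mask_prob > thr
    if not fg.any():
        return box_xyxy.sum() * 0.0  # no foreground yet, skip the term
    ys, xs = torch.nonzero(fg, as_tuple=True)
    mask_box = torch.stack([xs.min(), ys.min(), xs.max(), ys.max()]).float()
    scale = box_xyxy.new_tensor([w, h, w, h])
    return torch.abs(mask_box / scale - box_xyxy / scale).mean()

# Example: combine both directions into a single consistency term.
mask_prob = torch.rand(64, 64)
box = torch.tensor([12.0, 8.0, 50.0, 55.0], requires_grad=True)
loss = box_to_mask_consistency(mask_prob, box) + mask_to_box_consistency(mask_prob, box)
loss.backward()
```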
CUDA=11.8 torch=2.0.0 torchvision=0.15.0
pip install -r requirements.txt

Prepare the MSCOCO dataset, then download the mixed annotations here.
The data structure should look like the following:
data
├── annotations
│   └── mixed-seg
│       └── instances_nogoogle.json
└── images
    └── mscoco
        └── train2014
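After arranging the files, a quick sanity check of the layout can save a failed run later. This optional snippet assumes the annotation file is plain JSON; adjust the paths if your layout differs:

```python
import json
import os

root = "data"
ann = os.path.join(root, "annotations", "mixed-seg", "instances_nogoogle.json")
img_dir = os.path.join(root, "images", "mscoco", "train2014")

assert os.path.isfile(ann), f"missing annotation file: {ann}"
assert os.path.isdir(img_dir), f"missing image directory: {img_dir}"

with open(ann) as f:
    anno = json.load(f)
print("annotation entries:", len(anno))
```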
C3VG utilizes the BEiT-3 model as both the backbone and the multi-modality fusion module. The pre-trained weights can be downloaded from this link. Additionally, you will need to download the tokenizer for BEiT-3.
First, create a directory for the pre-trained weights:
mkdir pretrain_weights
Place the BEiT checkpoints and tokenizer within this directory.
The final directory structure of C3VG should resemble the following:
C3VG
├── configs
├── data
├── docs
├── pretrain_weights
├── c3vg
└── tools
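Before training, you can verify that the tokenizer loads. The file name beit3.spm below is an assumption based on the official BEiT-3 release; adjust it to match the file you downloaded:

```python
from transformers import XLMRobertaTokenizer

# BEiT-3 ships a sentencepiece model; the official release names it beit3.spm.
tokenizer = XLMRobertaTokenizer("pretrain_weights/beit3.spm")
print(tokenizer.tokenize("the man in the red shirt on the left"))
```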
We train C3VG on two RTX 4090 GPUs, each with 24 GB of memory. The following script performs the training:
bash tools/dist_train.sh configs/C3VG-Mix.py 2

You can use the following instruction to test all types of models.
bash tools/dist_test.sh configs/C3VG-Mix.py 2 --load-from [PATH_TO_CHECKPOINT_FILE]

Note: Due to the unavailability of the original paper's pretrained weights, we retrained an additional version. The results may exhibit slight variations compared to those reported in the original paper. For reference, we also provide the paper's training logs.
The trained model can be downloaded from this link.
If you want to reproduce the results, download the checkpoint and then run the following script:
bash tools/dist_test.sh [PATH_TO_CONFIG] [GPU_NUMBER] --load-from [PATH_TO_CHECKPOINT_FILE]

The values below are from the retrained model, with the original paper's results given in parentheses.

| Split | DetAcc (paper) | MaskAcc | mIoU (paper) | oIoU (paper) |
|---|---|---|---|---|
| val_refcoco_unc | 92.40 (92.51) | 92.23 | 81.42 (81.37) | 80.95 (80.89) |
| testA_refcoco_unc | 94.81 (94.60) | 94.73 | 82.98 (82.93) | 82.91 (83.18) |
| testB_refcoco_unc | 89.63 (88.71) | 89.10 | 79.86 (79.12) | 79.03 (77.86) |
| val_refcocoplus_unc | 87.21 (87.44) | 87.05 | 77.00 (77.05) | 74.32 (74.68) |
| testA_refcocoplus_unc | 90.59 (90.69) | 90.59 | 79.53 (79.61) | 77.84 (77.96) |
| testB_refcocoplus_unc | 81.61 (81.42) | 81.39 | 72.90 (72.40) | 69.29 (68.95) |
| val_refcocog_umd | 87.85 (87.68) | 85.62 | 76.20 (76.34) | 74.85 (74.43) |
| test_refcocog_umd | 88.19 (88.31) | 87.16 | 77.05 (77.10) | 76.43 (76.39) |
This repository partially builds upon the codebases of SimVG and BEiT-3.
@article{dai2025multi,
title={Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints},
author={Dai, Ming and Li, Jian and Zhuang, Jiedong and Zhang, Xian and Yang, Wankou},
journal={arXiv preprint arXiv:2501.06710},
year={2025}
}
