2Elian/ADT-Net

ADT-Net: Adaptive Transformation-Driven Text-Based Person Search Network for Enhancing Cross-Modal Retrieval Robustness

Abstract

Text-based person search aims to retrieve person images that match a given textual description. The challenge lies in mapping images and textual descriptions into a unified semantic space. This paper introduces ADT-Net, a novel framework designed to address the excessive intra-class variance and insufficient inter-class variance caused by lighting variations. ADT-Net comprises two key modules: Invariant Representation Learning, which employs style-transfer strategies and multi-scale alignment techniques to learn visually invariant features, and Dynamic Matching Alignment, which introduces nonlinear transformations and a learnable dynamic temperature parameter to optimize the prediction distribution. Experimental results on multiple benchmark datasets demonstrate that ADT-Net outperforms current mainstream baseline methods, achieving superior retrieval accuracy and generalization ability. We show that the proposed method significantly enhances the robustness of cross-modal person retrieval, particularly under varying lighting conditions and shooting angles.
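To illustrate the learnable dynamic temperature mentioned above, here is a minimal NumPy sketch (illustrative only, not the repository's implementation; the names `temperature_softmax` and `sims` are assumptions). A lower temperature sharpens the matching distribution over gallery images, while a higher one smooths it:

```python
import numpy as np

def temperature_softmax(similarities: np.ndarray, temperature: float) -> np.ndarray:
    """Softmax over image-text similarity scores, scaled by a temperature.

    A smaller temperature sharpens the prediction distribution (more
    confident matches); in ADT-Net the temperature is learned rather
    than fixed, which is what "dynamic" refers to.
    """
    logits = similarities / temperature
    logits = logits - logits.max()  # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

# Cosine similarities between one text query and three gallery images.
sims = np.array([0.9, 0.7, 0.2])
sharp = temperature_softmax(sims, temperature=0.05)   # low T: near one-hot
smooth = temperature_softmax(sims, temperature=1.0)   # high T: flatter
```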


Usage

Requirements

  • torch: 1.13.1
  • torchvision: 0.14.1
  • transformers: 4.46.3

Prepare Datasets

  1. Download the CUHK-PEDES dataset from here, the ICFG-PEDES dataset from here, and the RSTPReid dataset from here.

  2. Organize them in your dataset root dir folder as follows:

|-- data/
|   |-- <CUHK-PEDES>/
|       |-- imgs
|           |-- cam_a
|           |-- cam_b
|           |-- ...
|       |-- reid_raw.json
|
|   |-- <ICFG-PEDES>/
|       |-- imgs
|           |-- test
|           |-- train
|       |-- ICFG-PEDES.json
|
|   |-- <RSTPReid>/
|       |-- imgs
|       |-- data_captions.json
  3. Install the environment:
conda create -n adtnet python=3.8

conda activate adtnet

pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1+cu117

pip install -r requirements.txt
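Once the datasets are in place, the layout can be sanity-checked with a short script (a minimal sketch; `check_datasets` is not part of this repository, and it only verifies the top-level entries from the tree above):

```python
from pathlib import Path
from typing import List

# Expected annotation file for each dataset, following the tree above.
EXPECTED = {
    "CUHK-PEDES": "reid_raw.json",
    "ICFG-PEDES": "ICFG-PEDES.json",
    "RSTPReid": "data_captions.json",
}

def check_datasets(root: str = "data") -> List[str]:
    """Return a list of problems found under the dataset root (empty = OK)."""
    problems = []
    for name, ann_file in EXPECTED.items():
        dataset_dir = Path(root) / name
        if not (dataset_dir / "imgs").is_dir():
            problems.append(f"{name}: missing imgs/ directory")
        if not (dataset_dir / ann_file).is_file():
            problems.append(f"{name}: missing {ann_file}")
    return problems
```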

Training

python train.py

Result

Here, using the CUHK-PEDES dataset as an example, we compare the performance of our method against other methods.

| Method | ImageEnc | TextEnc | Ref | R@1 | R@5 | R@10 | mAP | mINP |
|--------|----------|---------|-----|-----|-----|------|-----|------|
| SSAN [10] | ResNet50 | LSTM | arXiv'21 | 61.37 | 80.15 | 86.73 | * | * |
| LBUL [14] | ResNet50 | BERT | MM'22 | 64.04 | 82.66 | 87.22 | * | * |
| TIPCB [13] | ResNet50 | BERT | Neuro'22 | 64.26 | 83.19 | 89.10 | * | * |
| MANet [11] | ResNet50 | BERT | TNNLS'23 | 65.64 | 83.01 | 88.78 | * | * |
| ACSA [19] | ResNet50 | BERT | TMM'22 | 68.67 | 85.61 | 90.66 | * | * |
| CFine [12] | CLIP-ViT-B/16 | BERT | TIP'23 | 69.57 | 85.93 | 91.15 | * | * |
| IRRA [16] | CLIP-ViT-B/16 | CLIP-Xformer | CVPR'23 | 73.38 | 89.93 | 93.71 | 66.13 | 50.24 |
| TBPS-CLIP [36] | CLIP-ViT-B/16 | CLIP-Xformer | AAAI'24 | 73.54 | 88.19 | 92.35 | 65.38 | * |
| IRLT [37] | CLIP-ViT-B/16 | CLIP-Xformer | AAAI'24 | 74.46 | 90.19 | 94.01 | * | * |
| RDE [38] | CLIP-ViT-B/16 | CLIP-Xformer | CVPR'24 | 75.94 | 90.14 | 94.12 | 67.56 | 51.44 |
| IRRA + Ours | CLIP-ViT-B/16 | CLIP-Xformer | * | 74.14 | 89.45 | 93.61 | 66.84 | 51.41 |
| TBPS-CLIP + Ours | CLIP-ViT-B/16 | CLIP-Xformer | * | 74.37 | 88.84 | 93.16 | 66.06 | * |
| RDE + Ours | CLIP-ViT-B/16 | CLIP-Xformer | * | 76.46 | 90.40 | 93.91 | 68.13 | 51.98 |
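For reference, the R@K and mAP figures reported above can be computed from each query's ranked gallery as sketched below (a minimal illustration, not the repository's evaluation code; the names `recall_at_k` and `ranked` are assumptions):

```python
import numpy as np

def recall_at_k(ranked_relevance, k):
    """R@K: fraction of queries with a correct match in the top K results.

    ranked_relevance[i] is a 0/1 array over the gallery for query i,
    ordered by the model's similarity score (best match first).
    """
    return float(np.mean([rel[:k].any() for rel in ranked_relevance]))

def mean_average_precision(ranked_relevance):
    """mAP: mean over queries of the average precision at each correct hit."""
    aps = []
    for rel in ranked_relevance:
        hit_ranks = np.flatnonzero(rel)  # 0-based ranks of correct images
        if hit_ranks.size == 0:
            continue                     # skip queries with no match
        precisions = [(i + 1) / (r + 1) for i, r in enumerate(hit_ranks)]
        aps.append(np.mean(precisions))
    return float(np.mean(aps))

# Toy example: two queries over a 5-image gallery (1 = correct identity).
ranked = [np.array([1, 0, 0, 1, 0]), np.array([0, 1, 0, 0, 0])]
```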

Acknowledgments

This is my first SCI paper during my master's studies. I want to thank the authors of the IRRA paper; their work was extremely inspiring and guided my research throughout graduate school. Starting as a complete novice, through tireless effort and continuous learning, I eventually became an algorithm engineer. I sincerely thank all the friends who have helped me along the way, and especially my supervisor for the financial support he provided during my research.
