
🦮 MIRe: Enhancing Multimodal Queries Representation via Fusion-Free Modality Interaction for Multimodal Retrieval

MIRe is a retrieval framework that achieves modality interaction without fusing textual features during alignment. Our method allows the textual query to attend to visual embeddings without feeding text-driven signals back into the visual representations.
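
The block below is only an illustrative sketch of this one-directional interaction: a generic cross-attention layer in which the text tokens act as queries and the (pre-computed) visual tokens act only as keys/values, so the visual representations are never updated. It is not the exact MIRe module.

# Illustrative sketch, not the repository's implementation: text attends to vision,
# vision stays untouched.
import torch
import torch.nn as nn

class QueryToVisionAttention(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens:   (batch, n_text, dim) -- used as attention queries
        # visual_tokens: (batch, n_vis, dim)  -- used only as keys/values; never updated
        attended, _ = self.attn(text_tokens, visual_tokens, visual_tokens)
        # Residual update on the textual side only; nothing flows back to the visual tokens.
        return self.norm(text_tokens + attended)

# toy usage with random embeddings
text = torch.randn(2, 16, 768)    # e.g. query token embeddings
vision = torch.randn(2, 50, 768)  # e.g. CLIP patch embeddings
print(QueryToVisionAttention()(text, vision).shape)  # torch.Size([2, 16, 768])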

Web demo

(demo preview)

Install

Python 3.10

conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install https://github.com/kyamagu/faiss-wheels/releases/download/v1.7.3/faiss_gpu-1.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
pip install pylate
pip install -r requirements.txt

Settings

  1. Download the pre-trained ColBERTv2 checkpoint to ckpts and unzip the downloaded file to ckpts/colbertv2.0.

    We employ ColBERTv2 as a text retriever baseline.

  2. Install python packages.

    pip3 install -r requirements.txt
  3. Download downstream task datasets.

    We use four retrieval datasets curated from OK-VQA, ReMuQ, and E-VQA. You can download them from the following links:

    • OK-VQA (Wiki-11M): Download the annotation files to data/okvqa.

    • OK-VQA (Google Search): In this dataset, the questions in the annotation files include image captions, so we edit the questions to remove them. See dataset/vqa_ret.py for details (an illustrative sketch follows this list).

    • ReMuQ

    • E-VQA
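
For the OK-VQA (Google Search) pre-processing mentioned above, the authoritative logic is in dataset/vqa_ret.py; the snippet below is only a hypothetical illustration that assumes the caption is appended after the question mark, which may not match the actual annotation format.

# Hypothetical illustration only; the real pre-processing lives in dataset/vqa_ret.py.
# Assumption for this sketch: the caption follows the first '?' of the question string.
def strip_caption(question_with_caption: str) -> str:
    head, sep, _caption = question_with_caption.partition("?")
    return (head + sep).strip() if sep else question_with_caption.strip()

print(strip_caption("What breed is the dog? a brown dog lying on a couch"))
# -> "What breed is the dog?"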

Our pre-trained checkpoints can be downloaded from the HuggingFace Hub.

Dataset Construction via Response-to-Passage Conversion: ViD2R

  1. Download instruction data and image datasets from the following pages:

  2. Pre-processing and neural filtering using a text retriever:

    You can skip the neural filtering step by modifying the code if you want to build the dataset quickly.

    python3 -m runs.neural_filtering --data_paths path_to_data1 path_to_data2 --colbert_ckpt [directory_with_colbert_checkpoint] --save_path [path_to_save]
  3. Converting responses to passages:

    We require a knowledge base (KB) and a text retriever to convert dialogues into retrieval tasks. We adopt 6M Wikipedia passages as the KB; you can download the passages from this link. An illustrative sketch of the conversion is given at the end of this section.

    python3 -m runs.convert_tasks --data_path [path to pre-processed data] --colbert_ckpt [directory with colbert checkpoint] --db_pool [path to KB] --save_path data/vid2r/ViD2R.json

Please check scripts/make_pretraining_data as an example.
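
As a rough mental model of the response-to-passage conversion, the sketch below searches the KB with each response and keeps the best-matching passage as the retrieval target. It is only an assumption-laden illustration: the toy word-overlap scorer stands in for the ColBERTv2 retriever actually used by runs.convert_tasks, and the field names are made up.

# Minimal, illustrative sketch of response-to-passage conversion (not the repo's code).
def toy_search(query: str, passages: list[str]) -> tuple[int, float]:
    """Return (best_passage_index, score) by naive word overlap (stand-in retriever)."""
    q = set(query.lower().split())
    scores = [len(q & set(p.lower().split())) / (len(q) or 1) for p in passages]
    best = max(range(len(passages)), key=scores.__getitem__)
    return best, scores[best]

def convert_to_retrieval_tasks(dialogues: list[dict], kb_passages: list[str]) -> list[dict]:
    tasks = []
    for ex in dialogues:  # ex: {"image": ..., "question": ..., "response": ...}
        idx, score = toy_search(ex["response"], kb_passages)
        tasks.append({
            "image": ex["image"],
            "query": ex["question"],            # multimodal query: image + question text
            "positive_passage": kb_passages[idx],
            "score": score,                      # could be thresholded for filtering
        })
    return tasks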

Training MIRe

First, set up the configuration files.

Pre-training MIRe on the ViD2R dataset

export WANDB_API_KEY=[Your_WANDB_KEY]
CONFIG_PATH=cfgs/mire_train_vid2r.yaml

export CUDA_VISIBLE_DEVICES=0,1,2,3
NPROC=4

# Caching visual embeddings
python3 -m runs.run_visual_embedder --model_name openai/clip-vit-base-patch32 --data_path data/vid2r/ViD2R.json --batch_size 512 --image_dir data/vid2r/images

# Run training command
python3 -m torch.distributed.run --nproc_per_node=$NPROC train_retrieval.py --config_path "$CONFIG_PATH"

Alternatively:

After modifying the shell script scripts/pretrain_mire_inbatch.sh to match your paths, execute the following command:

bash scripts/pretrain_mire_inbatch.sh

This script also evaluates zero-shot performance after training. If indexing has already been done, comment out the run_indexer step.

Fine-tuning MIRe on downstream tasks

If you want to cache visual embeddings, set image_cached to True in the config file and execute runs/run_visual_embedder.py after slightly modifying the code.

export WANDB_API_KEY=[Your_WANDB_KEY]
CONFIG_PATH=[Path_to_config_file]

export CUDA_VISIBLE_DEVICES=0,1,2,3
NPROC=4

# Run training command
python3 -m torch.distributed.run --nproc_per_node=$NPROC train_retrieval.py --config_path "$CONFIG_PATH"

Pre-Indexing

If you do not pass --mire_ckpt, the following command loads the ColBERTv2 checkpoint instead.

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m runs.run_indexer --exp_name [experiment_name] --n_bits [2|4|8] --dataset_name [okvqa|okvqa_gs|infoseek] --all_blocks_file [path to knowledge base] --mire_ckpt $CHECKPOINT

Evaluation

If you do not provide --mire_ckpt and --image_dir, the following command uses the text-only retriever (ColBERTv2).

python3 -m runs.evaluate_retrieval \
    --dataset_name [okvqa|okvqa_gs|infoseek] \
    --index_name [experiment_name].nbits=[n_bits] \
    --save_path [path to save result file] \
    --all_blocks_file [path to knowledge base] \
    --anno_file [path to test file] \
    --mire_ckpt $CHECKPOINT \
    --image_dir [directory to images]
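
For reference, ColBERT-style retrievers (including the ColBERTv2 baseline used here) score a query against a passage with late interaction, i.e. MaxSim over token embeddings. A minimal, repository-independent sketch:

# Late-interaction (MaxSim) scoring: every query token embedding is matched to its most
# similar passage token embedding, and the similarities are summed. Toy example with
# random, L2-normalized embeddings; not tied to this repository's code.
import torch

def maxsim_score(query_emb: torch.Tensor, passage_emb: torch.Tensor) -> torch.Tensor:
    # query_emb:   (n_query_tokens, dim)
    # passage_emb: (n_passage_tokens, dim)
    sim = query_emb @ passage_emb.T          # (n_query_tokens, n_passage_tokens)
    return sim.max(dim=1).values.sum()       # MaxSim per query token, then sum

q = torch.nn.functional.normalize(torch.randn(32, 128), dim=-1)
p = torch.nn.functional.normalize(torch.randn(180, 128), dim=-1)
print(maxsim_score(q, p))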

Web Demo

Settings

pip3 install flask flask_cors
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.1/install.sh | bash
nvm install --lts

Web Server Start

cd searcher
npm install
npm start

Search Engine Start

Before starting the engine, check the checkpoint path in search_api.py.

python3 search_api.py

About

The code for MIRe: Enhancing Multimodal Queries Representation via Fusion-Free Modality Interaction for Multimodal Retrieval (ACL 2025)
