yshinya6/red


Overview

Large vision-language models (LVLMs) have demonstrated remarkable capabilities by integrating pre-trained vision encoders with large language models (LLMs). Similar to single-modal LLMs, chain-of-thought (CoT) prompting has been adapted for LVLMs to enhance multi-modal reasoning by generating intermediate rationales based on visual and textual inputs. While CoT is assumed to improve grounding and accuracy in LVLMs, our experiments reveal a key challenge: existing LVLMs often ignore the contents of generated rationales in CoT reasoning. To address this, we re-formulate multi-modal CoT reasoning as a KL-constrained reward maximization focused on rationale-conditional log-likelihood. As the optimal solution, we propose rationale-enhanced decoding (RED), a novel plug-and-play inference-time decoding strategy. RED harmonizes visual and rationale information by multiplying distinct image-conditional and rationale-conditional next-token distributions. This code repository provides a minimal Python implementation of RED and an experimental evaluation on GQA.
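Concretely, RED combines the image-conditional distribution $p(y|x,q)$ and the rationale-conditional distribution $p(y|x,q,r)$ at each decoding step. Below is a minimal sketch of the combination step in log space; the product form $p_{\mathrm{RED}}(y) \propto p(y|x,q)\,p(y|x,q,r)^{\lambda}$ and the function names are our assumptions (inferred from the $\lambda=0$ note below), not the repository's exact code.

import torch
import torch.nn.functional as F

def red_combine(image_cond_logits: torch.Tensor,
                rationale_cond_logits: torch.Tensor,
                lam: float = 1.0) -> torch.Tensor:
    # Combine the two next-token distributions in log space:
    #   log p_RED(y) = log p(y|x,q) + lam * log p(y|x,q,r) + const.
    # With lam = 0 this reduces to the image-conditional p(y|x,q).
    log_p_img = F.log_softmax(image_cond_logits, dim=-1)      # log p(y|x,q)
    log_p_rat = F.log_softmax(rationale_cond_logits, dim=-1)  # log p(y|x,q,r)
    return F.log_softmax(log_p_img + lam * log_p_rat, dim=-1) # renormalize

# Greedy decoding would then pick:
#   next_token = red_combine(img_logits, rat_logits, lam=1.0).argmax(dim=-1)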

Notes

[2026/4/1] We observed a significant performance degradation of Qwen2.5-VL-7B with the latest transformers version, for reasons we have not yet identified. Please use Qwen3-VL-8B instead of Qwen2.5-VL-7B for the demonstration.

[2026/4/17] We observed a discrepancy between theory and experiment regarding $\lambda$. While $\lambda=0$ should theoretically yield the same result as the image-conditional $p(y|x,q)$, its performance is significantly worse than that of the Direct baseline (which also uses $p(y|x,q)$). We suspect that this discrepancy is caused by numerical errors in the logit calculations introduced by the batch-parallel implementation we added after acceptance to improve speed, which causes the distribution to diverge. Please note that the experiments in the main paper compute the image- and rationale-conditional logits serially, not in batch parallel. We plan to publish the serial version as soon as possible.
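For reference, the serial computation amounts to two independent forward passes. The sketch below uses the standard Hugging Face forward API; the variable names are hypothetical (image_inputs holds the image-plus-question prompt, and rationale_inputs additionally appends the generated rationale).

import torch

@torch.no_grad()
def serial_conditional_logits(model, image_inputs, rationale_inputs):
    # Run the two prompts one at a time so they are never padded into a
    # shared batch; this sidesteps the batch-parallel numerical issue above.
    image_logits = model(**image_inputs).logits[:, -1, :]          # for p(y|x,q)
    rationale_logits = model(**rationale_inputs).logits[:, -1, :]  # for p(y|x,q,r)
    return image_logits, rationale_logits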

Requirements

Middleware Requirements

  • CUDA >= 12.3

Python Requirements

  • Run pip install -r requirements.txt

Preparations

Evaluation Dataset: GQA

    1. Download input images from here
    2. Extract and place images in data/gqa/images (a sketch of this step follows the list)
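A minimal sketch of step 2. The archive name images.zip is an assumption; adjust it to match the downloaded file, and flatten any nested directory the archive creates.

import zipfile
from pathlib import Path

archive = Path("images.zip")      # hypothetical name of the downloaded archive
target = Path("data/gqa/images")
target.mkdir(parents=True, exist_ok=True)

with zipfile.ZipFile(archive) as zf:
    zf.extractall(target)         # images should end up directly under target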

Example with Qwen3-VL-8B

bash experiments/01_benchmarks/qwen3-vl-8b/gqa/red.sh

Citation

@inproceedings{Yamaguchi_CVPR26_RED,
  title={Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought},
  author={Yamaguchi, Shin'ya and Nishida, Kosuke and Chijiwa, Daiki},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}

Acknowledgement

We thank Xie Yumu for the detailed bug reports on the logit computation, which helped us identify the theory-implementation discrepancy.
