Large vision-language models (LVLMs) have demonstrated remarkable capabilities by integrating pre-trained vision encoders with large language models (LLMs). Similar to single-modal LLMs, chain-of-thought (CoT) prompting has been adapted for LVLMs to enhance multi-modal reasoning by generating intermediate rationales based on visual and textual inputs. While CoT is assumed to improve grounding and accuracy in LVLMs, our experiments reveal a key challenge: existing LVLMs often ignore the contents of generated rationales in CoT reasoning. To address this, we re-formulate multi-modal CoT reasoning as a KL-constrained reward maximization focused on rationale-conditional log-likelihood. As the optimal solution, we propose rationale-enhanced decoding (RED), a novel plug-and-play inference-time decoding strategy. RED harmonizes visual and rationale information by multiplying distinct image-conditional and rationale-conditional next-token distributions. This code repository provides a minimal Python implementation of RED and experimental evaluations on GQA.
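The product rule above can be sketched in a few lines of PyTorch: RED combines the image-conditional and rationale-conditional next-token distributions by multiplying their probabilities, i.e. adding their log-probabilities. The sketch below is only an illustration of that idea, not the implementation shipped in this repository; the `alpha` weighting knob and the toy logits are hypothetical additions for demonstration.

```python
import torch
import torch.nn.functional as F

def red_next_token_logits(image_logits: torch.Tensor,
                          rationale_logits: torch.Tensor,
                          alpha: float = 1.0) -> torch.Tensor:
    """Combine the image-conditional and rationale-conditional next-token
    distributions by multiplying them (adding log-probabilities).
    Returns an unnormalized log-distribution over the vocabulary."""
    image_logprobs = F.log_softmax(image_logits, dim=-1)
    rationale_logprobs = F.log_softmax(rationale_logits, dim=-1)
    # log p_image + alpha * log p_rationale; alpha = 1.0 is the plain product.
    return image_logprobs + alpha * rationale_logprobs

# Toy usage with random logits over a vocabulary of size 8.
torch.manual_seed(0)
img_logits = torch.randn(1, 8)  # from a forward pass conditioned on the image
rat_logits = torch.randn(1, 8)  # from a forward pass conditioned on the rationale
combined = red_next_token_logits(img_logits, rat_logits)
print(combined.argmax(dim=-1).item())  # greedy next token under the product
```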
## News
- [2026/4/1] We observed a significant performance degradation of Qwen2.5-VL-7B with the latest `transformers` version for unknown reasons. Please use Qwen3-VL-8B instead of Qwen2.5-VL-7B for the demonstration (a loading sketch is given after this list).
- [2026/4/17] We observed a discrepancy between theory and experiment regarding the logit computation; the implementation has been fixed (see Acknowledgments).
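For the model swap noted in the 2026/4/1 entry, the following is a minimal loading sketch. It assumes a recent `transformers` release that exposes Qwen3-VL through `AutoModelForImageTextToText`, that the Hub checkpoint is named `Qwen/Qwen3-VL-8B-Instruct`, and that `accelerate` is installed for `device_map="auto"`; none of this is part of the repository's scripts.

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen3-VL-8B-Instruct"  # assumed checkpoint name on the Hugging Face Hub
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype="auto",  # use the checkpoint's native precision
    device_map="auto",   # requires `accelerate`
)
```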
## Requirements
- CUDA >= 12.3
## Installation
- Run `pip install -r requirements.txt`
## Dataset
- Download input images from here
- Extract and place the images in `data/gqa/images` (a quick sanity-check sketch follows this list)
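As an optional check that the images landed where the evaluation scripts expect them, the snippet below counts files under the stated path. It assumes the GQA images are `.jpg` files; it is not part of the repository.

```python
from pathlib import Path

image_dir = Path("data/gqa/images")              # path stated in the step above
num_images = len(list(image_dir.glob("*.jpg")))  # GQA ships JPEG images
print(f"Found {num_images} images under {image_dir}")
```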
## Run
- Run `bash experiments/01_benchmarks/qwen3-vl-8b/gqa/red.sh`

## Citation
```bibtex
@inproceedings{Yamaguchi_CVPR26_RED,
title={Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought},
author={Yamaguchi, Shin'ya and Nishida, Kosuke and Chijiwa, Daiki},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2026}
}
```

## Acknowledgments
We thank Xie Yumu for the detailed report of a bug in the logit computation, which helped us identify the theory-implementation discrepancy.
