AAAI 2026

CMID: Towards Medical Visual Question Answering via Contrastive Mutual Information Decoding

Conference Paper · AAAI Technical Track on Natural Language Processing VI

Abstract

Medical Visual Question Answering (Med-VQA) aims to generate accurate answers for clinical questions grounded in medical images, which has attracted increasing research attention due to its potential to streamline diagnostics and reduce clinical burden. Recent advances in Large Vision-Language Models (LVLMs) have shown great promise for Med-VQA, but still suffer from two inference-time issues: (1) attention shift, where the LVLM over-relies on textual priors; and (2) attention dispersion, where it fails to focus on critical diagnostic regions. To tackle these issues, we propose Contrastive Mutual Information Decoding (CMID), a training-free inference-time intervention grounded in information theory for Med-VQA. Concretely, CMID first identifies the Principal Focus Area (PFA) from decoder attention maps, then constructs focus-preserving and focus-excluding views to derive dual contrastive signals that simultaneously amplify salient visual cues and suppress background noise. Crucially, these corrective signals are adaptively scaled by a reliability-gated self-correction mechanism, based on the distributional shift induced by the PFA. Extensive experiments on three Med-VQA benchmarks demonstrate the effectiveness of CMID. Further analyses showcase its robust generalizability across diverse medical architectures and tasks.
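The abstract describes CMID as a training-free, inference-time combination of two contrastive signals (from focus-preserving and focus-excluding views of the image) gated by the distributional shift the Principal Focus Area induces. The paper's exact formulation is not given here, so the following is only a minimal sketch of one plausible instantiation: the function names, the KL-based reliability gate, and the `alpha`/`beta` weights are all assumptions for illustration, not the authors' method.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D logit vector
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def cmid_step(logits_full, logits_focus, logits_excl, alpha=1.0, beta=1.0):
    """Sketch of a CMID-style contrastive decoding step (hypothetical form).

    logits_full  -- next-token logits from the unmodified image
    logits_focus -- logits from the focus-preserving view (PFA kept)
    logits_excl  -- logits from the focus-excluding view (PFA masked out)
    """
    p_full = softmax(logits_full)
    p_focus = softmax(logits_focus)

    # Reliability gate: measure how strongly the PFA shifts the output
    # distribution (KL divergence); a larger shift yields a stronger
    # correction. The exact gating function here is an assumption.
    kl = np.sum(p_focus * np.log((p_focus + 1e-12) / (p_full + 1e-12)))
    gate = 1.0 - np.exp(-kl)  # squashed into [0, 1)

    # Amplify cues from the focus view, suppress cues attributable to
    # the background (focus-excluding) view.
    corrected = logits_full + gate * (
        alpha * (logits_focus - logits_full)
        - beta * (logits_excl - logits_full)
    )
    return corrected, gate

# Toy example: token 0 is favored by the focus view, token 1 by background.
full = np.array([1.0, 0.0, 0.0])
focus = np.array([3.0, 0.0, 0.0])
excl = np.array([0.0, 2.0, 0.0])
corrected, gate = cmid_step(full, focus, excl)
```

In this toy run, the gate stays below 1 and the correction pushes probability mass toward the token supported by the focused view while penalizing the background-driven token.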


Keywords

No keywords are indexed for this paper.

Context

Venue
AAAI Conference on Artificial Intelligence
Archive span
1980-2026
Indexed papers
28718
Paper id
780422984189715318