AAAI 2026
Is Your (Reasoning) Multimodal Language Model Vulnerable Toward Distractions?
Abstract
Vision-Language Models (VLMs) have achieved success in tasks such as visual question answering, yet their resilience to distractions remains underexplored. Understanding how distractions affect VLM performance is crucial for real-world applications, where input data often contains noisy or irrelevant content. This paper assesses the robustness of VLMs, including both general-purpose models and those specialized for reasoning, against distractions in the context of science question answering. We introduce I-ScienceQA, a new benchmark based on the ScienceQA dataset that systematically injects distractions into both visual and textual contexts. We evaluate how distractions perturb the models' underlying reasoning processes by analyzing changes in the textual explanations that lead to their answers. Our findings show that most VLMs are vulnerable to distractions, with noticeable degradation in reasoning when extraneous content is present, although some models (including GPT-o4 mini) exhibit a higher degree of robustness. We also observe that textual distractions generally cause greater performance declines than visual distractions. Finally, we explore mitigation strategies such as prompt engineering; although these strategies modestly improve resilience, our analysis highlights considerable room for further improvement in the robustness of VLMs.
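As a rough illustration of what injecting a textual distraction into a ScienceQA-style sample might look like, here is a minimal Python sketch. The `inject_textual_distraction` helper, the field names, and the distractor pool are hypothetical assumptions for illustration only, not the I-ScienceQA benchmark's actual implementation.

```python
import random

# Illustrative pool of irrelevant sentences; the real benchmark's
# distractors are not specified in the abstract (assumption).
DISTRACTORS = [
    "The cafeteria serves pizza every Friday.",
    "A blue car was parked outside the laboratory.",
]

def inject_textual_distraction(sample: dict, seed: int = 0) -> dict:
    """Return a copy of the sample with an irrelevant sentence
    appended to its textual context."""
    rng = random.Random(seed)  # seeded for reproducible perturbations
    distractor = rng.choice(DISTRACTORS)
    perturbed = dict(sample)  # shallow copy; leave the original intact
    perturbed["context"] = f"{sample['context']} {distractor}"
    return perturbed

# Usage with a hypothetical ScienceQA-style sample:
sample = {
    "question": "Which property do these objects have in common?",
    "context": "Look at each object. A magnet attracts iron.",
    "choices": ["hard", "magnetic", "fragile"],
}
print(inject_textual_distraction(sample)["context"])
```

Seeding the perturbation makes the injected distractor deterministic per sample, so model comparisons across the clean and perturbed variants stay controlled.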
Authors
Keywords
No keywords are indexed for this paper.
Context
- Venue: AAAI Conference on Artificial Intelligence
- Archive span: 1980-2026
- Indexed papers: 28718
- Paper id: 314199006690510486