Visual Structures Help Visual Reasoning: Addressing the Binding Problem in LVLMs

Amirmohammad Izadi; Mohammadali Banayeeanzade; Fatemeh Askari; Ali Rahimiakbar; Mohammad Vahedi; Hosein Hasani; Mahdieh Baghshah

Back to NeurIPS

NeurIPS 2025

Visual Structures Help Visual Reasoning: Addressing the Binding Problem in LVLMs

Conference Paper Main Conference Track Artificial Intelligence · Machine Learning

PDF Details

Abstract

Despite progress in Large Vision-Language Models (LVLMs), their capacity for visual reasoning is often limited by the binding problem: the failure to reliably associate perceptual features with their correct visual referents. This limitation underlies persistent errors in tasks such as counting, visual search, scene description, and spatial relationship understanding. A key factor is that current LVLMs process visual features largely in parallel, lacking mechanisms for spatially grounded, serial attention. This paper introduces Visual Input Structure for Enhanced Reasoning (VISER), a simple, effective method that augments visual inputs with low-level spatial structures and pairs them with a textual prompt that encourages sequential, spatially-aware parsing. We empirically demonstrate substantial performance improvements across core visual reasoning tasks, using only a single-query inference. Specifically, VISER improves GPT-4o performance on visual search, counting, and spatial relationship tasks by 25. 0%, 26. 8%, and 9. 5%, respectively, and reduces edit distance error in scene description by 0. 32 on 2D datasets. Furthermore, we find that the visual modification is essential for these gains; purely textual strategies, including Chain-of-Thought prompting, are insufficient and can even degrade performance. VISER underscores the importance of visual input design over purely linguistically based reasoning strategies and suggests that visual structuring is a powerful and general approach for enhancing compositional and spatial reasoning in LVLMs.

Visual Structures Help Visual Reasoning: Addressing the Binding Problem in LVLMs

Abstract

Authors

Keywords

Context