
AAAI 2026

Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs

Conference Paper AAAI Technical Track on Computer Vision XI Artificial Intelligence

Abstract

Multi-view understanding, the ability to reconcile visual information across diverse viewpoints for effective navigation, manipulation, and 3D scene comprehension, is a fundamental challenge for Multi-Modal Large Language Models (MLLMs) used as embodied agents. While recent MLLMs have shown impressive advances in high-level reasoning and planning, they frequently fall short when confronted with multi-view geometric consistency and cross-view correspondence. To comprehensively evaluate the challenges MLLMs face in multi-view scene reasoning, we introduce All-Angles Bench, a carefully human-annotated benchmark with over 2,100 question-answer pairs from 90 diverse, real-world scenes. Our broad evaluation across 38 general-purpose and 3D spatial reasoning MLLMs reveals a substantial performance gap compared to humans. More critically, our analysis identifies two root failure modes: (1) cross-view object mismatch—the inability to establish consistent object correspondence across views; and (2) cross-view spatial misalignment—the failure to infer accurate camera poses and spatial layouts. These findings underscore a lack of multi-view awareness in current MLLMs and call for architectural innovations beyond prompt tuning alone. We believe our benchmark offers valuable insights toward building spatially intelligent MLLMs.

Authors

Keywords

No keywords are indexed for this paper.

Context

Venue
AAAI Conference on Artificial Intelligence
Archive span
1980–2026
Indexed papers
28,718
Paper id
414221086222286629