Do LVLMs Truly Understand Video Anomalies? Revealing Hallucination via Co-Occurrence Patterns

Menghao Zhang; Huazheng Wang; Pengfei Ren; Kangheng Lin; Qi Qi; Haifeng Sun; Zirui Zhuang; Lei Zhang; Jianxin Liao; Jingyu Wang

NeurIPS 2025

Conference Paper · Main Conference Track · Artificial Intelligence · Machine Learning

Abstract

Large Vision-Language Models (LVLMs) pretrained on large-scale multimodal data have shown promising capabilities in Video Anomaly Detection (VAD). However, their ability to reason about abnormal events based on scene semantics remains underexplored. In this paper, we investigate LVLMs’ behavior in VAD from a visual-textual co-occurrence perspective, focusing on whether their decisions are driven by statistical shortcuts between visual instances and textual phrases. By analyzing visual-textual co-occurrence in pretraining data and conducting experiments under different data settings, we reveal a hallucination phenomenon: LVLMs tend to rely on co-occurrence patterns between visual instances and textual phrases associated with either normality or abnormality, leading to incorrect predictions when these high-frequency objects appear in semantically mismatched contexts. To address this issue, we propose VAD-DPO, a direct preference optimization method supervised with counter-example pairs. By constructing visually similar but semantically contrasting video clips, VAD-DPO encourages the model to align its predictions with the semantics of the scene rather than relying on co-occurrence patterns. Extensive experiments on six benchmark datasets demonstrate the effectiveness of VAD-DPO in enhancing both anomaly detection and reasoning performance, particularly in scene-dependent scenarios.
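
The abstract does not spell out the VAD-DPO objective, so the following is only a minimal sketch of the standard direct preference optimization loss that the method's name refers to, not the authors' implementation. It assumes that, for one video clip, the "chosen" response is the scene-consistent answer and the "rejected" response is the co-occurrence-driven one. The function name `dpo_loss`, its arguments, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair (illustrative sketch).

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model. `beta` is
    an assumed value; the abstract does not report hyperparameters.
    """
    # Implicit reward of each response: how much the policy has moved
    # away from the reference model on that response.
    chosen_reward = policy_logp_chosen - ref_logp_chosen
    rejected_reward = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_reward - rejected_reward)

    # Loss = -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities for one counter-example pair.
loss_before = dpo_loss(-12.0, -12.0, -12.0, -12.0)  # policy equals reference
loss_after = dpo_loss(-10.0, -14.0, -12.0, -12.0)   # policy prefers chosen
print(loss_before > loss_after)
```

When the policy matches the reference, the margin is zero and the loss equals log 2. The loss falls as the policy assigns relatively more probability to the scene-consistent response than to the shortcut-driven one.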
