TargetVAU: Multimodal Anomaly-Aware Reasoning for Target Behavior Understanding in Videos

Lingru Zhou; Peng Wu; Manqing Zhang; Qingsheng Wang; Guansong Pang; Peng Wang

doi:10.1609/aaai.v40i16.38378

AAAI 2026

Conference Paper; AAAI Technical Track on Computer Vision XIII; Artificial Intelligence

Abstract

Understanding anomalous human behaviors at a fine-grained level remains a major challenge in complex scenarios. Existing video anomaly understanding (VAU) methods often rely on coarse frame-level cues or overlook structured modeling of individual actions, limiting their capacity for reasoning about human interactions and accountability. To address these challenges, we propose TargetVAU, a multimodal anomaly-aware reasoning framework designed for individual-level anomaly recognition and explanation. TargetVAU first extracts both global-level and human-centric visual features using a frozen Vision Transformer (ViT) encoder. An Anomaly-focused Temporal Sampler is then employed to select behaviorally informative frames via a density-aware strategy guided by predicted anomaly scores. A Spatio-Temporal Interaction Graph is constructed to explicitly model interactions among individuals across time and space. These structured representations are fused with prompt embeddings via a frozen Q-Former to form a unified semantic representation. Finally, a large language model fine-tuned with low-rank adaptation (LoRA) performs instruction-guided reasoning to identify anomalous individuals and generate natural language explanations. Extensive experiments on UCCD and HIVAU-70K demonstrate that TargetVAU significantly outperforms existing methods in both accuracy and interpretability, advancing the state of individual-level anomaly understanding in surveillance videos.
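The abstract does not specify how the Anomaly-focused Temporal Sampler turns predicted anomaly scores into a frame selection. As a rough illustration of what a density-aware strategy could look like, the sketch below places sampling points at equal intervals of cumulative anomaly mass, so frames are drawn more densely where scores are high. This is an assumption for illustration, not the authors' implementation: the function name `density_aware_sample`, the `floor` parameter, and the inverse-CDF scheme are all hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def density_aware_sample(scores, k, floor=0.1):
    """Pick up to k frame indices, denser where anomaly scores are higher.

    Hypothetical sketch: `scores` holds one predicted anomaly score per
    frame, and `floor` is an assumed constant that keeps some sampling
    mass on low-score frames so normal context is not dropped entirely.
    """
    if k <= 0 or not scores:
        return []

    # Per-frame sampling weight: clipped anomaly score plus a small floor.
    weights = [max(s, 0.0) + floor for s in scores]

    # Running total of the weights (an unnormalized CDF over frames).
    cdf = list(accumulate(weights))
    total = cdf[-1]

    picked = []
    for i in range(k):
        # Place k targets at equal steps of cumulative weight.
        target = (i + 0.5) / k * total
        idx = min(bisect_left(cdf, target), len(scores) - 1)
        # Targets increase, so indices are non-decreasing; skip repeats.
        if not picked or picked[-1] != idx:
            picked.append(idx)
    return picked


# Example: a 25-frame clip whose frames 10..14 score high.
example_scores = [0.0] * 10 + [1.0] * 5 + [0.0] * 10
example_frames = density_aware_sample(example_scores, k=8)
```

With uniform scores the selection reduces to evenly spaced frames; with a score peak, most selected indices fall inside the peak. The function can return fewer than `k` indices when several targets land on the same frame.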
