Arrow Research search
Back to AAAI

AAAI 2026

TargetVAU: Multimodal Anomaly-Aware Reasoning for Target Behavior Understanding in Videos

Conference Paper AAAI Technical Track on Computer Vision XIII Artificial Intelligence

Abstract

Understanding anomalous human behaviors at a fine-grained level remains a major challenge in complex scenarios. Existing video anomaly understanding (VAU) methods often rely on coarse frame-level cues or overlook structured modeling of individual actions, limiting their capacity for reasoning about human interactions and accountability. To address these challenges, we propose TargetVAU, a multimodal anomaly-aware reasoning framework designed for individual-level anomaly recognition and explanation. TargetVAU first extracts both global-level and human-centric visual features using a frozen Vision Transformer (ViT) encoder. An Anomaly-focused Temporal Sampler is then employed to select behaviorally informative frames via a density-aware strategy guided by predicted anomaly scores. A Spatio-Temporal Interaction Graph is constructed to explicitly model interactions among individuals across time and space. These structured representations are fused with prompt embeddings via a frozen Q-Former to form a unified semantic representation. Finally, a large language model fine-tuned with low-rank adaptation (LoRA) performs instruction-guided reasoning to identify anomalous individuals and generate natural language explanations. Extensive experiments on UCCD and HIVAU-70K demonstrate that TargetVAU significantly outperforms existing methods in both accuracy and interpretability, advancing the state of individual-level anomaly understanding in surveillance videos.

Authors

Keywords

No keywords are indexed for this paper.

Context

Venue
AAAI Conference on Artificial Intelligence
Archive span
1980-2026
Indexed papers
28718
Paper id
1105973748062912892