Comprehensive Visual Grounding for Video Description

Wenhui Jiang; Yibo Cheng; Linxin Liu; Yuming Fang; Yuxin Peng; Yang Liu

doi:10.1609/aaai.v38i3.28032


AAAI 2024


Conference Paper AAAI Technical Track on Computer Vision II Artificial Intelligence


Abstract

The grounding accuracy of existing video captioners still falls short of expectations. Most existing methods perform grounded video captioning on sparse entity annotations, and their captioning accuracy often suffers from degraded object appearance in the annotated area, caused for example by motion blur and video defocus. Moreover, these methods seldom consider the complex interactions among entities. In this paper, we propose a comprehensive visual grounding network to improve video captioning by explicitly linking entities and actions to visual clues across video frames. Specifically, the network consists of spatial-temporal entity grounding and action grounding. The proposed entity grounding encourages the attention mechanism to focus on informative spatial areas across video frames, even though each entity is annotated in only one frame of a video. The action grounding dynamically associates verbs with their related subjects and the corresponding context, which preserves fine-grained spatial and temporal details for action prediction. Both entity grounding and action grounding are formulated as a unified task guided by soft grounding supervision, which simplifies the architecture and improves training efficiency. We conduct extensive experiments on two challenging datasets and demonstrate significant performance improvements of +2.3 CIDEr on ActivityNet-Entities and +2.2 CIDEr on MSR-VTT over state-of-the-art methods.
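
The abstract does not give the form of the soft grounding supervision, so the following is only an illustrative sketch of one common way to guide attention with a soft target, and not necessarily the formulation used in the paper. It assumes that attention weights over candidate regions are compared with a soft target distribution through a cross-entropy loss. The function names, the overlap-based target, and the temperature value are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_from_overlap(overlaps, temperature=0.5):
    """Hypothetical helper: turn region/annotation overlaps into a soft
    distribution instead of a one-hot label (an assumption, not from the paper)."""
    return softmax([o / temperature for o in overlaps])

def soft_grounding_loss(attention, target, eps=1e-12):
    """Cross-entropy between a soft target and attention weights.
    Lower values mean the attention agrees better with the target."""
    return -sum(t * math.log(a + eps) for t, a in zip(target, attention))

# Three candidate regions; the first overlaps the annotated entity the most.
target = soft_target_from_overlap([0.8, 0.1, 0.0])

# Attention that concentrates on the first region versus on the third.
attention_good = softmax([2.0, 0.1, 0.1])
attention_bad = softmax([0.1, 0.1, 2.0])

loss_good = soft_grounding_loss(attention_good, target)
loss_bad = soft_grounding_loss(attention_bad, target)
```

In this toy setup, attention concentrated on the best-overlapping region yields a lower loss than attention concentrated elsewhere. Because the target is a distribution rather than a one-hot label, other regions still receive some credit.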
