AAAI 2022
Attention-Aligned Transformer for Image Captioning
Abstract
Recently, attention-based image captioning models, which are expected to ground the correct image regions when generating words, have achieved remarkable performance. However, researchers have pointed out a "deviated focus" problem in existing attention mechanisms when determining the effective and influential image features. In this paper, we present A2, an attention-aligned Transformer for image captioning, which guides attention learning in a perturbation-based, self-supervised manner without any annotation overhead. Specifically, we apply a mask operation to image regions through a learnable network to estimate each region's true contribution to the final description generation. We hypothesize that the truly necessary image region features, i.e., those where a small perturbation causes an obvious performance degradation, deserve larger attention weights. We then propose four alignment strategies that use this information to refine the attention weight distribution. Under this scheme, image regions are attended in correct correspondence with the output words. Extensive experiments conducted on the MS COCO dataset demonstrate that the proposed A2 Transformer consistently outperforms baselines in both automatic metrics and human evaluation. Trained models and code for reproducing the experiments are publicly available.
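The core idea in the abstract, i.e., mask a region, measure how much the caption score degrades, and push attention toward high-degradation regions, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `caption_score` is a hypothetical stand-in for the captioning model's log-likelihood, and `align_attention` shows only one plausible alignment strategy (the paper proposes four).

```python
import numpy as np

def caption_score(region_feats, attn):
    # Hypothetical proxy for the caption log-likelihood: an
    # attention-weighted sum of region features. The real A2 model
    # scores captions with a Transformer decoder.
    return float(attn @ region_feats.sum(axis=1))

def perturbation_importance(region_feats, attn):
    """Mask each region in turn and record the score degradation.

    Regions whose masking hurts the score most are deemed necessary
    and should receive larger attention weights.
    """
    base = caption_score(region_feats, attn)
    drops = np.zeros(len(region_feats))
    for i in range(len(region_feats)):
        masked = region_feats.copy()
        masked[i] = 0.0  # mask (zero out) region i
        drops[i] = base - caption_score(masked, attn)
    return drops

def align_attention(attn, drops, temperature=1.0):
    # One simple alignment strategy: softmax the degradation scores
    # into a target distribution, then blend it with the current
    # attention so the update stays smooth.
    target = np.exp(drops / temperature)
    target /= target.sum()
    return 0.5 * attn + 0.5 * target

rng = np.random.default_rng(0)
feats = rng.standard_normal((5, 8))  # 5 image regions, 8-dim features
attn = np.full(5, 0.2)               # uniform initial attention
drops = perturbation_importance(feats, attn)
new_attn = align_attention(attn, drops)
print(new_attn, new_attn.sum())
```

The blend keeps the refined weights a valid distribution (they still sum to 1) while shifting mass toward regions whose removal degrades the score most, which is the alignment behavior the abstract describes.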
Context
- Venue: AAAI Conference on Artificial Intelligence