Adaptive Cross-Modal Embeddings for Image-Text Alignment

Jonatas Wehrmann; Camila Kolling; Rodrigo C Barros

AAAI 2020

Conference Paper AAAI Technical Track: Vision Artificial Intelligence

Abstract

In this paper, we introduce ADAPT, a novel approach for training image-text alignment models. Image-text alignment methods are often used for cross-modal retrieval, i.e., to retrieve an image given a query text, or to retrieve captions that successfully label an image. ADAPT is designed to adjust an intermediate representation of instances from a modality a using an embedding vector of an instance from a modality b. This adaptation is designed to filter and enhance important information across internal features, allowing for guided vector representations, which resembles the working of attention modules while being far more computationally efficient. Experimental results on two large-scale image-text alignment datasets show that ADAPT models outperform all the baseline approaches by large margins. In particular, on the Flickr30k dataset, ADAPT with a single model outperforms the state-of-the-art approach by a relative improvement of R@1 ≈ 24% for Image Retrieval and R@1 ≈ 8% for Image Annotation. On MS COCO, it provides an improvement of R@1 ≈ 12% for Image Retrieval and R@1 ≈ 7% for Image Annotation. Code is available at https://github.com/jwehrmann/retrieval.pytorch.
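
To make the adaptation idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the adjustment takes the form of a per-feature scale and shift predicted from the modality-b embedding, which is one plausible reading of "filter and enhance"; the abstract does not specify the exact operation. All function names, weight matrices, and dimensions here are hypothetical, and the sketch uses plain Python lists rather than the PyTorch code in the linked repository.

```python
import math


def linear(vec, weights, bias):
    """Affine map: `weights` is a list of rows, one row per output unit."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def adapt(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Adjust modality-a features using one embedding vector from modality b.

    features: list of n feature vectors (each of length d) from modality a,
              e.g. image regions or caption words.
    cond:     a single embedding vector (length k) from modality b.
    ASSUMPTION: the adjustment is a per-feature scale (gamma) and shift (beta).
    """
    gamma = linear(cond, w_gamma, b_gamma)  # length d; near 0 filters a feature
    beta = linear(cond, w_beta, b_beta)     # length d; additive shift
    return [[g * x + b for x, g, b in zip(vec, gamma, beta)]
            for vec in features]


def mean_pool(features):
    """Collapse n feature vectors into one vector by averaging each dimension."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy example: three image-region vectors (d = 3) and one caption embedding (k = 3).
# Here d == k only so the pooled result can be compared to the caption directly.
regions = [[0.2, 1.0, -0.5], [0.4, 0.1, 0.3], [-0.1, 0.8, 0.6]]
caption = [0.5, -0.2, 0.9]

# Hypothetical projection weights; in a trained model these would be learned.
w_gamma = [[0.3, 0.0, 0.1], [0.0, 0.5, 0.0], [0.2, 0.0, 0.4]]
b_gamma = [1.0, 1.0, 1.0]
w_beta = [[0.0, 0.1, 0.0], [0.1, 0.0, 0.0], [0.0, 0.0, 0.1]]
b_beta = [0.0, 0.0, 0.0]

adapted = adapt(regions, caption, w_gamma, b_gamma, w_beta, b_beta)
score = cosine(mean_pool(adapted), caption)
print("image-caption similarity:", score)
```

Under these assumptions, the cost of the adaptation grows linearly with the number of modality-a features, because a single conditioning vector is applied to all of them. An attention module that compares every region with every word instead scales with the product of the two sequence lengths, which is consistent with the efficiency claim in the abstract.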
