
AAAI 2021

Global Fusion Attention for Vision and Language Understanding (Student Abstract)

Short Paper AAAI Student Abstract and Poster Program Artificial Intelligence

Abstract

We extend the popular Transformer architecture into a multimodal model that processes both visual and textual inputs. We propose a new attention mechanism on a Transformer-based architecture for joint vision and language understanding tasks. Our model fuses multi-level comprehension between images and texts in a weighted manner, which better captures their internal relationships. Experiments on the benchmark VQA dataset CLEVR demonstrate the effectiveness of the proposed attention mechanism. We also observe improved sample efficiency in reinforcement learning through experiments on the grounded language understanding tasks of the BabyAI platform.
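The abstract describes weighted fusion of attention across the image and text streams but does not give the exact formulation. As a rough illustration only, the sketch below shows one generic way such a weighted cross-modal attention fusion can be wired up: the text stream attends both to itself and to image features, and the two results are mixed by a fusion weight. All names (`fused_attention`, `alpha`) are hypothetical, and in a real model the fusion weight would be learned rather than fixed.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: queries attend over keys/values."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def fused_attention(text, image, alpha=0.5):
    """Weighted fusion of intra-modal and cross-modal attention for the
    text stream. `alpha` is a hypothetical fusion weight; the paper's
    actual mechanism may differ."""
    self_out = attention(text, text, text)     # text attends to text
    cross_out = attention(text, image, image)  # text attends to image regions
    return alpha * self_out + (1 - alpha) * cross_out

rng = np.random.default_rng(0)
text = rng.standard_normal((4, 8))   # 4 text tokens, dim 8
image = rng.standard_normal((6, 8))  # 6 image regions, dim 8
out = fused_attention(text, image, alpha=0.5)
print(out.shape)  # (4, 8): one fused vector per text token
```

With `alpha=1.0` the fusion degenerates to pure text self-attention, and with `alpha=0.0` to pure text-to-image cross-attention, so the weight interpolates between intra- and cross-modal information flow.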

Authors

Keywords

No keywords are indexed for this paper.

Context

Venue
AAAI Conference on Artificial Intelligence
Archive span
1980-2026
Indexed papers
28718
Paper id
54007396461722607