ConvBERT: Improving BERT with Span-based Dynamic Convolution

Zi-Hang Jiang; Weihao Yu; Daquan Zhou; Yunpeng Chen; Jiashi Feng; Shuicheng Yan

NeurIPS 2020

Conference Paper Artificial Intelligence · Machine Learning

Abstract

Pre-trained language models like BERT and its variants have recently achieved impressive performance in various natural language understanding tasks. However, BERT relies heavily on the global self-attention block and thus suffers from a large memory footprint and computation cost. Although all of its attention heads query the whole input sequence to generate the attention map from a global perspective, we observe that some heads only need to learn local dependencies, which implies computational redundancy. We therefore propose a novel span-based dynamic convolution to replace these self-attention heads and directly model local dependencies. The novel convolution heads, together with the remaining self-attention heads, form a new mixed attention block that is more efficient at both global and local context learning. We equip BERT with this mixed attention design and build a ConvBERT model. Experiments show that ConvBERT significantly outperforms BERT and its variants in various downstream tasks, with lower training cost and fewer model parameters. Remarkably, the ConvBERT-base model achieves a GLUE score of 86.4, 0.7 higher than ELECTRA-base, while using less than 1/4 of the training cost. Code and pre-trained models will be released.
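The mixed attention idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give the layer's equations, so the function names, the parameter layout, the mean-pooled "span-aware key", and the single-head-per-branch setup below are all assumptions chosen for brevity. The sketch only shows the contrast the abstract draws: one head attends over the whole sequence, while the other applies a per-position kernel, generated from a local span, to a fixed-width window.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_windows(x, span):
    """Return (n, span, d) windows centred on each position (zero padded).
    Assumes an odd span."""
    n, pad = x.shape[0], span // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([xp[i:i + n] for i in range(span)], axis=1)

def self_attention_head(x, wq, wk, wv):
    """Global head: every position attends to the whole sequence."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v                      # (n, d_head)

def span_dynamic_conv_head(x, wq, wk, wv, wf, span=3):
    """Local head (illustrative): a kernel of width `span` is generated
    per position from the query and a span-level key, then applied to
    the local window of values."""
    q, v = x @ wq, x @ wv
    # Assumption: the span-level key is a projected mean over the window.
    ks = local_windows(x, span).mean(axis=1) @ wk   # (n, d_head)
    kernel = softmax((q * ks) @ wf)                 # (n, span), rows sum to 1
    return np.einsum("ns,nsd->nd", kernel, local_windows(v, span))

def init_params(rng, d_model, d_head, span):
    """Hypothetical helper: random projections for one head of each kind."""
    w = lambda a, b: rng.normal(size=(a, b)) / np.sqrt(a)
    return {
        "attn": (w(d_model, d_head), w(d_model, d_head), w(d_model, d_head)),
        "conv": (w(d_model, d_head), w(d_model, d_head), w(d_model, d_head),
                 w(d_head, span)),
    }

def mixed_attention(x, params, span=3):
    """Concatenate the global head and the local convolution head."""
    a = self_attention_head(x, *params["attn"])
    c = span_dynamic_conv_head(x, *params["conv"], span=span)
    return np.concatenate([a, c], axis=-1)          # (n, 2 * d_head)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 16))                    # 8 tokens, d_model = 16
    params = init_params(rng, d_model=16, d_head=4, span=3)
    print(mixed_attention(x, params).shape)
```

In this sketch the convolution head touches only `span` neighbouring positions per token, so its cost grows linearly with sequence length, whereas the attention head computes an n-by-n score matrix. That difference is the kind of redundancy the abstract argues can be removed for heads that only need local context.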
