
AAAI 2020

Towards Making the Most of BERT in Neural Machine Translation

Conference Paper | AAAI Technical Track: Natural Language Processing | Artificial Intelligence

Abstract

GPT-2 and BERT demonstrate the effectiveness of using pre-trained language models (LMs) on various natural language processing tasks. However, LM fine-tuning often suffers from catastrophic forgetting when applied to resource-rich tasks. In this work, we introduce a concerted training framework (CTNMT) that is key to integrating pre-trained LMs into neural machine translation (NMT). Our proposed CTNMT consists of three techniques: a) asymptotic distillation to ensure that the NMT model retains the pre-trained knowledge; b) a dynamic switching gate to avoid catastrophic forgetting of pre-trained knowledge; and c) a strategy to adjust the learning paces according to a scheduled policy. Our machine translation experiments show that CTNMT gains up to 3 BLEU on the WMT14 English-German language pair, surpassing the previous state-of-the-art pre-training-aided NMT by 1.4 BLEU. On the large WMT14 English-French task with 40 million sentence pairs, our base model still improves significantly upon the state-of-the-art Transformer-big model by more than 1 BLEU.
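To make the two fusion mechanisms named in the abstract concrete, the sketch below illustrates, in PyTorch, one plausible reading of the dynamic switching gate (a sigmoid gate that mixes frozen BERT states with NMT encoder states) and of asymptotic distillation (an auxiliary MSE term pulling the NMT encoder toward the BERT states). All class, function, and tensor names here are illustrative assumptions, not the authors' released code, and the exact gating and loss formulations in the paper may differ in detail.

import torch
import torch.nn as nn

class DynamicSwitch(nn.Module):
    """Hypothetical gated fusion of pre-trained LM states and NMT encoder states:
    gate = sigmoid(W h_lm + U h_nmt); fused = gate * h_lm + (1 - gate) * h_nmt."""
    def __init__(self, d_model: int):
        super().__init__()
        self.w_lm = nn.Linear(d_model, d_model)
        self.w_nmt = nn.Linear(d_model, d_model)

    def forward(self, h_lm: torch.Tensor, h_nmt: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.w_lm(h_lm) + self.w_nmt(h_nmt))
        return gate * h_lm + (1.0 - gate) * h_nmt


def asymptotic_distillation_loss(h_nmt: torch.Tensor, h_bert: torch.Tensor) -> torch.Tensor:
    """Auxiliary MSE between NMT encoder states and detached (frozen) BERT states,
    added to the translation loss so the encoder retains pre-trained knowledge."""
    return nn.functional.mse_loss(h_nmt, h_bert.detach())


if __name__ == "__main__":
    # Toy usage: batch=2, sequence length=5, hidden size=8.
    switch = DynamicSwitch(d_model=8)
    h_bert = torch.randn(2, 5, 8)   # states from a frozen BERT encoder (assumed shape)
    h_nmt = torch.randn(2, 5, 8)    # states from the NMT Transformer encoder
    fused = switch(h_bert, h_nmt)
    aux = asymptotic_distillation_loss(h_nmt, h_bert)
    print(fused.shape, aux.item())

In this reading, the scheduled policy mentioned in c) would control how strongly the distillation term and the gate contribute as training progresses; that schedule is not reproduced here.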

Authors

Keywords

No keywords are indexed for this paper.

Context

Venue: AAAI Conference on Artificial Intelligence
Archive span: 1980-2026
Indexed papers: 28718
Paper id: 118667325827583002