Since 2017, Transformer-based models have played critical roles in various downstream Natural Language Processing tasks. However, a common limitation of the attention mechanism in the Transformer encoder is that it cannot automatically capture word order information, so explicit position embeddings generally need to be fed into the target model. In contrast, the Transformer decoder with causal attention masks is naturally sensitive to word order. In this work, we focus on improving the position encoding ability of BERT with causal attention masks. Furthermore, we propose a new pre-trained language model, DecBERT, and evaluate it on the GLUE benchmark. Experimental results show that (1) the causal attention mask is effective for BERT on language understanding tasks; (2) our DecBERT model without position embeddings achieves comparable performance on the GLUE benchmark; and (3) our modification accelerates the pre-training process, and DecBERT w/ PE achieves better overall performance than the baseline systems when pre-training with the same amount of computational resources.
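To make the word-order argument concrete, the following is a minimal sketch (not the authors' implementation; the function name and shapes are illustrative) of causal attention: each position may only attend to itself and earlier positions, so the attention pattern itself encodes order even without explicit position embeddings.

```python
import torch

def causal_attention(q, k, v):
    # q, k, v: (batch, seq_len, d) -- toy single-head attention for illustration
    d = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5  # (batch, seq_len, seq_len)
    seq_len = scores.size(-1)
    # Lower-triangular mask: position i can attend only to positions <= i.
    causal_mask = torch.tril(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=scores.device)
    )
    scores = scores.masked_fill(~causal_mask, float("-inf"))  # block future tokens
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v)

# Toy usage: without the mask, permuting the input tokens would permute the output
# rows identically; with the causal mask, each output depends on the prefix order.
q = k = v = torch.randn(2, 5, 8)
out = causal_attention(q, k, v)  # shape: (2, 5, 8)
```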