Transformers are powerful for sequence modeling. Nearly all state-of-the-art language models and pre-trained language models are based on the Transformer architecture. However, the Transformer distinguishes sequential tokens only by their token position indices. We hypothesize that the Transformer can generate better contextual representations when given richer positional information. To verify this, we propose a segment-aware Transformer (Segatron), which replaces the original token position encoding with a combined position encoding of paragraph, sentence, and token. We first introduce the segment-aware mechanism into Transformer-XL, a popular Transformer-based language model with memory extension and relative position encoding. We find that our method further improves both the Transformer-XL base and large models, achieving 17.1 perplexity on the WikiText-103 dataset. We further investigate the masked language modeling pre-training task with Segatron. Experimental results show that BERT pre-trained with Segatron (SegaBERT) outperforms BERT with the vanilla Transformer on various NLP tasks, and outperforms RoBERTa on zero-shot sentence representation learning.
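To make the combined position encoding concrete, the following is a minimal sketch of how paragraph, sentence, and token position embeddings could be summed into one positional signal, in the spirit of the absolute-position variant used for SegaBERT. The class name, vocabulary sizes, and index layout here are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SegmentAwarePositionEmbedding(nn.Module):
    """Illustrative combined paragraph/sentence/token position embedding.

    Assumption: each token carries three indices -- the index of its
    paragraph in the document, the index of its sentence, and its ordinary
    token position -- and the three embeddings are summed.
    """

    def __init__(self, hidden_size, max_paragraphs=64, max_sentences=128, max_tokens=512):
        super().__init__()
        self.paragraph_emb = nn.Embedding(max_paragraphs, hidden_size)
        self.sentence_emb = nn.Embedding(max_sentences, hidden_size)
        self.token_emb = nn.Embedding(max_tokens, hidden_size)

    def forward(self, paragraph_ids, sentence_ids, token_ids):
        # Each *_ids tensor has shape (batch, seq_len).
        return (self.paragraph_emb(paragraph_ids)
                + self.sentence_emb(sentence_ids)
                + self.token_emb(token_ids))


# Usage sketch: four tokens forming one paragraph with two sentences.
paragraph_ids = torch.tensor([[0, 0, 0, 0]])  # all tokens in paragraph 0
sentence_ids = torch.tensor([[0, 0, 1, 1]])   # sentence index within the document
token_ids = torch.tensor([[0, 1, 2, 3]])      # standard token position index
emb = SegmentAwarePositionEmbedding(hidden_size=16)
print(emb(paragraph_ids, sentence_ids, token_ids).shape)  # torch.Size([1, 4, 16])
```

The resulting embedding would be added to the token embeddings in place of the single token position embedding; how the same idea is folded into Transformer-XL's relative position encoding is a separate design choice not shown here.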