How to explicitly encode positional information into neural networks is an important question for learning natural language representations with models such as BERT. In the Transformer architecture, positional information is typically encoded either as embedding vectors added at the input layer or as a bias term in the self-attention module. In this work, we investigate the problems in these previous formulations and propose a new positional encoding method for BERT called Transformer with Untied Positional Encoding (TUPE). Unlike other approaches, TUPE takes only the word embeddings as input. In the self-attention module, the word contextual correlation and the positional correlation are computed separately with different parameterizations and then added together. This design removes the addition over heterogeneous embeddings at the input, which may introduce noise, and offers more expressiveness for characterizing the relationships between words and positions through separate projection matrices. Furthermore, TUPE unties the [CLS] symbol from the other positions, giving it a more specific role in capturing the global representation of the sentence. Extensive experiments and ablation studies on the GLUE benchmark demonstrate the effectiveness and efficiency of the proposed method: TUPE outperforms several baselines on almost all tasks by a large margin. In particular, it achieves a higher score than the baselines while using only 30% of the pre-training computational cost. We release our code at https://github.com/guolinke/TUPE.
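The core idea of computing the word contextual correlation and the positional correlation with separate parameterizations and summing them can be sketched as follows. This is a minimal illustrative NumPy sketch, not the released implementation: the function name, the joint 1/sqrt(2d) scaling of the two terms, and the random toy inputs are assumptions made for illustration.

```python
# Minimal sketch of untied attention scores (illustrative, not the authors' code):
# word-to-word correlation and position-to-position correlation are computed with
# separate projection matrices and then summed before the softmax.
import numpy as np

def untied_attention_scores(x, p, Wq, Wk, Uq, Uk):
    """x: (n, d) word embeddings; p: (n, d) absolute position embeddings.
    Wq, Wk project words; Uq, Uk project positions (all of shape (d, d))."""
    d = x.shape[-1]
    scale = 1.0 / np.sqrt(2 * d)             # assumed scaling over the summed terms
    word_term = (x @ Wq) @ (x @ Wk).T        # contextual correlation between words
    pos_term = (p @ Uq) @ (p @ Uk).T         # positional correlation, separate parameters
    scores = scale * (word_term + pos_term)  # the two untied terms are simply added
    # softmax over keys to obtain attention weights
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Toy usage: 4 tokens, hidden size 8.
rng = np.random.default_rng(0)
n, d = 4, 8
x, p = rng.normal(size=(n, d)), rng.normal(size=(n, d))
Wq, Wk, Uq, Uk = (rng.normal(size=(d, d)) for _ in range(4))
attn = untied_attention_scores(x, p, Wq, Wk, Uq, Uk)
print(attn.shape)  # (4, 4); each row sums to 1
```

Because the positional term depends only on positions, not on word content, it can be computed once per layer and shared across a batch; the untying of [CLS] described above additionally replaces the positional term for that symbol so it is not biased toward attending to nearby positions.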