In this work, we investigate the positional encoding methods used in language pre-training (e.g., BERT) and identify several problems in the existing formulations. First, we show that in absolute positional encoding, the addition operation applied to positional embeddings and word embeddings introduces mixed correlations between these two heterogeneous information sources. This may bring unnecessary randomness into the attention and further limit the expressiveness of the model. Second, we question whether treating the position of the \texttt{[CLS]} symbol the same as that of other words is a reasonable design, considering its special role (the representation of the entire sentence) in downstream tasks. Motivated by the above analysis, we propose a new positional encoding method called \textbf{T}ransformer with \textbf{U}ntied \textbf{P}ositional \textbf{E}ncoding (TUPE). In the self-attention module, TUPE computes the word contextual correlation and the positional correlation separately with different parameterizations and then adds them together. This design removes the mixed and noisy correlations over heterogeneous embeddings and offers more expressiveness by using different projection matrices. Furthermore, TUPE unties the \texttt{[CLS]} symbol from other positions, making it easier to capture information from all positions. Extensive experiments and ablation studies on the GLUE benchmark demonstrate the effectiveness of the proposed method. Code and models are released at https://github.com/guolinke/TUPE.
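To make the untied computation concrete, below is a minimal sketch (not the released implementation) of attention logits that keep word-to-word and position-to-position correlations separate. The module name \texttt{UntiedPositionalAttention}, the linear-layer parameterization, and the maximum length are illustrative assumptions; the \texttt{[CLS]} untying and the softmax/value aggregation are omitted for brevity.

\begin{verbatim}
import math
import torch
import torch.nn as nn

class UntiedPositionalAttention(nn.Module):
    """Sketch: untied content/position attention logits (assumed names/shapes).

    Word (content) correlations and positional correlations use separate
    projection matrices and are summed, instead of adding word and position
    embeddings before a shared projection.
    """

    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        self.d_model = d_model
        # Projections for word (content) correlations.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        # Separate projections for positional correlations.
        self.u_q = nn.Linear(d_model, d_model)
        self.u_k = nn.Linear(d_model, d_model)
        # Absolute position embeddings (illustrative parameterization).
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model), word representations only.
        bsz, seq_len, _ = x.shape
        pos = self.pos_emb(torch.arange(seq_len, device=x.device))

        # Word-to-word correlation.
        content_scores = self.w_q(x) @ self.w_k(x).transpose(-2, -1)
        # Position-to-position correlation, shared across the batch.
        pos_scores = self.u_q(pos) @ self.u_k(pos).transpose(-2, -1)

        # Scale the sum so its variance stays comparable to standard
        # scaled dot-product attention with a single term.
        return (content_scores + pos_scores.unsqueeze(0)) / math.sqrt(2 * self.d_model)
\end{verbatim}

The key design choice illustrated here is that positions never enter the content projections, so the attention logits decompose cleanly into a contextual term and a positional term with their own capacity.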