Position modeling plays a critical role in Transformers. In this paper, we focus on length extrapolation, i.e., training on short texts while evaluating longer sequences. We define attention resolution as an indicator of extrapolation capability. We then propose two designs that improve attention resolution in Transformers. Specifically, we introduce a relative position embedding that explicitly maximizes attention resolution. Moreover, we apply blockwise causal attention during inference for better resolution on long sequences. We evaluate different Transformer variants on language modeling. Experimental results show that our model achieves strong performance in both interpolation and extrapolation settings. The code will be available at https://aka.ms/LeX-Transformer.
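As a rough illustration of the blockwise causal attention idea mentioned in the abstract, the sketch below builds a boolean attention mask in which each query attends only to keys in its own block and the immediately preceding block, under the usual causal (no-future) constraint. The helper name `blockwise_causal_mask`, the choice of block size, and the attend-to-previous-block rule are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def blockwise_causal_mask(seq_len: int, block_len: int) -> torch.Tensor:
    """Boolean mask (True = attend) for one common formulation of blockwise
    causal attention: each query may attend to keys in its own block and the
    immediately preceding block, never to future positions.

    `block_len` would typically be set to the training length; this is a
    sketch of the general idea, not the paper's exact implementation.
    """
    q_idx = torch.arange(seq_len).unsqueeze(1)   # (seq_len, 1) query positions
    k_idx = torch.arange(seq_len).unsqueeze(0)   # (1, seq_len) key positions
    causal = k_idx <= q_idx                      # forbid attending to the future
    same_or_prev_block = (q_idx // block_len - k_idx // block_len) <= 1
    return causal & same_or_prev_block

# Example: evaluating a sequence twice as long as a hypothetical training length of 4.
mask = blockwise_causal_mask(seq_len=8, block_len=4)
print(mask.int())
```

In this formulation, positions beyond the training length still see a bounded local window of cached keys, which is one way to keep inference-time attention patterns close to those seen during training.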