Recent advances in Transformer models allow for unprecedented sequence lengths, due to linear space and time complexity. In the meantime, relative positional encoding (RPE) was proposed as beneficial for classical Transformers and consists in exploiting lags instead of absolute positions for inference. Still, RPE is not available for the recent linear variants of the Transformer, because it requires the explicit computation of the attention matrix, which is precisely what is avoided by such methods. In this paper, we bridge this gap and present Stochastic Positional Encoding as a way to generate PE that can be used as a replacement for the classical additive (sinusoidal) PE and provably behaves like RPE. The main theoretical contribution is to make a connection between positional encoding and cross-covariance structures of correlated Gaussian processes. We illustrate the performance of our approach on the Long-Range Arena benchmark and on music generation.
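The following is a minimal, illustrative sketch (not the paper's reference implementation) of the core idea stated above: a stationary relative-positional kernel can be reproduced, in expectation, as the cross-covariance of correlated random sinusoidal features attached to queries and keys, so no attention matrix ever needs to be formed. All names (`target_kernel`, `num_realizations`, the damped-cosine kernel choice, etc.) are our own assumptions for the example, not taken from the paper.

```python
# Sketch: stochastic positional codes whose cross-covariance matches a chosen
# stationary relative kernel P(m - n). Assumption-laden toy example, not the
# authors' code.
import numpy as np

rng = np.random.default_rng(0)

num_positions = 256        # sequence length M
num_realizations = 4096    # R: more realizations -> closer match to P

# Target relative kernel P(delta): an arbitrary damped cosine, for illustration.
deltas = np.arange(-num_positions + 1, num_positions)
target_kernel = np.exp(-np.abs(deltas) / 64.0) * np.cos(2 * np.pi * deltas / 32.0)
n_lags = target_kernel.size  # 2 * num_positions - 1

# Wiener-Khinchin: the power spectral density of a stationary kernel is its
# Fourier transform; sample frequencies proportionally to it (Bochner's theorem).
psd = np.clip(np.real(np.fft.fft(np.fft.ifftshift(target_kernel))), 0.0, None)
freqs = np.fft.fftfreq(n_lags)
probs = psd / psd.sum()

sampled_f = rng.choice(freqs, size=num_realizations, p=probs)
phases = rng.uniform(0.0, 2 * np.pi, size=num_realizations)

# Gain chosen so that E[(q_pe @ k_pe.T)[m, n]] = P(m - n):
# E_phase[cos(a + phi) cos(b + phi)] = 0.5 cos(a - b), and averaging over the
# normalised PSD recovers P up to the factor psd.sum() / n_lags.
gain = np.sqrt(2.0 * psd.sum() / (n_lags * num_realizations))

positions = np.arange(num_positions)[:, None]                 # (M, 1)
angles = 2 * np.pi * sampled_f[None, :] * positions + phases  # (M, R)

# Same frequencies and phases for queries and keys -> their product depends,
# in expectation, only on the lag m - n (a relative encoding).
q_pe = gain * np.cos(angles)
k_pe = gain * np.cos(angles)

approx = q_pe @ k_pe.T                                        # (M, M), ≈ P(m - n)
lag_index = (positions - positions.T) + (num_positions - 1)   # map lag to kernel index
err = np.abs(approx - target_kernel[lag_index]).mean()
print(f"mean abs error vs. target kernel: {err:.3f}")
```

In a linear-complexity attention layer, `q_pe` and `k_pe` would be combined with the content-based query and key features and fed through the usual feature-map machinery, so the relative-positional behaviour is obtained without ever materialising the attention matrix; the explicit `q_pe @ k_pe.T` product above is computed only to check the approximation.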