Attentional mechanisms are order-invariant. Positional encoding is therefore a crucial component that allows attention-based architectures such as the Transformer to handle sequences or images where the position of information matters. In this paper, we propose a novel positional encoding method based on learnable Fourier features. Instead of hard-coding each position as a token or a vector, we represent each position, which can be multi-dimensional, as a trainable encoding based on a learnable Fourier feature mapping, modulated with a multi-layer perceptron. This representation is particularly advantageous for spatial multi-dimensional positions, e.g., pixel positions in an image, where $L_2$ distances or more complex positional relationships need to be captured. Our experiments on several public benchmark tasks show that our learnable Fourier feature representation for multi-dimensional positional encoding outperforms existing methods, both improving accuracy and allowing faster convergence.
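The pipeline described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the layer sizes, ReLU activation, and Gaussian initialization are assumptions made here for brevity; the abstract only specifies that a learnable Fourier feature mapping of the (possibly multi-dimensional) position is followed by an MLP.

```python
import numpy as np

rng = np.random.default_rng(0)

def learnable_fourier_pe(pos, W_r, W1, b1, W2, b2):
    """Encode multi-dimensional positions with learnable Fourier features + MLP.

    pos: (N, M) array of M-dimensional positions (e.g., M=2 for pixel coords).
    W_r: (D/2, M) trainable Fourier-frequency matrix (random init here).
    W1, b1, W2, b2: trainable MLP parameters (illustrative sizes).
    """
    proj = pos @ W_r.T                                   # (N, D/2)
    feats = np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)
    feats = feats / np.sqrt(feats.shape[-1])             # normalize feature scale
    h = np.maximum(0.0, feats @ W1 + b1)                 # ReLU (assumption)
    return h @ W2 + b2                                   # (N, out_dim)

# Illustrative dimensions: 2-D positions -> 64-dim encodings.
M, half_D, H, out_dim = 2, 16, 32, 64
W_r = rng.normal(size=(half_D, M))
W1 = rng.normal(size=(2 * half_D, H)) * 0.1
b1 = np.zeros(H)
W2 = rng.normal(size=(H, out_dim)) * 0.1
b2 = np.zeros(out_dim)

pixels = np.array([[3.0, 7.0], [3.0, 8.0]])              # two pixel positions
enc = learnable_fourier_pe(pixels, W_r, W1, b1, W2, b2)
print(enc.shape)                                         # (2, 64)
```

Because the encoding is built from trigonometric functions of a learned linear projection of the raw coordinates, nearby positions map to nearby encodings, which is what lets the model capture $L_2$-like spatial relationships rather than treating each position as an unrelated token.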