We propose a new class of linear Transformers called FourierLearner-Transformers (FLTs), which incorporate a wide range of relative positional encoding (RPE) mechanisms. These include regular RPE techniques applied to non-geometric data, as well as novel RPEs operating on sequences of tokens embedded in higher-dimensional Euclidean spaces (e.g., point clouds). FLTs construct the optimal RPE mechanism implicitly by learning its spectral representation. In contrast to other architectures combining efficient low-rank linear attention with RPEs, FLTs remain practical in terms of memory usage and do not require additional assumptions about the structure of the RPE mask. FLTs also allow for applying certain structural inductive-bias techniques to specify masking strategies; for example, they provide a way to learn the so-called local RPEs introduced in this paper, which yield accuracy gains over several other linear Transformers for language modeling. We also thoroughly test FLTs on other data modalities and tasks, such as image classification and 3D molecular modeling. For 3D data, FLTs are, to the best of our knowledge, the first Transformer architectures providing RPE-enhanced linear attention.
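To make the central claim concrete, below is a minimal NumPy sketch of how a Fourier (spectral) representation of an RPE mask can be folded into low-rank linear attention. It assumes a Performer-style positive feature map and uses random stand-ins for the learned spectral weights; all variable names are hypothetical and not taken from the paper's code. The point illustrated is that the Fourier representation factorizes the mask f(r_i - r_j) into per-token terms, so the masked attention never materializes the L x L matrix and stays linear in the sequence length L.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, m_feat, m_freq = 6, 8, 16, 32           # sequence length, head dim, #kernel features, #frequencies

Q = rng.normal(size=(L, d))
K = rng.normal(size=(L, d))
V = rng.normal(size=(L, d))
r = np.arange(L, dtype=float)[:, None]        # token positions (1D here; could be 3D coordinates)

def phi(X, W):
    """Positive random-feature map approximating the softmax kernel (Performer-style)."""
    return np.exp(X @ W.T - 0.5 * np.sum(X**2, axis=-1, keepdims=True)) / np.sqrt(W.shape[0])

W = rng.normal(size=(m_feat, d))              # random projections for the kernel features
omega = rng.normal(size=(m_freq, r.shape[1])) # sampled frequencies
g = rng.normal(size=(m_freq,))                # stand-in for the learned spectral weights

# Fourier factorization of the RPE mask:
#   f(r_i - r_j) = (1/m) * sum_k g_k exp(i w_k r_i) exp(-i w_k r_j)
psi_q = np.exp(1j * r @ omega.T) * (g / m_freq)   # (L, m_freq)
psi_k = np.exp(-1j * r @ omega.T)                 # (L, m_freq)

# Extended feature maps: outer product of kernel features and positional features.
Qp = (phi(Q, W)[:, :, None] * psi_q[:, None, :]).reshape(L, -1)
Kp = (phi(K, W)[:, :, None] * psi_k[:, None, :]).reshape(L, -1)

# Linear-attention pass with the implicit RPE mask (no L x L matrix is formed).
num = np.real(Qp @ (Kp.T @ V))                    # (L, d)
den = np.real(Qp @ Kp.sum(axis=0))                # (L,)

# Sanity check against explicit masked attention.
A = phi(Q, W) @ phi(K, W).T                       # kernelized attention scores
M = np.real(psi_q @ psi_k.T)                      # explicit mask f(r_i - r_j)
print(np.allclose(num, (A * M) @ V), np.allclose(den, (A * M).sum(axis=1)))
```

In a trained model the spectral weights would be produced by a learned parameterization rather than sampled at random; the sketch only checks that the factorized linear-attention computation agrees with explicitly applying the RPE mask.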