In this paper we provide, to the best of our knowledge, the first comprehensive approach for incorporating various masking mechanisms into Transformer architectures in a scalable way. We show that recent results on linear causal attention (Choromanski et al., 2021) and log-linear RPE-attention (Luo et al., 2021) are special cases of this general mechanism. However, by casting the problem as a topological (graph-based) modulation of unmasked attention, we obtain several previously unknown results, including efficient d-dimensional RPE-masking and graph-kernel masking. We leverage many mathematical techniques, ranging from spectral analysis through dynamic programming and random walks to new algorithms for solving Markov processes on graphs. We provide a corresponding empirical evaluation.
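To make the central notion concrete, the sketch below illustrates masking as an elementwise (Hadamard) modulation of the unmasked attention matrix by a mask induced by pairwise structure, here a toy exponential decay of 1D relative position (a stand-in for a graph-distance kernel). This is a minimal brute-force O(L^2) illustration under our own assumptions; the function name `masked_attention` and the particular mask are illustrative and do not reflect the paper's scalable algorithms, which avoid materializing the mask explicitly.

```python
import numpy as np

def masked_attention(Q, K, V, M):
    """Brute-force masked attention: the unmasked attention scores
    exp(Q K^T / sqrt(d)) are modulated elementwise by a mask M
    (the 'topological modulation'), then row-normalized.
    O(L^2) time and memory; shown only to fix notation."""
    d = Q.shape[-1]
    A = np.exp(Q @ K.T / np.sqrt(d)) * M       # Hadamard modulation by the mask
    A = A / A.sum(axis=-1, keepdims=True)      # row-normalization (softmax denominator)
    return A @ V

# Toy RPE-style mask on a path graph: M[i, j] = exp(-0.5 * |i - j|),
# i.e. exponential decay of the shortest-path distance between tokens.
L, d = 6, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, L, d))
idx = np.arange(L)
M = np.exp(-0.5 * np.abs(idx[:, None] - idx[None, :]))

out = masked_attention(Q, K, V, M)
print(out.shape)  # (6, 4)
```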