Self-attention is a widely used building block in neural modeling for mixing long-range data elements. Most self-attention networks employ pairwise dot-products to specify the attention coefficients. However, these methods incur an $O(N^2)$ computing cost for a sequence of length $N$. Even though several approximation methods have been introduced to relieve the quadratic cost, the performance of the dot-product approach is still bottlenecked by the low-rank constraint in the attention matrix factorization. In this paper, we propose a novel, scalable, and effective mixing building block called Paramixer. Our method factorizes the interaction matrix into several sparse matrices, where the non-zero entries are parameterized by MLPs that take the data elements as input. The overall computing cost of the new building block is as low as $O(N \log N)$. Moreover, all factorizing matrices in Paramixer are full-rank, so it does not suffer from the low-rank bottleneck. We have tested the new method on synthetic and various real-world long sequential data sets and compared it with several state-of-the-art attention networks. The experimental results show that Paramixer achieves better performance in most learning tasks.
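To make the mechanism described above concrete, the following is a minimal sketch, assuming a dyadic (power-of-two offset) sparse connection pattern, two non-zeros per row, and one small MLP per factor; these choices, along with the class and parameter names, are illustrative assumptions and not the published Paramixer construction. The sketch applies $O(\log N)$ sparse factors in sequence, with each factor's non-zero entries predicted from the data elements, so the total mixing cost is $O(N \log N)$.

```python
# Illustrative sketch only: sparse-factor mixing with data-parameterized
# non-zero entries. The dyadic offsets, 2 non-zeros per row, and MLP shapes
# below are assumptions for demonstration, not the paper's exact design.
import torch
import torch.nn as nn


class SparseFactorMixer(nn.Module):
    def __init__(self, dim: int, max_len: int):
        super().__init__()
        # Roughly log2(N) sparse factors cover all positions.
        self.n_factors = max(1, (max_len - 1).bit_length())
        # One small MLP per factor: it maps each token embedding to the
        # values of that row's two non-zero entries.
        self.weight_mlps = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 2))
             for _ in range(self.n_factors)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, dim)
        n = x.size(1)
        idx = torch.arange(n, device=x.device)
        for k, mlp in enumerate(self.weight_mlps):
            offset = 2 ** k                 # dyadic "long-range" link distance
            partner = (idx + offset) % n    # second non-zero column per row
            w = mlp(x)                      # (batch, N, 2): data-driven entries
            # Apply one sparse factor: each row mixes itself with one partner.
            x = w[..., :1] * x + w[..., 1:] * x[:, partner, :]
        return x


# Usage: each factor costs O(N * dim), and there are O(log N) factors,
# so the overall mixing cost is O(N log N) rather than O(N^2).
mixer = SparseFactorMixer(dim=64, max_len=1024)
out = mixer(torch.randn(2, 1024, 64))
```

In this sketch, each loop iteration is equivalent to multiplying by a sparse matrix whose row $i$ has non-zeros at columns $i$ and $(i+2^k) \bmod N$; chaining the factors yields a dense effective interaction matrix without ever forming an $N \times N$ attention matrix or imposing a low-rank factorization.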