Multi-head attention underpins the recent success of transformers, the state-of-the-art models that have achieved remarkable results in sequence modeling and beyond. These attention mechanisms compute the pairwise dot products between the queries and keys, a computation that corresponds to using unnormalized Gaussian kernels under the assumption that the queries follow a mixture of Gaussian distributions. There is no guarantee that this assumption holds in practice. In response, we first interpret attention in transformers as a nonparametric kernel regression. We then propose FourierFormer, a new class of transformers in which the dot-product kernels are replaced by novel generalized Fourier integral kernels. Unlike dot-product kernels, which require choosing a suitable covariance matrix to capture the dependencies among data features, the generalized Fourier integral kernels capture these dependencies automatically, removing the need to tune a covariance matrix. We theoretically prove that our proposed Fourier integral kernels can efficiently approximate any key and query distributions. Compared to conventional transformers with dot-product attention, FourierFormers attain better accuracy and reduce the redundancy between attention heads. We empirically corroborate the advantages of FourierFormers over baseline transformers in a variety of practical applications, including language modeling and image classification.
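To make the contrast concrete, the following is a minimal sketch comparing standard dot-product attention with an attention layer built on a Fourier integral kernel, viewed as nonparametric kernel regression. It assumes the plain product-of-sinc kernel k(q, k) = ∏_j sin(R(q_j − k_j)) / (π(q_j − k_j)) suggested by the Fourier integral theorem; the bandwidth R, the handling of the singularity at zero, and the normalization are illustrative choices, not the paper's exact generalized kernel.

```python
import math
import torch

def dot_product_attention(Q, K, V):
    """Standard softmax attention: exponential of scaled q.k dot products,
    i.e. an unnormalized Gaussian-type kernel between queries and keys."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)
    return torch.softmax(scores, dim=-1) @ V

def fourier_integral_attention(Q, K, V, R=1.0, eps=1e-8):
    """Attention as nonparametric kernel regression with a product-of-sinc kernel:
    k(q, k) = prod_j sin(R (q_j - k_j)) / (pi (q_j - k_j)).
    Each query's output is sum_j k(q, k_j) v_j / sum_j k(q, k_j).
    This is a sketch of the Fourier-integral idea, not the paper's exact layer."""
    diff = Q.unsqueeze(-2) - K.unsqueeze(-3)           # (..., n_q, n_k, d)
    # sin(Rx)/(pi x), using the limit R/pi at x = 0 to avoid division by zero
    sinc = torch.where(diff.abs() < eps,
                       torch.full_like(diff, R / math.pi),
                       torch.sin(R * diff) / (math.pi * diff))
    kernel = sinc.prod(dim=-1)                         # (..., n_q, n_k)
    weights = kernel / (kernel.sum(dim=-1, keepdim=True) + eps)
    return weights @ V

# Toy usage: 2 queries, 4 key/value pairs, feature dimension 8
Q, K, V = torch.randn(2, 8), torch.randn(4, 8), torch.randn(4, 8)
print(dot_product_attention(Q, K, V).shape)        # torch.Size([2, 8])
print(fourier_integral_attention(Q, K, V).shape)   # torch.Size([2, 8])
```

Note that, unlike softmax weights, the sinc kernel can take negative values, so the simple normalization above is only for illustration; the point of the construction is that dependencies across feature dimensions are captured by the coordinate-wise product rather than by a tuned covariance matrix.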