Transformer and its variants are fundamental neural architectures in deep learning. Recent work shows that learning attention in the Fourier space can improve the long-sequence learning capability of Transformers. We argue that the wavelet transform is a better choice because it captures both position and frequency information with linear time complexity. In this paper, we therefore systematically study the synergy between the wavelet transform and Transformers. Specifically, we focus on a new paradigm, WISE, which replaces the attention in Transformers by (1) applying a forward wavelet transform to project the input sequences onto multi-resolution bases, (2) conducting non-linear transformations in the wavelet coefficient space, and (3) reconstructing the representation in the input space via the backward (inverse) wavelet transform. Extensive experiments on the Long Range Arena benchmark demonstrate that learning attention in the wavelet space, using either fixed or adaptive wavelets, consistently improves Transformer performance and significantly outperforms Fourier-based methods.
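As a rough illustration of the three-step forward-transform / coefficient-space mixing / inverse-transform paradigm described above, the following is a minimal sketch (not the authors' implementation): it uses a single-level Haar wavelet transform along the sequence dimension and a small point-wise MLP in the coefficient space, whereas the paper uses multi-resolution and possibly adaptive wavelets. The class name HaarWaveletMixer and the hidden size are hypothetical.

```python
import torch
import torch.nn as nn

class HaarWaveletMixer(nn.Module):
    """Sketch of a WISE-style token mixer: Haar DWT -> non-linear map -> inverse DWT."""

    def __init__(self, d_model: int, hidden: int = 256):
        super().__init__()
        # Non-linear transformation applied in the wavelet coefficient space (step 2).
        self.mix = nn.Sequential(
            nn.Linear(d_model, hidden),
            nn.GELU(),
            nn.Linear(hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); seq_len assumed even for the Haar transform.
        even, odd = x[:, 0::2, :], x[:, 1::2, :]
        approx = (even + odd) / 2 ** 0.5   # low-frequency (approximation) coefficients
        detail = (even - odd) / 2 ** 0.5   # high-frequency (detail) coefficients

        # Step 2: transform the coefficients (same MLP applied to both bands here).
        approx, detail = self.mix(approx), self.mix(detail)

        # Step 3: inverse Haar transform reconstructs the sequence in the input space.
        even_rec = (approx + detail) / 2 ** 0.5
        odd_rec = (approx - detail) / 2 ** 0.5
        out = torch.empty_like(x)
        out[:, 0::2, :], out[:, 1::2, :] = even_rec, odd_rec
        return out

# Usage: replace the attention sub-layer of a Transformer block with this mixer.
layer = HaarWaveletMixer(d_model=64)
y = layer(torch.randn(2, 128, 64))  # (batch=2, seq_len=128, d_model=64)
```

Since the Haar transform and its inverse are O(n) in the sequence length and the MLP acts point-wise on coefficients, the whole layer runs in linear time, which is the complexity advantage the abstract claims over quadratic attention.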