Transformers have emerged as a preferred model for many tasks in natural language processing and vision. Recent efforts on training and deploying Transformers more efficiently have identified many strategies to approximate the self-attention matrix, a key module in a Transformer architecture. Effective ideas include various prespecified sparsity patterns, low-rank basis expansions, and combinations thereof. In this paper, we revisit classical Multiresolution Analysis (MRA) concepts such as Wavelets, whose potential value in this setting remains underexplored thus far. We show that simple approximations, based on empirical feedback and design choices informed by modern hardware and implementation challenges, eventually yield an MRA-based approach to self-attention with an excellent performance profile across most criteria of interest. We undertake an extensive set of experiments and demonstrate that this multi-resolution scheme outperforms most existing efficient self-attention proposals and performs favorably on both short and long sequences. Code is available at \url{https://github.com/mlpen/mra-attention}.
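To make the coarse-to-fine intuition behind a multiresolution approximation concrete, the following is a minimal, illustrative sketch (not the authors' implementation; see the linked repository for that). It approximates a toy attention matrix by a block-averaged coarse view and then refines only the highest-mass blocks at full resolution; names such as `block` and `budget` are illustrative assumptions, not the paper's API.

\begin{verbatim}
# Toy sketch of a coarse-to-fine (multiresolution-style) approximation of an
# attention matrix. Illustrative only; not the method or code from the paper.
import torch

def toy_mra_attention_matrix(Q, K, block=16, budget=8):
    """Approximate softmax(QK^T / sqrt(d)) by a block-averaged coarse view,
    refining only the `budget` blocks with the largest coarse mass."""
    n, d = Q.shape
    scores = Q @ K.T / d**0.5          # exact scores, computed here only for the demo
    A = torch.softmax(scores, dim=-1)  # exact attention matrix (reference)

    nb = n // block
    # Coarse view: average each (block x block) tile into a single value.
    coarse = A.reshape(nb, block, nb, block).mean(dim=(1, 3))

    # Approximation: every entry starts from its tile's coarse average ...
    approx = coarse.repeat_interleave(block, 0).repeat_interleave(block, 1)

    # ... then the `budget` tiles with the largest coarse mass are refined exactly.
    flat = coarse.flatten()
    top = torch.topk(flat, k=min(budget, flat.numel())).indices
    for idx in top:
        i, j = divmod(idx.item(), nb)
        approx[i*block:(i+1)*block, j*block:(j+1)*block] = \
            A[i*block:(i+1)*block, j*block:(j+1)*block]
    return A, approx

if __name__ == "__main__":
    torch.manual_seed(0)
    Q, K = torch.randn(128, 64), torch.randn(128, 64)
    A, approx = toy_mra_attention_matrix(Q, K)
    err = (A - approx).abs().mean() / A.abs().mean()
    print(f"relative L1 error of coarse + refined approximation: {err:.3f}")
\end{verbatim}

In an efficient implementation, the refined blocks would of course be selected and computed without ever forming the full attention matrix; the sketch only illustrates how a coarse resolution plus a small refinement budget can capture most of the matrix.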