Efficient Transformers have been developed for long sequence modeling, owing to their subquadratic memory and time complexity. The sparse Transformer is a popular approach to improving Transformer efficiency: it restricts self-attention to locations specified by predefined sparse patterns. However, leveraging sparsity may sacrifice expressiveness relative to full attention when important token correlations lie multiple hops away. To combine the efficiency of sparse Transformers with the expressiveness of full-attention Transformers, we propose \textit{Diffuser}, a new state-of-the-art efficient Transformer. Diffuser incorporates all token interactions within one attention layer while maintaining low computation and memory costs. The key idea is to expand the receptive field of sparse attention using Attention Diffusion, which, beyond attention among neighboring tokens, computes multi-hop token correlations based on all paths between the corresponding disconnected tokens. Theoretically, we show that Diffuser is a universal sequence approximator for sequence-to-sequence modeling, and we investigate its ability to approximate full attention by analyzing the graph expander property from a spectral perspective. Experimentally, we evaluate Diffuser extensively on language modeling, image modeling, and the Long Range Arena (LRA). The results show that Diffuser achieves average improvements of 0.94\% on text classification tasks and 2.30\% on LRA, with 1.67$\times$ memory savings over state-of-the-art benchmarks, demonstrating Diffuser's superiority in both expressiveness and efficiency.
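To make the multi-hop idea concrete, here is a minimal sketch in illustrative notation (the symbols $A$, $\theta_k$, and $K$ are our assumptions for exposition, not necessarily the paper's definitions): let $A$ denote the row-normalized sparse attention matrix over tokens. Attention Diffusion can then be viewed as a weighted power series over the sparse attention graph,
\[
\widehat{A} \;=\; \sum_{k=0}^{K} \theta_k A^k, \qquad \theta_k \ge 0, \quad \sum_{k=0}^{K} \theta_k = 1,
\]
so that the entry $\widehat{A}_{ij}$ aggregates contributions from all paths of length at most $K$ between tokens $i$ and $j$. Token pairs that are disconnected in the predefined sparse pattern thereby receive a nonzero effective attention weight, while the per-layer cost remains close to that of the underlying sparse attention.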