Transformer-based models have achieved great success in various NLP, vision, and speech tasks. However, the core of Transformer, the self-attention mechanism, has quadratic time and memory complexity with respect to the sequence length, which hinders applying Transformer-based models to long sequences. Many approaches have been proposed to mitigate this problem, such as sparse attention mechanisms, low-rank matrix approximations and scalable kernels, and token mixing alternatives to self-attention. We propose a novel Pooling Network (PoNet) for token mixing in long sequences with linear complexity. We design multi-granularity pooling and pooling fusion to capture different levels of contextual information and combine their interactions with tokens. On the Long Range Arena benchmark, PoNet significantly outperforms Transformer and achieves competitive accuracy, while being only slightly slower than the fastest model, FNet, across all sequence lengths measured on GPUs. We also conduct systematic studies on the transfer learning capability of PoNet and observe that PoNet achieves 96.0% of the accuracy of BERT on the GLUE benchmark, outperforming FNet by 4.5% relative. Comprehensive ablation analysis demonstrates the effectiveness of the designed multi-granularity pooling and pooling fusion for token mixing in long sequences and the efficacy of the designed pre-training tasks for PoNet to learn transferable contextualized language representations.
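To make the pooling-based token-mixing idea concrete, the following is a minimal sketch of a linear-complexity mixing layer that combines a global pooled vector with local sliding-window pooling and fuses both back into the token representations. It is an illustration under stated assumptions, not the paper's exact design: the choice of a single global average pool plus a local max pool, the element-wise fusion rule, and names such as MultiGranularityPooling and window_size are all hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiGranularityPooling(nn.Module):
    """Illustrative pooling-based token mixing, O(N) in sequence length."""

    def __init__(self, d_model: int, window_size: int = 3):
        super().__init__()
        self.window_size = window_size  # hypothetical local pooling window
        self.proj_global = nn.Linear(d_model, d_model)
        self.proj_local = nn.Linear(d_model, d_model)
        self.proj_token = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        # Global granularity: one average-pooled vector per sequence.
        g = self.proj_global(x).mean(dim=1, keepdim=True)
        # Local granularity: sliding-window max pooling over the sequence axis.
        local = F.max_pool1d(
            self.proj_local(x).transpose(1, 2),
            kernel_size=self.window_size,
            stride=1,
            padding=self.window_size // 2,
        ).transpose(1, 2)
        tokens = self.proj_token(x)
        # Fusion (assumed form): multiply the shared global vector with each token
        # so different tokens receive different global context, then add local context.
        return tokens * g + local


if __name__ == "__main__":
    layer = MultiGranularityPooling(d_model=64)
    out = layer(torch.randn(2, 128, 64))
    print(out.shape)  # torch.Size([2, 128, 64])

Every operation here touches each token a constant number of times, which is where the linear time and memory scaling comes from, in contrast to the quadratic pairwise interactions of self-attention.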