Transformers based on the attention mechanism have achieved impressive success in various areas. However, the attention mechanism has quadratic complexity, which significantly impedes Transformers from processing numerous tokens and scaling up to larger models. Previous methods mainly exploit similarity decomposition and the associativity of matrix multiplication to devise linear-time attention mechanisms. To keep attention from degenerating into a trivial distribution, they reintroduce inductive biases such as locality, at the expense of model generality and expressiveness. In this paper, we linearize Transformers free of specific inductive biases by building on flow network theory. We cast attention as information flow aggregated from the sources (values) to the sinks (results) through learned flow capacities (attentions). Within this framework, we apply the property of flow conservation to attention and propose the Flow-Attention mechanism of linear complexity. By respectively conserving the incoming flow of sinks for source competition and the outgoing flow of sources for sink allocation, Flow-Attention inherently generates informative attentions without relying on specific inductive biases. Empowered by Flow-Attention, Flowformer yields strong performance in linear time across a wide range of areas, including long sequences, time series, vision, natural language, and reinforcement learning.
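To make the idea concrete, below is a minimal PyTorch sketch of a Flow-Attention-style linear attention written from the description above: a non-negative feature map (sigmoid here) provides the flow capacities, incoming and outgoing flows are computed in linear time via the associativity of matrix multiplication, and conservation drives a softmax-based competition among sources and a sigmoid-based allocation to sinks. The function name, tensor layout, and exact normalizations are illustrative assumptions, not the published implementation.

```python
import torch

def flow_attention(q, k, v, eps=1e-6):
    """Sketch of a linear-time, Flow-Attention-style mechanism.

    q, k, v: tensors of shape (batch, length, dim). The feature map and
    the precise competition/allocation nonlinearities are assumptions
    made for illustration.
    """
    n = k.shape[1]
    # Non-negative feature map: learned flow capacities.
    phi_q, phi_k = torch.sigmoid(q), torch.sigmoid(k)

    # Incoming flow of each sink i:  I_i = phi(q_i) . sum_j phi(k_j)
    incoming = torch.einsum("bld,bd->bl", phi_q, phi_k.sum(dim=1)) + eps
    # Outgoing flow of each source j: O_j = phi(k_j) . sum_i phi(q_i)
    outgoing = torch.einsum("bld,bd->bl", phi_k, phi_q.sum(dim=1)) + eps

    # Conservation: recompute each flow with the opposite side normalized,
    # so that the conserved counterpart sums to a fixed budget.
    incoming_c = torch.einsum(
        "bld,bd->bl", phi_q, (phi_k / outgoing[..., None]).sum(dim=1)) + eps
    outgoing_c = torch.einsum(
        "bld,bd->bl", phi_k, (phi_q / incoming[..., None]).sum(dim=1)) + eps

    # Competition among sources for the conserved incoming flow of sinks.
    v_competed = torch.softmax(outgoing_c, dim=-1)[..., None] * n * v

    # Linear-time aggregation via associativity: phi(Q) (phi(K)^T V) / I.
    kv = torch.einsum("bld,ble->bde", phi_k, v_competed)
    aggregated = torch.einsum("bld,bde->ble", phi_q, kv) / incoming[..., None]

    # Allocation to sinks: gate each sink by its conserved incoming flow.
    return torch.sigmoid(incoming_c)[..., None] * aggregated
```

Because values are contracted with keys before being multiplied by queries, both memory and compute stay linear in sequence length, while the competition and allocation terms play the role that softmax normalization plays in quadratic attention.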