Transformers based on the attention mechanism have achieved impressive success in various areas. However, the attention mechanism has quadratic complexity, which significantly impedes Transformers from handling numerous tokens and scaling up to larger models. Previous methods mainly exploit similarity decomposition and the associativity of matrix multiplication to devise linear-time attention mechanisms. To keep attention from degenerating into a trivial distribution, they reintroduce inductive biases such as locality, at the expense of model generality and expressiveness. In this paper, we linearize Transformers free from specific inductive biases based on flow network theory. We cast attention as the information flow aggregated from the sources (values) to the sinks (results) through the learned flow capacities (attentions). Within this framework, we apply the property of flow conservation to attention and propose the Flow-Attention mechanism with linear complexity. By respectively conserving the incoming flow of sinks for source competition and the outgoing flow of sources for sink allocation, Flow-Attention inherently generates informative attentions without relying on specific inductive biases. Empowered by Flow-Attention, Flowformer yields strong performance in linear time across a wide range of areas, including long sequences, time series, vision, natural language, and reinforcement learning. The code and settings are available at this repository: https://github.com/thuml/Flowformer.
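To make the flow-conservation idea concrete, below is a minimal PyTorch sketch of a linear-time attention step in the spirit of the abstract: a non-negative feature map (sigmoid here, as an assumption) acts as the flow capacity, incoming/outgoing flows are conserved, a softmax over conserved outgoing flows drives source competition, a sigmoid over conserved incoming flows drives sink allocation, and associativity of matrix products yields O(L d^2) cost. The function name `flow_attention` and the exact kernel, scaling, and epsilon choices are illustrative assumptions; see the official repository above for the authors' implementation.

```python
import torch


def flow_attention(q, k, v, eps=1e-6):
    """Sketch of a linear-time flow-conservation attention step.

    q, k, v: tensors of shape (batch, length, dim).
    NOTE: illustrative sketch; kernel and scaling choices may differ from
    the official Flowformer code.
    """
    # Non-negative feature map playing the role of learned flow capacities.
    phi_q, phi_k = torch.sigmoid(q), torch.sigmoid(k)

    # Incoming flow of each sink i:  I_i = phi(Q_i) . sum_j phi(K_j)
    incoming = phi_q @ phi_k.sum(dim=1, keepdim=True).transpose(1, 2) + eps   # (B, L, 1)
    # Outgoing flow of each source j: O_j = phi(K_j) . sum_i phi(Q_i)
    outgoing = phi_k @ phi_q.sum(dim=1, keepdim=True).transpose(1, 2) + eps   # (B, L, 1)

    # Flow conservation: re-normalize so each sink receives and each source
    # emits a fixed budget of flow.
    conserved_in = phi_q @ (phi_k / outgoing).sum(dim=1, keepdim=True).transpose(1, 2)   # (B, L, 1)
    conserved_out = phi_k @ (phi_q / incoming).sum(dim=1, keepdim=True).transpose(1, 2)  # (B, L, 1)

    # Source competition (softmax over conserved outgoing flow) and
    # sink allocation (sigmoid over conserved incoming flow).
    competition = torch.softmax(conserved_out, dim=1) * v.shape[1]
    allocation = torch.sigmoid(conserved_in)

    # Linear-time aggregation: phi(Q) (phi(K)^T V) costs O(L d^2), not O(L^2 d).
    kv = phi_k.transpose(1, 2) @ (v * competition)        # (B, d, d)
    return (phi_q @ kv) / incoming * allocation           # (B, L, d)


if __name__ == "__main__":
    x = torch.randn(2, 128, 64)
    out = flow_attention(x, x, x)
    print(out.shape)  # torch.Size([2, 128, 64])
```

Because the sequence length never appears quadratically (only `(B, d, d)` intermediates are formed), the cost grows linearly with the number of tokens, which is the scaling property the abstract claims.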