Self-attention has recently been adopted for a wide range of sequence modeling problems. Despite its effectiveness, self-attention suffers from quadratic compute and memory requirements with respect to sequence length. Successful approaches to reduce this complexity focused on attending to local sliding windows or a small set of locations independent of content. Our work proposes to learn dynamic sparse attention patterns that avoid allocating computation and memory to attend to content unrelated to the query of interest. This work builds upon two lines of research: it combines the modeling flexibility of prior work on content-based sparse attention with the efficiency gains from approaches based on local, temporal sparse attention. Our model, the Routing Transformer, endows self-attention with a sparse routing module based on online k-means while reducing the overall complexity of attention to $O\left(n^{1.5}d\right)$ from $O\left(n^2d\right)$ for sequence length $n$ and hidden dimension $d$. We show that our model outperforms comparable sparse attention models on language modeling on Wikitext-103 (15.8 vs 18.3 perplexity) as well as on image generation on ImageNet-64 (3.43 vs 3.44 bits/dim) while using fewer self-attention layers. Additionally, we set a new state-of-the-art on the newly released PG-19 data-set, obtaining a test perplexity of 33.2 with a 22 layer Routing Transformer model trained on sequences of length 8192.
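To make the routing idea concrete, the following is a minimal illustrative sketch (not the paper's implementation): queries and keys are assigned to k-means centroids, and each query attends only to keys routed to the same cluster. The function name `routing_attention`, the NumPy setting, and the single offline assignment step are assumptions for illustration; the actual model uses shared query/key projections, online centroid updates during training, and balanced cluster sizes.

```python
import numpy as np

def routing_attention(Q, K, V, num_clusters):
    """Hypothetical sketch of content-based routing attention:
    each query attends only to keys assigned to its own cluster,
    rather than to all n positions."""
    n, d = Q.shape
    # Initialize centroids from randomly chosen keys (the paper instead
    # maintains centroids with online k-means updates during training).
    rng = np.random.default_rng(0)
    centroids = K[rng.choice(n, num_clusters, replace=False)]

    # Assign every query and key to its nearest centroid.
    q_clusters = np.argmin(((Q[:, None] - centroids) ** 2).sum(-1), axis=1)
    k_clusters = np.argmin(((K[:, None] - centroids) ** 2).sum(-1), axis=1)

    out = np.zeros_like(V)
    for c in range(num_clusters):
        q_idx = np.where(q_clusters == c)[0]
        k_idx = np.where(k_clusters == c)[0]
        if len(q_idx) == 0 or len(k_idx) == 0:
            continue
        # Dense attention restricted to the members of cluster c.
        scores = Q[q_idx] @ K[k_idx].T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[q_idx] = weights @ V[k_idx]
    return out

# With num_clusters ~ sqrt(n), each cluster holds ~sqrt(n) positions on
# average, so the cost is roughly O(n^1.5 d) rather than O(n^2 d).
n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = routing_attention(Q, K, V, num_clusters=int(np.sqrt(n)))
```

Choosing the number of clusters on the order of $\sqrt{n}$ is what yields the stated $O\left(n^{1.5}d\right)$ complexity: each of the $n$ queries attends to roughly $\sqrt{n}$ keys of dimension $d$.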