Transformer architectures are now central to modeling in natural language processing tasks. At their core is the attention mechanism, which enables effective modeling of long-term dependencies in a sequence. Recently, transformers have been successfully applied in the computer vision domain, where 2D images are first segmented into patches and then treated as 1D sequences. Such linearization, however, impairs the notion of spatial locality in images, which carries important visual cues. To bridge the gap, we propose ripple attention, a sub-quadratic attention mechanism for visual perception. In ripple attention, contributions of different tokens to a query are weighted with respect to their relative spatial distances in the 2D space. To favor correlations with vicinal tokens yet permit long-term dependencies, we derive the spatial weights through a stick-breaking transformation. We further design a dynamic programming algorithm that computes weighted contributions for all queries in linear observed time, taking advantage of the summed-area table and recent advances in linearized attention. Extensive experiments and analyses demonstrate the effectiveness of ripple attention on various visual tasks.
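The two computational ingredients named above, a stick-breaking transformation that produces distance-dependent weights and a summed-area table that aggregates tokens at a given spatial distance in constant time per box, can be illustrated with a minimal sketch. The code below is not the paper's algorithm; the helper names (`stick_breaking_weights`, `box_sum`), the Chebyshev-distance ring indexing, and the constant breaking fractions of 0.5 are illustrative assumptions.

```python
import numpy as np

def stick_breaking_weights(betas):
    """Turn breaking fractions beta_k in (0, 1) into weights summing to <= 1.
    w_k = beta_k * prod_{j<k} (1 - beta_j): closer rings get larger weights,
    but every ring keeps some mass, permitting long-range interaction."""
    betas = np.asarray(betas, dtype=np.float64)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * remaining

def summed_area_table(x):
    """2D prefix sums: sat[i, j] = sum of x[:i, :j]."""
    sat = np.zeros((x.shape[0] + 1, x.shape[1] + 1), dtype=np.float64)
    sat[1:, 1:] = np.cumsum(np.cumsum(x, axis=0), axis=1)
    return sat

def box_sum(sat, r0, c0, r1, c1):
    """Sum of x[r0:r1, c0:c1] in O(1) via the summed-area table."""
    return sat[r1, c1] - sat[r0, c1] - sat[r1, c0] + sat[r0, c0]

# Toy usage: aggregate per-token values on an H x W grid, weighting each
# Chebyshev-distance ring around a query at (qr, qc) by its stick-breaking weight.
H, W = 8, 8
values = np.random.rand(H, W)           # stand-in for per-token statistics
sat = summed_area_table(values)
qr, qc = 3, 4
max_ring = max(qr, H - 1 - qr, qc, W - 1 - qc)
weights = stick_breaking_weights(np.full(max_ring + 1, 0.5))  # assumed fractions

out = 0.0
for k, w_k in enumerate(weights):
    # Ring k = box of "radius" k minus box of radius k-1, clipped to the grid.
    r0, r1 = max(qr - k, 0), min(qr + k + 1, H)
    c0, c1 = max(qc - k, 0), min(qc + k + 1, W)
    outer = box_sum(sat, r0, c0, r1, c1)
    if k == 0:
        inner = 0.0
    else:
        pr0, pr1 = max(qr - k + 1, 0), min(qr + k, H)
        pc0, pc1 = max(qc - k + 1, 0), min(qc + k, W)
        inner = box_sum(sat, pr0, pc0, pr1, pc1)
    out += w_k * (outer - inner)
```

In the actual mechanism the quantities being box-summed are the key-value statistics of a linearized attention formulation rather than raw scalars, which is what allows all queries to be processed within an overall linear observed time.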