As the core building block of vision transformers, attention is a powerful tool to capture long-range dependencies. However, such power comes at a cost: it incurs a huge computational burden and heavy memory footprint, as pairwise token interactions across all spatial locations are computed. A series of works attempt to alleviate this problem by introducing handcrafted and content-agnostic sparsity into attention, such as restricting the attention operation to be inside local windows, axial stripes, or dilated windows. In contrast to these approaches, we propose a novel dynamic sparse attention via bi-level routing to enable a more flexible allocation of computations with content awareness. Specifically, for a query, irrelevant key-value pairs are first filtered out at a coarse region level, and then fine-grained token-to-token attention is applied in the union of the remaining candidate regions (\ie, routed regions). We provide a simple yet effective implementation of the proposed bi-level routing attention, which utilizes the sparsity to save both computation and memory while involving only GPU-friendly dense matrix multiplications. Built with the proposed bi-level routing attention, a new general vision transformer, named BiFormer, is then presented. As BiFormer attends to a small subset of relevant tokens in a \textbf{query adaptive} manner without distraction from other irrelevant ones, it enjoys both good performance and high computational efficiency, especially in dense prediction tasks. Empirical results across several computer vision tasks such as image classification, object detection, and semantic segmentation verify the effectiveness of our design. Code is available at \url{https://github.com/rayleizhu/BiFormer}.
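To make the two-level procedure concrete, below is a minimal PyTorch sketch of bi-level routing attention, assuming a square feature map partitioned into $S \times S$ regions. The function name and the \texttt{S} and \texttt{topk} parameters are illustrative placeholders rather than the authors' released implementation; refer to the linked repository for the official code.

\begin{verbatim}
# A minimal, illustrative sketch of bi-level routing attention (not the
# official BiFormer implementation); region count S and topk are assumptions.
import torch

def bi_level_routing_attention(q, k, v, S=7, topk=4):
    """q, k, v: (B, H, W, C) maps already projected to queries/keys/values."""
    B, H, W, C = q.shape
    rh, rw = H // S, W // S                    # tokens per region along each axis
    n_reg, n_tok = S * S, rh * rw              # number of regions, tokens per region

    # rearrange into a (B, n_reg, n_tok, C) region-wise token layout
    def to_regions(x):
        x = x.view(B, S, rh, S, rw, C).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(B, n_reg, n_tok, C)

    qr_tok, kr_tok, vr_tok = map(to_regions, (q, k, v))

    # coarse level: region-level queries/keys via average pooling over tokens
    qr = qr_tok.mean(dim=2)                    # (B, n_reg, C)
    kr = kr_tok.mean(dim=2)                    # (B, n_reg, C)
    affinity = qr @ kr.transpose(-1, -2)       # region-to-region affinity graph
    idx = affinity.topk(topk, dim=-1).indices  # routing index: (B, n_reg, topk)

    # gather the key/value tokens of the routed regions for each query region
    idx_exp = idx[..., None, None].expand(-1, -1, -1, n_tok, C)
    kg = torch.gather(kr_tok[:, None].expand(-1, n_reg, -1, -1, -1), 2, idx_exp)
    vg = torch.gather(vr_tok[:, None].expand(-1, n_reg, -1, -1, -1), 2, idx_exp)
    kg = kg.reshape(B, n_reg, topk * n_tok, C)
    vg = vg.reshape(B, n_reg, topk * n_tok, C)

    # fine level: token-to-token attention restricted to the routed key-value set
    attn = (qr_tok @ kg.transpose(-1, -2)) * (C ** -0.5)
    out = attn.softmax(dim=-1) @ vg            # (B, n_reg, n_tok, C)

    # restore the (B, H, W, C) spatial layout
    out = out.view(B, S, S, rh, rw, C).permute(0, 1, 3, 2, 4, 5)
    return out.reshape(B, H, W, C)
\end{verbatim}

Note that after the gather step, the fine-grained attention is computed as dense batched matrix multiplications over a small routed key-value set, which is how the sparsity translates into computation and memory savings while remaining GPU-friendly.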