Although Transformers have successfully transitioned from their language-modelling origins to image-based applications, their quadratic computational complexity remains a challenge, particularly for dense prediction. In this paper we propose a content-based sparse-attention method as an alternative to dense self-attention, aiming to reduce computational complexity while retaining the ability to model long-range dependencies. Specifically, we cluster and then aggregate key and value tokens, as a content-based means of reducing the total token count. The resulting clustered-token sequence retains the semantic diversity of the original signal but can be processed at a lower computational cost. Moreover, we extend the clustering-guided attention from single-scale to multi-scale, which benefits dense prediction tasks. We dub the proposed Transformer architecture ClusTR and demonstrate that it achieves state-of-the-art performance on various vision tasks, at a lower computational cost and with fewer parameters. For instance, our ClusTR small model with 22.7M parameters achieves 83.2\% Top-1 accuracy on ImageNet. Source code and ImageNet models will be made publicly available.
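The core idea (cluster keys and values by content, aggregate each cluster, then attend over the reduced set) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: it assumes a simple k-means clustering and mean aggregation, single head, and NumPy in place of a deep-learning framework; all function names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kmeans_assign(x, k, iters=10, seed=0):
    # Plain k-means over token embeddings; returns a cluster id per token.
    # (A stand-in for whatever content-based clustering the paper uses.)
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # leave empty clusters' centers unchanged
                centers[j] = members.mean(axis=0)
    return assign

def clustered_attention(q, k, v, num_clusters):
    """Attention where keys/values are replaced by per-cluster aggregates.

    q: (N, d) queries; k, v: (M, d) keys/values.
    Cost of the score matrix drops from O(N*M) to O(N*C), C = num_clusters.
    """
    assign = kmeans_assign(k, num_clusters)
    ids = [j for j in range(num_clusters) if (assign == j).any()]
    # Aggregate (here: mean) keys and values within each non-empty cluster.
    k_c = np.stack([k[assign == j].mean(axis=0) for j in ids])
    v_c = np.stack([v[assign == j].mean(axis=0) for j in ids])
    scores = q @ k_c.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v_c
```

Queries still attend globally, but only to `C` aggregated tokens rather than all `M` originals, which is where the computational saving comes from.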