Vision Transformers (ViTs) have triggered the most significant recent breakthroughs in computer vision. Efficient ViT designs are mostly guided by an indirect metric of computational complexity, i.e., FLOPs, which, however, shows a clear gap with direct metrics such as throughput. We therefore propose using direct speed evaluation on the target platform as the design principle for efficient ViTs. In particular, we introduce LITv2, a simple and effective ViT that performs favourably against existing state-of-the-art methods across a range of model sizes while running faster. At the core of LITv2 is a novel self-attention mechanism, which we dub HiLo. HiLo is inspired by the insight that high frequencies in an image capture local fine details and low frequencies focus on global structures, whereas a standard multi-head self-attention layer ignores the distinct characteristics of different frequencies. We therefore propose to disentangle the high- and low-frequency patterns in an attention layer by separating the heads into two groups: one group encodes high frequencies via self-attention within each local window, and the other group encodes low frequencies by performing global attention between the average-pooled low-frequency keys and values from each window and each query position in the input feature map. Benefiting from the efficient design of both groups, we show that HiLo is superior to existing attention mechanisms by comprehensively benchmarking FLOPs, speed and memory consumption on GPUs and CPUs. For example, HiLo is 1.4× faster than spatial reduction attention and 1.6× faster than local window attention on CPUs. Powered by HiLo, LITv2 serves as a strong backbone for mainstream vision tasks including image classification, dense detection and segmentation. Code is available at https://github.com/ziplab/LITv2.
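The two head groups described in the abstract can be illustrated with a minimal NumPy sketch. This is not the code from the LITv2 repository: the function names (`hilo_attention`, `init_params`, `to_windows`, `from_windows`), the random projection weights, the head split and the window size are all assumptions chosen for illustration, and the sketch omits output projections, positional encoding and batching. It shows only the routing: one group of heads attends within each local window, while the other lets every query position attend to one average-pooled token per window.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the last two axes.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def split_heads(t, heads):
    # (..., N, heads*d) -> (..., heads, N, d)
    *lead, n, c = t.shape
    return np.swapaxes(t.reshape(*lead, n, heads, c // heads), -2, -3)

def merge_heads(t):
    # (..., heads, N, d) -> (..., N, heads*d)
    t = np.swapaxes(t, -2, -3)
    *lead, n, h, d = t.shape
    return t.reshape(*lead, n, h * d)

def to_windows(x, s):
    # (H, W, C) -> (num_windows, s*s, C), non-overlapping s x s windows.
    H, W, C = x.shape
    x = x.reshape(H // s, s, W // s, s, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, s * s, C)

def from_windows(w, H, W, s):
    # Inverse of to_windows: (num_windows, s*s, C) -> (H, W, C).
    C = w.shape[-1]
    w = w.reshape(H // s, W // s, s, s, C).transpose(0, 2, 1, 3, 4)
    return w.reshape(H, W, C)

def init_params(C, hi_heads, lo_heads, rng):
    # Hypothetical random q/k/v projections for each head group.
    d = C // (hi_heads + lo_heads)
    widths = {"hi": hi_heads * d, "lo": lo_heads * d}
    return {f"{g}_{p}": rng.standard_normal((C, n)) / np.sqrt(C)
            for g, n in widths.items() for p in "qkv"}

def hilo_attention(x, params, hi_heads, lo_heads, s):
    H, W, C = x.shape
    win = to_windows(x, s)                      # (nW, s*s, C)

    # High-frequency group: self-attention inside each local window.
    q = split_heads(win @ params["hi_q"], hi_heads)
    k = split_heads(win @ params["hi_k"], hi_heads)
    v = split_heads(win @ params["hi_v"], hi_heads)
    hi = from_windows(merge_heads(attention(q, k, v)), H, W, s)

    # Low-frequency group: every query position attends to the
    # average-pooled token of each window (keys and values).
    pooled = win.mean(axis=1)                   # (nW, C)
    q = split_heads(x.reshape(H * W, C) @ params["lo_q"], lo_heads)
    k = split_heads(pooled @ params["lo_k"], lo_heads)
    v = split_heads(pooled @ params["lo_v"], lo_heads)
    lo = merge_heads(attention(q, k, v)).reshape(H, W, -1)

    # Concatenate the two groups along the channel axis.
    return np.concatenate([hi, lo], axis=-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 8, 16))         # toy feature map (H, W, C)
    params = init_params(16, hi_heads=2, lo_heads=2, rng=rng)
    out = hilo_attention(x, params, hi_heads=2, lo_heads=2, s=2)
    print(out.shape)
```

In this sketch the low-frequency group computes attention over `H*W` queries and only `(H/s)*(W/s)` keys, and the high-frequency group over `s*s` tokens per window, which is where the efficiency described in the abstract would come from. How that translates into throughput on a given device has to be measured directly, as the abstract argues.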