Vision Transformers (ViTs) have triggered the most recent and significant breakthroughs in computer vision. Their efficient designs are mostly guided by the indirect metric of computational complexity, i.e., FLOPs, which, however, has a clear gap with direct metrics such as throughput. Thus, we propose to use the direct speed evaluation on the target platform as the design principle for efficient ViTs. In particular, we introduce LITv2, a simple and effective ViT which performs favourably against existing state-of-the-art methods across a spectrum of model sizes while running faster. At the core of LITv2 is a novel self-attention mechanism, which we dub HiLo. HiLo is inspired by the insight that high frequencies in an image capture local fine details and low frequencies focus on global structures, whereas a standard multi-head self-attention layer neglects this frequency characteristic. Therefore, we propose to disentangle the high/low frequency patterns in an attention layer by separating the heads into two groups, where one group encodes high frequencies via self-attention within each local window, and the other group performs attention to model the global relationship between the average-pooled low-frequency keys from each window and each query position in the input feature map. Benefiting from the efficient design of both groups, we show that HiLo is superior to existing attention mechanisms through comprehensive benchmarking of FLOPs, speed, and memory consumption on GPUs. Powered by HiLo, LITv2 serves as a strong backbone for mainstream vision tasks including image classification, dense detection and segmentation. Code is available at https://github.com/zip-group/LITv2.
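To make the head-splitting idea concrete, below is a minimal PyTorch sketch of HiLo attention as described above: a ratio `alpha` assigns heads to a low-frequency group (queries over the full feature map, keys/values average-pooled per window) and a high-frequency group (self-attention within each non-overlapping window). The class name `HiLoSketch` and its exact tensor layout are illustrative assumptions; the official implementation in the linked repository differs in details (e.g., padding and projection handling).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiLoSketch(nn.Module):
    """Minimal sketch of HiLo attention (simplified; not the official code).
    Assumes H and W are divisible by window_size and both head groups are non-empty."""
    def __init__(self, dim, num_heads=8, window_size=2, alpha=0.5):
        super().__init__()
        head_dim = dim // num_heads
        self.ws = window_size
        # alpha controls the head split: Lo-Fi gets alpha * num_heads heads.
        self.l_heads = int(num_heads * alpha)      # low-frequency heads
        self.h_heads = num_heads - self.l_heads    # high-frequency heads
        self.l_dim = self.l_heads * head_dim
        self.h_dim = self.h_heads * head_dim
        self.scale = head_dim ** -0.5
        self.h_qkv = nn.Linear(dim, self.h_dim * 3)
        self.h_proj = nn.Linear(self.h_dim, self.h_dim)
        self.l_q = nn.Linear(dim, self.l_dim)
        self.l_kv = nn.Linear(dim, self.l_dim * 2)
        self.l_proj = nn.Linear(self.l_dim, self.l_dim)

    def hifi(self, x):
        # High-frequency path: self-attention inside each ws x ws local window.
        B, H, W, C = x.shape
        hg, wg = H // self.ws, W // self.ws
        n_win, n_tok = hg * wg, self.ws * self.ws
        x = x.reshape(B, hg, self.ws, wg, self.ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, n_win, n_tok, C)
        qkv = self.h_qkv(x).reshape(B, n_win, n_tok, 3, self.h_heads, -1)
        q, k, v = qkv.permute(3, 0, 1, 4, 2, 5)  # each: (B, n_win, heads, n_tok, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(2, 3).reshape(B, n_win, n_tok, self.h_dim)
        out = out.reshape(B, hg, wg, self.ws, self.ws, self.h_dim)
        out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, self.h_dim)
        return self.h_proj(out)

    def lofi(self, x):
        # Low-frequency path: every query attends to one average-pooled
        # key/value per window, giving a short global sequence.
        B, H, W, C = x.shape
        N = H * W
        q = self.l_q(x.reshape(B, N, C)).reshape(B, N, self.l_heads, -1).transpose(1, 2)
        pooled = F.avg_pool2d(x.permute(0, 3, 1, 2), self.ws)   # (B, C, H/ws, W/ws)
        pooled = pooled.flatten(2).transpose(1, 2)               # (B, n_win, C)
        kv = self.l_kv(pooled).reshape(B, -1, 2, self.l_heads, self.l_dim // self.l_heads)
        k, v = kv.permute(2, 0, 3, 1, 4)  # each: (B, heads, n_win, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, H, W, self.l_dim)
        return self.l_proj(out)

    def forward(self, x):
        # x: (B, H, W, C); concatenate the two head groups along channels.
        return torch.cat([self.hifi(x), self.lofi(x)], dim=-1)

if __name__ == "__main__":
    x = torch.randn(2, 14, 14, 64)  # H and W divisible by window_size
    out = HiLoSketch(dim=64, num_heads=8, window_size=2, alpha=0.5)(x)
    print(out.shape)                # torch.Size([2, 14, 14, 64])
```

Note that the Lo-Fi path is where the speed benefit comes from: its key/value sequence length shrinks by a factor of `window_size**2`, so the global attention cost drops accordingly while the Hi-Fi windows keep attention local and cheap.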