Vision Transformers (ViT) have made many breakthroughs in computer vision tasks. However, considerable redundancy arises in the spatial dimension of an input image, leading to massive computational costs. In this paper, we therefore propose a coarse-to-fine vision transformer (CF-ViT) to relieve the computational burden while retaining performance. Our proposed CF-ViT is motivated by two important observations in modern ViT models: (1) Coarse-grained patch splitting can locate the informative regions of an input image. (2) Most images can be well recognized by a ViT model with a short token sequence. Accordingly, our CF-ViT performs network inference in a two-stage manner. In the coarse inference stage, an input image is split into a short patch sequence for a computationally economical classification. If the image is not well recognized, its informative patches are identified and further re-split at a finer granularity. Extensive experiments demonstrate the efficacy of our CF-ViT. For example, without any compromise on performance, CF-ViT reduces the FLOPs of LV-ViT by 53% and achieves a 2.01x throughput improvement.
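The two-stage procedure can be illustrated with a minimal sketch. The code below is an assumption-laden illustration rather than the authors' implementation: `coarse_vit`, `fine_vit`, the confidence threshold, and the attention-based patch selection are hypothetical stand-ins for the coarse splitting, early exit, and fine-grained re-splitting described above.

```python
import torch

def cf_vit_inference(coarse_vit, fine_vit, image, threshold=0.7):
    """Sketch of two-stage coarse-to-fine inference (assumes batch size 1).

    `coarse_vit` and `fine_vit` are hypothetical ViT callables; the actual
    CF-ViT's patch re-splitting and feature reuse are simplified away here.
    """
    # Coarse stage: the image is split into a small number of large patches
    # and classified cheaply; `patch_attn` is a per-patch informativeness score.
    coarse_logits, patch_attn = coarse_vit(image)
    probs = torch.softmax(coarse_logits, dim=-1)
    confidence, prediction = probs.max(dim=-1)

    # Early exit: if the coarse prediction is already confident, stop here.
    if confidence.item() >= threshold:
        return prediction

    # Fine stage: keep only the most informative coarse patches and re-split
    # them at a finer granularity for a second, more expensive pass.
    informative = patch_attn.topk(k=patch_attn.numel() // 2).indices
    fine_logits = fine_vit(image, regions=informative)
    return torch.softmax(fine_logits, dim=-1).argmax(dim=-1)
```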