Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer

From arXiv: We propose a novel and effective design for dynamic vision transformers that achieves better computational efficiency. The code is available at https://github.com/YifanXu74/EvoViT

Vision transformers (ViTs) have recently gained explosive popularity, but their huge computational cost remains a severe issue. Since the computational complexity of ViT is quadratic with respect to the input sequence length, a mainstream paradigm for reducing computation is to reduce the number of tokens. Existing designs include structured spatial compression, which uses a progressive shrinking pyramid to reduce the computation of large feature maps, and unstructured token pruning, which dynamically drops redundant tokens. However, existing token pruning has two limitations: 1) the incomplete spatial structure caused by pruning is not compatible with the structured spatial compression commonly used in modern deep-narrow transformers; 2) it usually requires a time-consuming pre-training procedure. To tackle these limitations and expand the applicable scenarios of token pruning, we present Evo-ViT, a self-motivated slow-fast token evolution approach for vision transformers. Specifically, we conduct unstructured instance-wise token selection by taking advantage of the simple and effective global class attention that is native to vision transformers. Then, we propose to update the selected informative tokens and the uninformative tokens with different computation paths, namely, slow-fast updating. Since the slow-fast updating mechanism maintains the spatial structure and information flow, Evo-ViT can accelerate vanilla transformers of both flat and deep-narrow structures from the very beginning of the training process. Experimental results demonstrate that our method significantly reduces the computational cost of vision transformers while maintaining comparable performance on image classification.
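
To make the two steps in the abstract concrete, here is a minimal NumPy sketch of class-attention token selection followed by a slow-fast update. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not spell out how the uninformative tokens are summarized or how their cheap update is computed. Here they are merged into a single attention-weighted representative token and then shifted by that token's residual. The names `select_tokens`, `slow_fast_update`, `keep_ratio`, and `block` are hypothetical.

```python
import numpy as np

def select_tokens(cls_attn, keep_ratio):
    """Split patch-token indices into informative and uninformative sets.

    cls_attn: (N,) attention from the class token to each patch token.
    keep_ratio: fraction of tokens treated as informative (assumed parameter).
    """
    n = cls_attn.shape[0]
    k = max(1, int(round(n * keep_ratio)))
    order = np.argsort(-cls_attn, kind="stable")  # highest attention first
    return np.sort(order[:k]), np.sort(order[k:])

def slow_fast_update(tokens, cls_attn, keep_ratio, block):
    """Update all N tokens while running `block` on only k + 1 of them.

    tokens: (N, D) patch tokens.
    block: callable mapping (M, D) -> (M, D), standing in for a transformer block.
    Assumes the class attention over the uninformative tokens sums to a positive value.
    """
    keep, rest = select_tokens(cls_attn, keep_ratio)
    if rest.size == 0:
        return block(tokens)
    out = tokens.copy()
    # Assumption: summarize uninformative tokens as one attention-weighted token.
    w = cls_attn[rest] / cls_attn[rest].sum()
    rep = (w[:, None] * tokens[rest]).sum(axis=0, keepdims=True)
    # Slow path: informative tokens plus the representative go through the block.
    slow_out = block(np.concatenate([tokens[keep], rep], axis=0))
    out[keep] = slow_out[:-1]
    # Fast path (assumption): reuse the representative's residual for the rest.
    out[rest] = tokens[rest] + (slow_out[-1] - rep[0])
    return out  # same (N, D) shape, so the spatial layout is preserved
```

Because the output keeps all N token positions, a later structured stage such as spatial downsampling can still reshape the tokens into a grid, which is the compatibility property the abstract emphasizes. The saving comes from the block seeing k + 1 tokens instead of N, and self-attention cost grows quadratically with that count.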
