Vision transformers have recently surged in popularity, but their huge computational cost remains a severe issue. Recent efficient designs for vision transformers follow two pipelines, namely, structural compression based on a local spatial prior and non-structural token pruning. However, coarse token pruning breaks the spatial structure that is indispensable to the local spatial prior. To take advantage of both pipelines, this work seeks to dynamically identify uninformative tokens for each instance and trim down both the training and inference complexity while maintaining the complete spatial structure and information flow. To achieve this goal, we propose Evo-ViT, a self-motivated slow-fast token evolution method for vision transformers. Specifically, we conduct unstructured, instance-wise token selection by taking advantage of the global class attention that is native to vision transformers. Then, we propose to update the informative tokens and the placeholder tokens, which contribute little to the final prediction, via different computation paths, namely, slow-fast updating. Because the slow-fast updating mechanism preserves the information flow and spatial structure, our Evo-ViT can accelerate vanilla transformers of both flat and deep-narrow structures from the very beginning of the training process. Experimental results demonstrate that the proposed method significantly reduces the computational cost of vision transformers while maintaining comparable performance on image classification. For example, our method increases the throughput of DeiT-S by over 60% while sacrificing only 0.4% top-1 accuracy.
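To make the two mechanisms named above concrete (class-attention-based token selection and slow-fast updating), the following is a minimal PyTorch sketch of one such step. It is an illustration of the idea only, not the authors' released implementation: the function name `slow_fast_update`, the `keep_ratio` parameter, and the mean-pooled representative token are assumptions made for the sketch.

```python
import torch
import torch.nn as nn


def slow_fast_update(tokens: torch.Tensor,
                     cls_attn: torch.Tensor,
                     block: nn.Module,
                     keep_ratio: float = 0.5) -> torch.Tensor:
    """One slow-fast updating step (illustrative sketch).

    tokens:   (B, 1 + N, D), class token followed by N patch tokens.
    cls_attn: (B, N), attention of the class token over the patch tokens.
    block:    a standard transformer block mapping (B, L, D) -> (B, L, D).
    """
    B, N1, D = tokens.shape
    N = N1 - 1
    k = max(1, int(N * keep_ratio))

    # Instance-wise token selection: rank patch tokens by global class
    # attention; the top-k are informative, the rest are placeholders.
    order = cls_attn.argsort(dim=1, descending=True)
    info_idx, ph_idx = order[:, :k], order[:, k:]

    def gather(idx: torch.Tensor) -> torch.Tensor:
        return tokens[:, 1:].gather(1, idx.unsqueeze(-1).expand(-1, -1, D))

    info, ph = gather(info_idx), gather(ph_idx)

    # Summarize the placeholders into a single representative token
    # (mean pooling here is an assumption of this sketch).
    rep = ph.mean(dim=1, keepdim=True)

    # Slow path: the full block sees [cls, informative tokens, representative].
    out = block(torch.cat([tokens[:, :1], info, rep], dim=1))
    cls_out, info_out, rep_out = out[:, :1], out[:, 1:1 + k], out[:, 1 + k:]

    # Fast path: broadcast the representative token's residual update to
    # every placeholder, so information still flows to pruned positions.
    ph_out = ph + (rep_out - rep)

    # Scatter both groups back to their original positions, keeping the
    # complete spatial structure of the token sequence.
    result = torch.empty_like(tokens)
    result[:, :1] = cls_out
    result[:, 1:].scatter_(1, info_idx.unsqueeze(-1).expand(-1, -1, D), info_out)
    result[:, 1:].scatter_(1, ph_idx.unsqueeze(-1).expand(-1, -1, D), ph_out)
    return result
```

Because every token, informative or placeholder, is written back to its original position, the spatial grid stays intact and the layer can be dropped into a vanilla transformer from the start of training rather than only at fine-tuning time.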