Vision transformers have recently gained explosive popularity, but their huge computational cost remains a severe issue. Recent efficient designs for vision transformers follow two pipelines, namely, structural compression based on local spatial priors and non-structural token pruning. However, token pruning breaks the spatial structure that is indispensable to local spatial priors. To take advantage of both pipelines, this work seeks to dynamically identify uninformative tokens for each instance and trim down both the training and inference complexity while maintaining the complete spatial structure and information flow. To achieve this goal, we propose Evo-ViT, a self-motivated slow-fast token evolution method for vision transformers. Specifically, we conduct unstructured instance-wise token selection by exploiting the global class attention that is unique to vision transformers. Then, we update the informative tokens and the placeholder tokens, which contribute little to the final prediction, with different computational priorities, namely, slow-fast updating. Thanks to the slow-fast updating mechanism, which preserves the information flow and spatial structure, our Evo-ViT can accelerate vanilla transformers of both flat and deep-narrow structures from the very beginning of the training process. Experimental results demonstrate that the proposed method significantly reduces the computational cost of vision transformers while maintaining comparable performance on image classification. For example, our method accelerates DeiT-S by over 60% in throughput while sacrificing only 0.4% top-1 accuracy.
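To make the token selection step concrete, below is a minimal PyTorch-style sketch, not the paper's implementation: it ranks patch tokens by the class token's attention scores, keeps the top-k as informative tokens, and aggregates the remainder into a single placeholder token. The function name, the keep_ratio parameter, and the attention-weighted aggregation are illustrative assumptions rather than details taken from the paper.

```python
import torch

def select_tokens_by_cls_attn(tokens, cls_attn, keep_ratio=0.5):
    """Split patch tokens into informative and placeholder groups
    using the class token's attention scores (illustrative sketch).

    tokens:   (B, N, C) patch tokens (class token excluded)
    cls_attn: (B, N)    attention from the class token to each patch,
                        e.g. averaged over heads
    Returns the kept informative tokens, one aggregated placeholder
    token per image, and the indices of the kept tokens.
    """
    B, N, C = tokens.shape
    k = max(1, int(N * keep_ratio))

    # Rank patches by how strongly the class token attends to them.
    scores, idx = cls_attn.sort(dim=1, descending=True)
    keep_idx, drop_idx = idx[:, :k], idx[:, k:]

    def gather(i):
        return tokens.gather(1, i.unsqueeze(-1).expand(-1, -1, C))

    informative = gather(keep_idx)                      # (B, k, C)

    # Summarize the uninformative patches into a single placeholder
    # token, weighted by their renormalized class attention
    # (an assumption made for this sketch).
    w = scores[:, k:].softmax(dim=1).unsqueeze(-1)      # (B, N-k, 1)
    placeholder = (gather(drop_idx) * w).sum(dim=1, keepdim=True)  # (B, 1, C)
    return informative, placeholder, keep_idx
```

In the full method, the placeholder tokens are updated along a cheaper, lower-priority path while the informative tokens pass through the complete transformer block, which is what keeps the spatial structure and information flow intact.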