This paper studies the efficiency problem of vision transformers by excavating redundant computation in given networks. The recent transformer architecture has demonstrated its effectiveness in achieving excellent performance on a series of computer vision tasks. However, as with convolutional neural networks, the huge computational cost of vision transformers remains a severe issue. Considering that the attention mechanism aggregates different patches layer-by-layer, we present a novel patch slimming approach that discards useless patches in a top-down paradigm. We first identify the effective patches in the last layer and then use them to guide the patch selection process of previous layers. For each layer, the impact of a patch on the final output feature is approximated, and patches with less impact are removed. Experimental results on benchmark datasets demonstrate that the proposed method can significantly reduce the computational cost of vision transformers without affecting their performance. For example, over 45% of the FLOPs of the ViT-Ti model can be reduced with only a 0.2% top-1 accuracy drop on the ImageNet dataset.
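To make the top-down paradigm concrete, the following is a minimal sketch of layer-wise patch selection, assuming a precomputed per-patch importance score for every layer that approximates each patch's impact on the final output feature. The function name, the `keep_ratio` parameter, and the rule of forcing downstream-retained patches to stay available upstream are illustrative assumptions for this sketch, not the paper's exact criterion.

```python
import torch


def top_down_patch_slimming(importance_scores, keep_ratio=0.55):
    """Select patches to keep in each transformer layer, last layer first.

    importance_scores: list of 1-D tensors, one per layer, where entry i
    approximates patch i's impact on the final output feature.
    Returns a list of boolean keep-masks, one per layer.
    """
    num_layers = len(importance_scores)
    masks = [None] * num_layers

    # Last layer: keep the patches with the highest approximated impact.
    last = importance_scores[-1]
    k = max(1, int(keep_ratio * last.numel()))
    mask = torch.zeros_like(last, dtype=torch.bool)
    mask[torch.topk(last, k).indices] = True
    masks[-1] = mask

    # Walk backwards: since attention aggregates patches layer-by-layer,
    # a patch retained in layer l + 1 must also be available in layer l;
    # the remaining budget is filled by the highest-scoring patches.
    for layer in range(num_layers - 2, -1, -1):
        scores = importance_scores[layer].clone()
        scores[masks[layer + 1]] = float("inf")  # keep downstream survivors
        k = max(int(masks[layer + 1].sum()), int(keep_ratio * scores.numel()))
        mask = torch.zeros_like(scores, dtype=torch.bool)
        mask[torch.topk(scores, k).indices] = True
        masks[layer] = mask
    return masks


# Example: 12 layers of a ViT-Ti-like model with 197 patch tokens each.
scores = [torch.rand(197) for _ in range(12)]
masks = top_down_patch_slimming(scores, keep_ratio=0.55)
print([int(m.sum()) for m in masks])  # kept patches per layer
```

The backward pass encodes the guidance described above: the effective patches identified in the last layer constrain which patches may be discarded in every earlier layer.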