IA-RED$^2$: Interpretability-Aware Redundancy Reduction for Vision Transformers

The self-attention-based transformer has recently become the leading backbone in computer vision. Despite their impressive success across a variety of vision tasks, transformers still suffer from heavy computation and intensive memory costs. To address this limitation, this paper presents an Interpretability-Aware REDundancy REDuction framework (IA-RED$^2$). We start by observing a large amount of redundant computation, mainly spent on uncorrelated input patches, and then introduce an interpretable module that dynamically and gracefully drops these redundant patches. The framework is then extended to a hierarchical structure in which uncorrelated tokens are gradually removed at different stages, considerably reducing the computational cost. Extensive experiments on both image and video tasks show that our method delivers up to 1.4x speed-up for state-of-the-art models such as DeiT and TimeSformer while sacrificing less than 0.7% accuracy. More importantly, unlike other acceleration approaches, our method is inherently interpretable, with substantial visual evidence, making the vision transformer both lighter and closer to a human-understandable architecture. We demonstrate, with both qualitative and quantitative results, that the interpretability that naturally emerges in our framework can outperform the raw attention learned by the original vision transformer as well as the outputs of off-the-shelf interpretation methods. Project Page: http://people.csail.mit.edu/bpan/ia-red/.
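To make the idea of hierarchical token dropping concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`drop_tokens`, `hierarchical_drop`, `attention_pairs`), the fixed 0.5 threshold, and the hand-written keep scores are all assumptions for illustration. In the actual framework the scores would come from the learned interpretable module. The sketch only shows why removing tokens stage by stage shrinks the cost of self-attention, which grows with the square of the token count.

```python
from typing import Dict, List, Sequence


def drop_tokens(
    tokens: Sequence[str],
    keep_scores: Dict[str, float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[str]:
    """Keep the first (class) token and every patch token scoring >= threshold."""
    if not tokens:
        return []
    kept = [tokens[0]]  # the class token is never dropped in this sketch
    kept.extend(t for t in tokens[1:] if keep_scores[t] >= threshold)
    return kept


def attention_pairs(num_tokens: int) -> int:
    """Self-attention compares every token with every token: n * n pairs."""
    return num_tokens * num_tokens


def hierarchical_drop(
    tokens: Sequence[str],
    stage_scores: Sequence[Dict[str, float]],
    threshold: float = 0.5,
) -> List[List[str]]:
    """Apply token dropping once per stage and record the surviving tokens."""
    history = [list(tokens)]
    for scores in stage_scores:
        history.append(drop_tokens(history[-1], scores, threshold))
    return history


# Hypothetical input: one class token and four patch tokens.
tokens = ["cls", "p0", "p1", "p2", "p3"]
# Hypothetical per-stage keep scores standing in for the interpretable module.
stage_scores = [
    {"p0": 0.9, "p1": 0.2, "p2": 0.8, "p3": 0.6},
    {"p0": 0.7, "p2": 0.3, "p3": 0.9},
]

history = hierarchical_drop(tokens, stage_scores)
pair_counts = [attention_pairs(len(stage)) for stage in history]
print(history[-1])   # tokens that survive both stages
print(pair_counts)   # attention pair count before and after each stage
```

In this toy run the token count falls from 5 to 4 to 3, so the number of attention pairs falls from 25 to 16 to 9. The kept tokens also serve as a crude picture of what the model considers relevant, which is the sense in which the reduction is interpretable.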
