Vision transformers have become one of the most important models for computer vision tasks. Although they outperform prior work, they require computational resources that scale quadratically with the number of tokens, $N$. This is a major drawback of the conventional self-attention (SA) algorithm. Here, we propose X-ViT, a ViT with a novel SA mechanism that has linear complexity. The key idea of this work is to eliminate the nonlinearity from the original SA, which lets us factorize the matrix multiplication of the SA mechanism without a complicated linear approximation. By modifying only a few lines of code of the original SA, the proposed models outperform most transformer-based models on image classification and dense prediction tasks across most capacity regimes.
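To illustrate the complexity argument, here is a minimal sketch (not the authors' exact formulation) of how removing the softmax nonlinearity enables factorization: matrix multiplication is associative, so $(QK^\top)V$ can be reassociated as $Q(K^\top V)$, replacing the $O(N^2 d)$ attention matrix with an $O(N d^2)$ product. The function names and the toy shapes below are illustrative assumptions.

```python
import torch

def quadratic_sa(q, k, v):
    # Standard self-attention: softmax((Q K^T) / sqrt(d)) V.
    # The softmax forces the N x N attention matrix to be
    # materialized, making the cost O(N^2 * d).
    d = q.shape[-1]
    attn = torch.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)
    return attn @ v

def linear_sa(q, k, v):
    # Hypothetical linearized variant: with the softmax removed,
    # (Q K^T) V is reassociated as Q (K^T V). The d x d product
    # K^T V is computed first, so the cost is O(N * d^2),
    # i.e., linear in the number of tokens N.
    d = q.shape[-1]
    return q @ (k.transpose(-2, -1) @ v) / d**0.5

# Toy check with N = 196 tokens and d = 64 channels.
q, k, v = (torch.randn(196, 64) for _ in range(3))
out = linear_sa(q, k, v)
print(out.shape)  # torch.Size([196, 64])
```

Note that this sketch only demonstrates the reassociation trick; the actual X-ViT mechanism involves additional details (e.g., how the removed nonlinearity is compensated) described in the paper itself.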