Vision transformers have become one of the most important models for computer vision tasks. Although they outperform earlier convolutional networks, their complexity is quadratic in the number of tokens $N$, which is a major drawback of traditional self-attention algorithms. Here we propose UFO-ViT (Unit Force Operated Vision Transformer), a novel method that reduces the computational cost of self-attention by eliminating some of its non-linearity. By modifying only a few lines of the self-attention code, UFO-ViT achieves linear complexity without performance degradation. The proposed models outperform most transformer-based models on image classification and dense prediction tasks across most capacity regimes.
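The linear-complexity trick the abstract alludes to can be sketched as follows. Once the softmax non-linearity is removed from self-attention, the product $(QK^\top)V$ can be reassociated as $Q(K^\top V)$, replacing the $N \times N$ attention map (cost $O(N^2 d)$) with a $d \times d$ intermediate (cost $O(N d^2)$). This is a minimal illustrative sketch of that reassociation only, not the full UFO-ViT module:

```python
import numpy as np

# Sketch of the core linearization idea: without softmax between
# Q K^T and V, matrix multiplication is associative, so
#   (Q @ K.T) @ V  ==  Q @ (K.T @ V)
# The left side builds an N x N map (O(N^2 d)); the right side
# builds a d x d intermediate (O(N d^2)), linear in N.

rng = np.random.default_rng(0)
N, d = 128, 16                      # sequence length, head dimension
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))

quadratic = (Q @ K.T) @ V           # explicit N x N attention map
linear = Q @ (K.T @ V)              # reassociated, linear in N

assert np.allclose(quadratic, linear)
```

Note that this equivalence holds only because the softmax has been dropped; the actual UFO-ViT method replaces it with its own normalization scheme, which this sketch does not reproduce.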