Vision Transformers are very popular nowadays due to their state-of-the-art performance in several computer vision tasks, such as image classification and action recognition. Although the performance of Vision Transformers has been greatly improved by employing Convolutional Neural Networks, hierarchical structures, and compact forms, there is limited research on ways to utilize additional data representations to refine the attention map derived from the multi-head attention of a Transformer network. This work proposes a novel attention mechanism, called multi-manifold attention, that can substitute for any standard attention mechanism in a Transformer-based network. The proposed attention models the input space on three distinct manifolds, namely Euclidean, Symmetric Positive Definite, and Grassmann, each with different statistical and geometrical properties, guiding the network to take into consideration a rich set of information describing the appearance, color, and texture of an image when computing a highly descriptive attention map. In this way, a Vision Transformer equipped with the proposed attention becomes more attentive to discriminative features, leading to improved classification results, as shown by experiments on several well-known image classification datasets.
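The abstract does not specify how the three manifold-specific attention maps are computed or fused, so the following is only a minimal illustrative sketch of the general idea: a standard Euclidean (scaled dot-product) attention map is combined with two hypothetical maps built from SPD-style covariance descriptors and Grassmann-style subspace similarities. The per-token SPD construction, the log-Euclidean distance, the 1-d subspace comparison, and the simple averaging fusion are all assumptions for illustration, not the paper's actual method.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def euclidean_attention(Q, K):
    # Standard scaled dot-product attention scores.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d))

def spd_attention(Q, K, eps=1e-6):
    # Hypothetical SPD branch: represent each token by a small SPD
    # matrix (a regularized outer product, a covariance-like
    # descriptor) and compare tokens with a log-Euclidean distance.
    def spd(x):
        return np.outer(x, x) + eps * np.eye(len(x))

    def logm(S):
        # Matrix logarithm of an SPD matrix via eigendecomposition.
        w, V = np.linalg.eigh(S)
        return (V * np.log(w)) @ V.T

    n = Q.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            D[i, j] = np.linalg.norm(logm(spd(Q[i])) - logm(spd(K[j])))
    # Smaller distance -> larger attention weight.
    return softmax(-D)

def grassmann_attention(Q, K):
    # Hypothetical Grassmann branch: treat each token as a
    # one-dimensional subspace and compare subspaces via the
    # absolute cosine of the principal angle.
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Kn = K / np.linalg.norm(K, axis=1, keepdims=True)
    return softmax(np.abs(Qn @ Kn.T))

def multi_manifold_attention(Q, K, V):
    # Fuse the three attention maps by simple averaging (an assumed
    # fusion rule); the result stays row-stochastic, so it can be
    # applied to the values like any standard attention map.
    A = (euclidean_attention(Q, K)
         + spd_attention(Q, K)
         + grassmann_attention(Q, K)) / 3.0
    return A @ V
```

Because each branch produces a row-stochastic map over the same set of tokens, the fused map can drop into a Transformer block wherever a single softmax attention map is used.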