Spatial Entropy as an Inductive Bias for Vision Transformers

Elia Peruzzo, Enver Sangineto, Yahui Liu, Marco De Nadai, Wei Bi, Bruno Lepri, Nicu Sebe

Recent work on Vision Transformers (VTs) showed that introducing a local inductive bias in the VT architecture helps reduce the number of samples necessary for training. However, the architecture modifications lead to a loss of generality of the Transformer backbone, partially contradicting the push towards the development of uniform architectures, shared, e.g., by both the Computer Vision and the Natural Language Processing areas. In this work, we propose a different and complementary direction, in which a local bias is introduced using an auxiliary self-supervised task, performed jointly with standard supervised training. Specifically, we exploit the observation that the attention maps of VTs, when trained with self-supervision, can contain a semantic segmentation structure which does not spontaneously emerge when training is supervised. Thus, we explicitly encourage the emergence of this spatial clustering as a form of training regularization. In more detail, we exploit the assumption that, in a given image, objects usually correspond to a few connected regions, and we propose a spatial formulation of the information entropy to quantify this object-based inductive bias. By minimizing the proposed spatial entropy, we include an additional self-supervised signal during training. Using extensive experiments, we show that the proposed regularization leads to equivalent or better results than other VT proposals which include a local bias by changing the basic Transformer architecture, and it can drastically boost the VT final accuracy when using small-medium training sets. The code is available at https://github.com/helia95/SAR.
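To make the idea of an entropy over connected regions concrete, the following is a minimal sketch and not the authors' implementation (which is in the repository linked above). It rests on several assumptions that the abstract does not state: the attention map is a small 2D grid, it is binarized by thresholding at its mean, regions are 4-connected, and each region's probability is its share of the above-mean attention mass. The names `connected_components` and `spatial_entropy` are hypothetical. The sketch only evaluates the quantity; it is not a differentiable training loss.

```python
import math
from collections import deque


def connected_components(mask):
    """Return the 4-connected regions of True cells in a 2D boolean grid."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for i in range(h):
        for j in range(w):
            if not mask[i][j] or seen[i][j]:
                continue
            # Breadth-first flood fill starting from an unvisited True cell.
            queue = deque([(i, j)])
            seen[i][j] = True
            cells = []
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(cells)
    return components


def spatial_entropy(attn):
    """Entropy over connected above-mean regions of a 2D attention grid.

    Assumption (not from the abstract): threshold at the mean, and weight
    each region by its share of the above-mean attention mass.
    """
    h, w = len(attn), len(attn[0])
    mean = sum(map(sum, attn)) / (h * w)
    mask = [[a > mean for a in row] for row in attn]
    masses = [sum(attn[y][x] - mean for y, x in cells)
              for cells in connected_components(mask)]
    total = sum(masses)
    if total <= 0:
        return 0.0  # flat map: no region stands out
    return sum(-(m / total) * math.log(m / total) for m in masses if m > 0)


# One compact 2x2 blob versus four isolated corner cells.
compact = [[1.0, 1.0, 0.0, 0.0],
           [1.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0]]
scattered = [[1.0, 0.0, 0.0, 1.0],
             [0.0, 0.0, 0.0, 0.0],
             [0.0, 0.0, 0.0, 0.0],
             [1.0, 0.0, 0.0, 1.0]]

print(spatial_entropy(compact) < spatial_entropy(scattered))  # prints True
```

Under these assumptions, a map whose high-attention cells form a single connected region has zero entropy, while the same attention mass split across four isolated cells has entropy log 4. Minimizing such a quantity therefore favors attention concentrated on a few connected regions, which is the object-based bias the abstract describes.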
