Recent work has shown that the attention maps of Vision Transformers (VTs), when trained with self-supervision, can contain a semantic segmentation structure that does not spontaneously emerge when training is supervised. In this paper, we explicitly encourage the emergence of this spatial clustering as a form of training regularization, thereby incorporating a self-supervised pretext task into standard supervised learning. In more detail, we propose a VT regularization method based on a spatial formulation of information entropy. By minimizing the proposed spatial entropy, we explicitly ask the VT to produce spatially ordered attention maps, thus injecting an object-based prior into training. Through extensive experiments, we show that the proposed regularization approach is beneficial across different training scenarios, datasets, downstream tasks, and VT architectures. The code will be made available upon acceptance.
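To make the idea concrete, the sketch below illustrates one way such a regularizer could be wired into a supervised training loop. This is a minimal PyTorch sketch under our own assumptions, not the authors' implementation: the helper spatial_entropy, the attention shape [B, heads, H, W], and the weight lambda_se are hypothetical names, and plain Shannon entropy over the patch grid is used as a simplified stand-in for the paper's spatial formulation.

```python
import torch

def spatial_entropy(cls_attn: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Entropy of each head's [CLS] attention over the patch grid.

    cls_attn: [B, heads, H, W] -- the [CLS] token's attention, reshaped to
    the 2D patch grid (shape and name are assumptions, not the paper's API).
    Plain Shannon entropy is used here as a simplified stand-in for the
    paper's spatial entropy.
    """
    p = cls_attn.flatten(2)                      # [B, heads, H*W]
    p = p / (p.sum(dim=-1, keepdim=True) + eps)  # normalize to a distribution
    ent = -(p * torch.log(p + eps)).sum(dim=-1)  # entropy per head
    return ent.mean()                            # average over batch and heads

# Hypothetical training step: the entropy term is added to the supervised
# loss, so minimizing it pushes attention toward low-entropy, compact maps.
# logits, cls_attn = model(images)               # assumed model outputs
# loss = criterion(logits, labels) + lambda_se * spatial_entropy(cls_attn)
```

How the entropy term accounts for the spatial arrangement of attention mass, rather than treating patches as an unordered set, is what the paper's formulation makes precise; the sketch only shows where such a regularizer would enter the loss.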