Visual Transformers (VTs) are emerging as an architectural alternative to Convolutional Neural Networks (CNNs). Unlike CNNs, VTs can capture global relations between image elements and potentially have a larger representation capacity. However, the lack of the typical convolutional inductive bias makes these models more data-hungry than common CNNs. In fact, some local properties of the visual domain that are embedded in the CNN architectural design must instead be learned from samples by VTs. In this paper, we empirically analyse different VTs, comparing their robustness in a small training-set regime, and we show that, despite having comparable accuracy when trained on ImageNet, their performance on smaller datasets can differ considerably. Moreover, we propose a self-supervised task which can extract additional information from images with only negligible computational overhead. This task encourages the VTs to learn spatial relations within an image and makes VT training much more robust when training data are scarce. Our task is used jointly with the standard (supervised) training and does not depend on specific architectural choices, thus it can easily be plugged into existing VTs. Using an extensive evaluation with different VTs and datasets, we show that our method can improve (sometimes dramatically) the final accuracy of VTs. Our code is available at: https://github.com/yhlleo/VTs-Drloc.
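To make the idea of an auxiliary self-supervised localization task concrete, below is a minimal PyTorch sketch of what such a plug-in loss could look like. It is an illustration built only from the description in the abstract (learning spatial relations within an image, trained jointly with the supervised objective): the pair-sampling scheme, the `RelativeLocalizationHead` MLP, the `backbone`/`classifier` interfaces, and the weight `lambda_drloc` are all hypothetical choices, not the exact formulation used in the paper or the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelativeLocalizationHead(nn.Module):
    """Illustrative auxiliary head: given pairs of token embeddings sampled from
    the final feature grid of a Visual Transformer, predict their relative
    (dy, dx) offset, encouraging the backbone to encode spatial relations."""

    def __init__(self, embed_dim, hidden_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 2),  # predicted (dy, dx)
        )

    def forward(self, tokens, num_pairs=32):
        # tokens: (B, H, W, C) grid of patch embeddings from the VT backbone
        B, H, W, C = tokens.shape
        # randomly sample pairs of grid positions per image
        i = torch.randint(0, H, (B, num_pairs, 2), device=tokens.device)
        j = torch.randint(0, W, (B, num_pairs, 2), device=tokens.device)
        batch_idx = torch.arange(B, device=tokens.device)[:, None]
        t1 = tokens[batch_idx, i[..., 0], j[..., 0]]  # (B, num_pairs, C)
        t2 = tokens[batch_idx, i[..., 1], j[..., 1]]  # (B, num_pairs, C)
        pred = self.mlp(torch.cat([t1, t2], dim=-1))  # (B, num_pairs, 2)
        # regression target: relative offsets normalized by the grid size
        target = torch.stack(
            [(i[..., 1] - i[..., 0]).float() / H,
             (j[..., 1] - j[..., 0]).float() / W], dim=-1)
        return F.l1_loss(pred, target)


def training_step(backbone, classifier, drloc_head, images, labels,
                  lambda_drloc=0.1):
    """Joint objective: standard cross-entropy plus the auxiliary localization
    loss (hypothetical interfaces; backbone returns the token grid and a
    global feature used for classification)."""
    grid, cls_feat = backbone(images)            # grid: (B, H, W, C)
    ce_loss = F.cross_entropy(classifier(cls_feat), labels)
    loc_loss = drloc_head(grid)
    return ce_loss + lambda_drloc * loc_loss
```

Because the auxiliary head only reads the token grid already produced by the backbone, a scheme of this kind adds a small MLP and a handful of sampled pairs per image, which is consistent with the claim of negligible computational overhead and of being architecture-agnostic.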