Visual Transformers (VTs) are emerging as an architectural alternative to Convolutional Neural Networks (CNNs). Unlike CNNs, VTs can capture global relations between image elements and they potentially have a larger representation capacity. However, the lack of the typical convolutional inductive bias makes these models more data-hungry than common CNNs. In fact, some local properties of the visual domain, which are embedded in the CNN architectural design, must be learned from data in VTs. In this paper, we empirically analyse different VTs, comparing their robustness in a small training-set regime, and we show that, despite having comparable accuracy when trained on ImageNet, their performance on smaller datasets can differ substantially. Moreover, we propose a self-supervised task which can extract additional information from images with only a negligible computational overhead. This task encourages the VTs to learn spatial relations within an image and makes VT training much more robust when training data are scarce. Our task is used jointly with the standard (supervised) training and it does not depend on specific architectural choices, thus it can be easily plugged into existing VTs. Using an extensive evaluation with different VTs and datasets, we show that our method can improve (sometimes dramatically) the final accuracy of the VTs. The code will be available upon acceptance.
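To make the idea of a spatial self-supervised auxiliary task more concrete, the sketch below shows one plausible way such an objective could be attached to a VT in PyTorch: sample pairs of final patch-token embeddings and train a small head to regress their relative position on the patch grid. This is only an illustration of the kind of task the abstract describes; the names (RelativeLocalizationHead, relative_localization_loss, num_pairs) and the exact pairwise-offset formulation are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a spatial self-supervised auxiliary loss for VTs.
# Assumes the backbone returns patch-token embeddings of shape (B, N, D),
# with N == grid_size ** 2. Names and design choices are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelativeLocalizationHead(nn.Module):
    """Small MLP that predicts the 2D grid offset between two token embeddings."""

    def __init__(self, embed_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 2),  # predicted (row, col) offset, scaled to [-1, 1]
        )

    def forward(self, tokens_a: torch.Tensor, tokens_b: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([tokens_a, tokens_b], dim=-1))


def relative_localization_loss(tokens: torch.Tensor, grid_size: int,
                               head: RelativeLocalizationHead,
                               num_pairs: int = 64) -> torch.Tensor:
    """Auxiliary loss: regress the relative grid offset of random token pairs."""
    B, N, D = tokens.shape
    device = tokens.device

    # Randomly sample token pairs per image.
    idx_a = torch.randint(0, N, (B, num_pairs), device=device)
    idx_b = torch.randint(0, N, (B, num_pairs), device=device)

    def gather(idx: torch.Tensor) -> torch.Tensor:
        return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, D))

    pred = head(gather(idx_a), gather(idx_b))  # (B, num_pairs, 2)

    # Ground-truth relative (row, col) offsets on the patch grid, scaled to [-1, 1].
    ya, xa = idx_a // grid_size, idx_a % grid_size
    yb, xb = idx_b // grid_size, idx_b % grid_size
    target = torch.stack([ya - yb, xa - xb], dim=-1).float() / (grid_size - 1)

    return F.l1_loss(pred, target)
```

In a setup like this, the auxiliary term would simply be added to the supervised objective, e.g. `loss = ce_loss + lambda_aux * relative_localization_loss(tokens, grid_size, head)`, consistent with the abstract's claim that the task is used jointly with standard training, is architecture-agnostic, and adds only a negligible overhead.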