Vision Transformers (VTs) are becoming a valuable alternative to Convolutional Neural Networks (CNNs) for problems involving high-dimensional and spatially organized inputs such as images. However, their Transfer Learning (TL) properties are not yet well studied, and it is not fully known whether these neural architectures can transfer across different domains as well as CNNs do. In this paper, we study whether VTs that are pre-trained on the popular ImageNet dataset learn representations that are transferable to the non-natural image domain. To do so, we consider three well-studied art classification problems and use them as a surrogate for studying the TL potential of four popular VTs. Their performance is extensively compared against that of four common CNNs across several TL experiments. Our results show that VTs exhibit strong generalization properties and that these networks are more powerful feature extractors than CNNs.