Vision Transformers (ViT) have been shown to attain highly competitive performance on a wide range of vision applications, such as image classification, object detection and semantic image segmentation. In comparison to convolutional neural networks, the Vision Transformer's weaker inductive bias is generally found to cause an increased reliance on model regularization or data augmentation (``AugReg'' for short) when training on smaller datasets. We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget. As one result of this study we find that the combination of increased compute and AugReg can yield models with the same performance as models trained on an order of magnitude more training data: we train ViT models of various sizes on the public ImageNet-21k dataset that either match or outperform their counterparts trained on the larger, but not publicly available, JFT-300M dataset.
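To make the ``AugReg'' notion concrete, the following is a minimal, illustrative sketch of what a combined data-augmentation and model-regularization setup for ViT training might look like, assuming PyTorch, torchvision and timm are available; the specific choices and values (RandAugment, dropout and stochastic-depth rates, weight decay) are placeholders for illustration, not the settings selected in the study.

```python
# Illustrative AugReg-style setup for ViT training (assumed libraries:
# torch, torchvision, timm). Values are placeholders, not the paper's settings.
import timm
import torch
from torchvision import transforms

# Data-augmentation half of AugReg: standard crops/flips plus RandAugment.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=10),
    transforms.ToTensor(),
])

# Model-regularization half of AugReg: dropout and stochastic depth in the ViT.
model = timm.create_model(
    "vit_small_patch16_224",
    pretrained=False,
    num_classes=1000,
    drop_rate=0.1,       # dropout
    drop_path_rate=0.1,  # stochastic depth
)

# Weight decay in the optimizer acts as an additional regularizer.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)
```

In this sketch, the augmentation strength and the regularization rates are the knobs whose interplay with dataset size, model size and compute budget the study investigates.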