Vision Transformers (ViTs) are emerging as an alternative to convolutional neural networks (CNNs) for visual recognition. They achieve results competitive with CNNs, but the lack of the typical convolutional inductive bias makes them more data-hungry than common CNNs. They are often pretrained on JFT-300M or at least on ImageNet, and few works study training ViTs with limited data. In this paper, we investigate how to train ViTs with limited data (e.g., 2040 images). We give theoretical analyses showing that our method (based on parametric instance discrimination) is superior to other methods in that it captures both feature alignment and instance similarities. We achieve state-of-the-art results when training from scratch on 7 small datasets under various ViT backbones. We also investigate the transferring ability of representations learned on small datasets and find that they can even improve large-scale ImageNet training.
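For reference, parametric instance discrimination in its standard formulation treats every training image as its own class and trains the backbone with a temperature-scaled softmax cross-entropy over a learnable instance classifier. A minimal sketch of this objective, in our own notation and assuming that standard formulation rather than the paper's exact loss, is:

\[
\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\!\left(w_{i}^{\top} f_\theta(x_i)/\tau\right)}{\sum_{j=1}^{N} \exp\!\left(w_{j}^{\top} f_\theta(x_i)/\tau\right)},
\]

where \(f_\theta\) denotes the ViT backbone, \(w_j\) is the learnable weight (prototype) of the \(j\)-th instance class, \(N\) is the number of training images, and \(\tau\) is a temperature. Because the classifier weights \(w_j\) are learned jointly with the backbone, they can encode similarities between instances in addition to aligning each feature \(f_\theta(x_i)\) with its own prototype.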