The Vision Transformer (ViT) architecture has recently achieved competitive performance across a variety of computer vision tasks. One of the motivations behind ViTs is their weaker inductive biases compared to convolutional neural networks (CNNs). However, this also makes ViTs more difficult to train: they require very large training datasets, heavy regularization, and strong data augmentations. The data augmentation strategies used to train ViTs have largely been inherited from CNN training, despite the significant differences between the two architectures. In this work, we empirically evaluated how different data augmentation strategies performed on CNN (e.g., ResNet) versus ViT architectures for image classification. We introduced a style-transfer data augmentation, termed StyleAug, which worked best for training ViTs, while RandAugment and AugMix typically worked best for training CNNs. We also found that, in addition to a classification loss, using a consistency loss between multiple augmentations of the same image was especially helpful when training ViTs.
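To make the consistency-loss idea concrete, the sketch below shows one common way to combine a classification loss with a consistency term between two augmented views of the same image. The abstract does not specify the exact form of the consistency loss, so the symmetric KL term, the use of two views, and the weighting hyperparameter `lam` are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def classification_with_consistency_loss(model, x_view1, x_view2, labels, lam=1.0):
    """Illustrative sketch (not the paper's exact loss):
    cross-entropy on one augmented view plus a symmetric KL consistency
    term between the model's predictions for two augmentations of the
    same images. `lam` is an assumed weight on the consistency term.
    """
    logits1 = model(x_view1)
    logits2 = model(x_view2)

    # Standard classification loss on the first augmented view.
    ce = F.cross_entropy(logits1, labels)

    # Consistency: encourage both views to produce similar predictive distributions.
    logp1 = F.log_softmax(logits1, dim=-1)
    logp2 = F.log_softmax(logits2, dim=-1)
    consistency = 0.5 * (
        F.kl_div(logp1, logp2, log_target=True, reduction="batchmean")
        + F.kl_div(logp2, logp1, log_target=True, reduction="batchmean")
    )
    return ce + lam * consistency
```

In a training loop, `x_view1` and `x_view2` would be produced by applying the chosen augmentation (e.g., StyleAug, RandAugment, or AugMix) twice to the same batch of images before calling this function.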