We provide a detailed evaluation of image classification architectures (convolutional networks, vision transformers, and fully connected MLPs) and data augmentation techniques with respect to generalization under large spatial translation shifts. We make the following observations: (a) In the absence of data augmentation, all architectures, including convolutional networks, suffer degraded performance when evaluated on translated test distributions. Understandably, both in-distribution accuracy and robustness to shifts are significantly worse for non-convolutional architectures. (b) Across all architectures, even a minimal augmentation of $4$-pixel random crops improves robustness to much larger shifts of up to $1/4$ of the image size ($8$-$16$ pixels) in the test data, suggesting a form of meta-generalization from augmentation. For non-convolutional architectures, while absolute accuracy remains low, we see dramatic improvements in robustness to large translation shifts. (c) With a sufficiently advanced augmentation pipeline ($4$-pixel random crop + RandAugment + Random Erasing + MixUp), all architectures can be trained to achieve competitive performance, both in terms of in-distribution accuracy and generalization to large translation shifts.
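To make the two operations concrete, the following is a minimal numpy sketch (function names are illustrative, not from the paper) of the standard $4$-pixel pad-and-crop augmentation applied at training time, and the larger translation shift applied to the test distribution:

```python
import numpy as np

def random_crop(img, pad=4, rng=None):
    """Standard pad-and-crop augmentation: zero-pad by `pad` pixels on
    each side, then crop a window of the original size at a random
    offset. Equivalent to a random translation of up to `pad` pixels."""
    rng = np.random.default_rng(rng)
    h, w = img.shape[:2]
    widths = ((pad, pad), (pad, pad)) + ((0, 0),) * (img.ndim - 2)
    padded = np.pad(img, widths)
    dy, dx = rng.integers(0, 2 * pad + 1, size=2)
    return padded[dy:dy + h, dx:dx + w]

def translate(img, dy, dx):
    """Test-time translation shift: move image content down/right by
    (dy, dx) >= 0 pixels, zero-filling the vacated border. A shift of
    h // 4 pixels corresponds to the 1/4-image-size shifts evaluated."""
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    out[dy:, dx:] = img[:h - dy, :w - dx]
    return out
```

The training augmentation only ever displaces content by at most `pad` pixels, so robustness to test shifts several times larger is not a direct consequence of the augmentation's support, which is the point of observation (b).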