Convolutional Neural Networks have become the standard for image classification tasks; however, these architectures are not invariant to translations of the input image. This lack of invariance is attributed to the use of stride, which ignores the sampling theorem, and to fully connected layers, which lack spatial reasoning. We show that stride can greatly benefit translation invariance provided it is combined with sufficient similarity between neighbouring pixels, a characteristic we refer to as local homogeneity. We also observe that this characteristic is dataset-specific and dictates the relationship between pooling kernel size and stride required for translation invariance. Furthermore, we find that a trade-off exists between generalization and translation invariance with respect to pooling kernel size: larger kernels yield better invariance but poorer generalization. Finally, we explore the efficacy of other proposed solutions, namely global average pooling, anti-aliasing, and data augmentation, both empirically and through the lens of local homogeneity.
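To make the stride argument concrete, the following is a minimal sketch, not taken from the paper: it uses PyTorch (torch, torch.nn.functional) and toy images of our own construction to show that a one-pixel shift changes the output of a stride-2 max pool, and that the change is typically much smaller when neighbouring pixels are similar, i.e. when the input is locally homogeneous.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical toy inputs (not the paper's data): a random image has low
# local homogeneity; blurring it raises the similarity between neighbours.
x = torch.rand(1, 1, 8, 8)
x_smooth = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)

for name, img in [("random", x), ("smooth", x_smooth)]:
    shifted = torch.roll(img, shifts=1, dims=-1)   # translate one pixel right
    y = F.max_pool2d(img, kernel_size=2, stride=2)
    y_shifted = F.max_pool2d(shifted, kernel_size=2, stride=2)
    # With exact translation invariance this difference would be zero; the
    # stride-2 subsampling makes it nonzero, and the gap is typically far
    # smaller for the locally homogeneous (smooth) input.
    print(name, (y - y_shifted).abs().max().item())
```

This illustrates the abstract's claim only qualitatively; the paper's actual measurements of invariance and of the kernel-size/stride relationship are dataset-specific and not reproduced here.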