Equivariance guarantees that a model's predictions capture key symmetries in data. When an image is translated or rotated, an equivariant model's representation of that image will translate or rotate accordingly. The success of convolutional neural networks has historically been tied to translation equivariance directly encoded in their architecture. The rising success of vision transformers, which have no explicit architectural bias towards equivariance, challenges this narrative and suggests that augmentations and training data might also play a significant role in their performance. In order to better understand the role of equivariance in recent vision models, we introduce the Lie derivative, a method for measuring equivariance with strong mathematical foundations and minimal hyperparameters. Using the Lie derivative, we study the equivariance properties of hundreds of pretrained models, spanning CNNs, transformers, and Mixer architectures. The scale of our analysis allows us to separate the impact of architecture from other factors like model size or training method. Surprisingly, we find that many violations of equivariance can be linked to spatial aliasing in ubiquitous network layers, such as pointwise non-linearities, and that as models get larger and more accurate they tend to display more equivariance, regardless of architecture. For example, transformers can be more equivariant than convolutional neural networks after training.
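The idea of measuring equivariance via a Lie derivative can be illustrated with a toy sketch. This is not the paper's implementation (which differentiates through pretrained vision models); it is a minimal 1-D finite-difference analogue, with all function names of my own choosing. For a one-parameter translation group, the Lie derivative of a function `f` at input `x` is the derivative at `t = 0` of `shift(-t) ∘ f ∘ shift(t)` applied to `x`; it vanishes exactly when `f` is translation equivariant. The sketch also illustrates the aliasing point: a pointwise non-linearity such as ReLU creates harmonics that alias under continuous (fractional) translation, giving a nonzero Lie derivative, while a circular convolution commutes with translation and gives a derivative that is numerically zero.

```python
import numpy as np

def shift(x, t):
    """Circularly translate a 1-D signal by t (possibly fractional)
    samples using the Fourier shift theorem."""
    freqs = np.fft.fftfreq(len(x))  # frequencies in cycles per sample
    return np.real(np.fft.ifft(np.fft.fft(x) * np.exp(-2j * np.pi * freqs * t)))

def translation_lie_derivative(f, x, eps=1e-4):
    """Central finite-difference estimate of the Lie derivative of f at x
    under continuous translation: d/dt [shift(-t) o f o shift(t)](x) at t=0.
    For a translation-equivariant f this is (numerically) zero."""
    fwd = shift(f(shift(x, eps)), -eps)
    bwd = shift(f(shift(x, -eps)), eps)
    return (fwd - bwd) / (2 * eps)

# Band-limited test signal (no energy at the Nyquist frequency).
n = 16
m = np.arange(n)
x = np.sin(2 * np.pi * m / n) + 0.5 * np.cos(2 * np.pi * 3 * m / n + 0.7)

# A circular convolution commutes with translation: Lie derivative ~ 0.
f_conv = lambda s: 0.25 * np.roll(s, 1) + 0.5 * s + 0.25 * np.roll(s, -1)
# A pointwise ReLU generates harmonics that alias: Lie derivative != 0.
f_relu = lambda s: np.maximum(s, 0.0)

conv_norm = np.linalg.norm(translation_lie_derivative(f_conv, x))
relu_norm = np.linalg.norm(translation_lie_derivative(f_relu, x))
print(conv_norm, relu_norm)  # conv_norm near machine precision; relu_norm markedly larger
```

The norm of the Lie derivative plays the role of the paper's equivariance-error metric: larger values mean larger symmetry violations, and here the violation for ReLU comes entirely from spatial aliasing, mirroring the abstract's finding about pointwise non-linearities.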