There are two de facto standard architectures in recent computer vision: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). The strong inductive biases of convolution help a model learn sample-efficiently, but these same strong biases limit the upper bound of CNNs when sufficient data are available. Conversely, ViTs are inferior to CNNs on small data but superior given sufficient data. Recent approaches attempt to combine the strengths of these two architectures. However, by comparing the accuracy of various models on subsets of ImageNet sampled at different ratios, we show that these approaches overlook the fact that the optimal inductive bias also changes with the target data scale. In addition, through Fourier analysis of feature maps, which reveals how a model's response patterns vary with signal frequency, we observe which inductive bias is advantageous at each data scale: the more convolution-like inductive bias a model contains, the smaller the data scale at which that ViT-like model outperforms ResNet. To obtain a model whose inductive bias is flexible with respect to data scale, we show that reparameterization can interpolate the inductive bias between convolution and self-attention. By adjusting the number of epochs the model stays in the convolutional form, we show that reparameterization from convolution to self-attention interpolates the Fourier analysis pattern between CNNs and ViTs. Building on these findings, we propose Progressive Reparameterization Scheduling (PRS), in which reparameterization adjusts the required amount of convolution-like or self-attention-like inductive bias per layer. For small-scale datasets, our PRS performs the reparameterization from convolution to self-attention linearly faster at the later layers. PRS outperforms previous studies on small-scale datasets, e.g., CIFAR-100.
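The Fourier analysis mentioned above can be illustrated with a minimal sketch: take the 2D Fourier transform of a feature map, average the log-amplitudes over channels, and bin them by radial frequency to obtain a frequency-response profile relative to the lowest-frequency bin. This is an illustrative NumPy implementation under assumed conventions (channel-first feature maps, ten radial bins), not the paper's exact analysis code; the helper name `fourier_log_amplitude` is hypothetical.

```python
import numpy as np

def fourier_log_amplitude(feature_map, num_bins=10):
    """Relative log-amplitude of a feature map's Fourier spectrum as a
    function of normalized radial frequency (hypothetical helper).

    feature_map: array of shape (C, H, W).
    Returns an array of length num_bins, zeroed at the lowest-frequency bin.
    """
    c, h, w = feature_map.shape
    # Center the spectrum so low frequencies sit in the middle of the grid.
    spec = np.fft.fftshift(np.fft.fft2(feature_map, axes=(-2, -1)),
                           axes=(-2, -1))
    log_amp = np.log(np.abs(spec) + 1e-8).mean(axis=0)  # average over channels
    # Radial frequency of each spectral coordinate, normalized to [0, 1].
    fy = np.fft.fftshift(np.fft.fftfreq(h))
    fx = np.fft.fftshift(np.fft.fftfreq(w))
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    radius /= radius.max()
    # Bin log-amplitudes by radius; report each bin relative to the DC bin.
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    idx = np.digitize(radius.ravel(), edges) - 1
    profile = np.array([log_amp.ravel()[idx == i].mean()
                        for i in range(num_bins)])
    return profile - profile[0]
```

Under this analysis, a flatter profile (relatively more high-frequency amplitude) indicates a more convolution-like response, while a steeply decaying profile indicates the low-pass behavior characteristic of self-attention.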