The Vision Transformer (ViT) is an attention-based neural network architecture that has proven effective for computer vision tasks. However, when trained on small datasets, ViT achieves significantly lower evaluation accuracy than a ResNet-18 with a comparable number of parameters. To facilitate studies in related fields, we provide a visual intuition for why this is the case. We first compare the performance of the two models and confirm that ViT is less accurate than ResNet-18 when trained on small datasets. We then interpret the results through attention map visualizations for ViT and feature map visualizations for ResNet-18. The difference is further analyzed from a representation similarity perspective. We conclude that the representations learned by ViT on small datasets differ substantially from those learned on large datasets, which may explain the large performance drop on small datasets.
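The representation similarity analysis mentioned above is commonly carried out with centered kernel alignment (CKA) between layer activations of the two networks. The abstract does not name the exact metric, so the following linear-CKA sketch is an assumption; the toy activation matrices `X` and `Y` are placeholders for, e.g., a ViT block's and a ResNet-18 stage's features over the same batch of examples.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape (examples, features)."""
    # Center each feature dimension over the examples
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))   # placeholder for ViT block activations
Y = rng.normal(size=(500, 32))   # placeholder for ResNet-18 stage activations

print(linear_cka(X, X))  # identical representations give exactly 1.0
print(linear_cka(X, Y))  # independent random features score low
```

A high CKA value means two layers encode similar information; comparing a small-data ViT against a large-data ViT layer by layer with this score is one way to quantify how "hugely different" the learned representations are.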