The vision transformer (ViT) has advanced to the cutting edge of visual recognition. Recent studies report that transformers are more robust than CNNs, attributing this robustness to ViT's self-attention mechanism. However, we find that these conclusions rest on unfair experimental conditions and comparisons of only a few models, which cannot capture the full picture of robustness performance. In this study, we investigate the performance of 58 state-of-the-art computer vision models in a unified training setup, covering not only attention-based and convolution-based networks but also networks that combine convolution and attention, sequence-based models, complementary search, and network-based methods. Our results show that robustness depends on the training setup and the model type, and that performance varies with the type of out-of-distribution data. Our study will help the community better understand and benchmark the robustness of computer vision models.