Vision transformers (ViTs) have recently demonstrated state-of-the-art performance in a variety of vision tasks, replacing convolutional neural networks (CNNs). Since ViT has a different architecture from CNNs, however, it may behave differently. To investigate the reliability of ViT, this paper studies its behavior and robustness. We compared the robustness of CNN and ViT under various image corruptions that may appear in practical vision tasks. We confirmed that for most image transformations, ViT showed robustness comparable to or better than CNN. However, for contrast enhancement, severe performance degradation was consistently observed in ViT. From a detailed analysis, we identified a potential problem: the positional embedding in ViT's patch embedding can behave improperly when the color scale changes. Here we propose PreLayerNorm, a modified patch embedding structure that ensures scale-invariant behavior of ViT. ViT with PreLayerNorm showed improved robustness under various corruptions, including contrast-varying environments.
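The core idea can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the module name `PreLayerNormPatchEmbed` and all hyperparameters are assumptions. The point is that applying LayerNorm to each flattened patch *before* the linear projection divides out a global contrast (scale) change, so the learned positional embedding is added to tokens whose magnitude no longer depends on the image's color scale.

```python
import torch
import torch.nn as nn

class PreLayerNormPatchEmbed(nn.Module):
    """Patch embedding with LayerNorm applied before projection (illustrative sketch)."""

    def __init__(self, img_size=32, patch_size=4, in_chans=3, embed_dim=64):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (img_size // patch_size) ** 2
        patch_dim = in_chans * patch_size * patch_size
        self.norm = nn.LayerNorm(patch_dim)          # normalizes each patch before projection
        self.proj = nn.Linear(patch_dim, embed_dim)  # patch -> token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x):
        # x: (B, C, H, W) -> non-overlapping patches -> (B, num_patches, patch_dim)
        p = self.patch_size
        x = x.unfold(2, p, p).unfold(3, p, p)                    # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).flatten(1, 2).flatten(2)  # (B, N, C*p*p)
        # LayerNorm removes each patch's mean and scale, so multiplying the
        # image by a constant (a contrast change) leaves the tokens, and hence
        # the positional embedding's relative contribution, nearly unchanged.
        x = self.norm(x)
        return self.proj(x) + self.pos_embed
```

Because LayerNorm subtracts the per-patch mean and divides by the per-patch standard deviation, an input scaled by any constant `c` produces (up to the small `eps` term) identical tokens, which is the scale-invariant behavior described above.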