Convolutional Neural Networks (CNNs), architectures built from convolutional layers, have long been the standard choice for vision tasks. Recent studies have shown that Vision Transformers (VTs), architectures based on self-attention modules, achieve comparable performance on challenging tasks such as object detection and semantic segmentation. However, VTs process images differently from conventional CNNs, which raises several questions about their generalizability, robustness, reliability, and texture bias when used to extract features for complex tasks. To address these questions, we study and compare VT and CNN architectures as feature extractors for object detection and semantic segmentation. Our extensive empirical results show that the features generated by VTs are more robust to distribution shifts, natural corruptions, and adversarial attacks on both tasks, whereas CNNs perform better at higher image resolutions in object detection. Furthermore, our results demonstrate that, in dense prediction tasks, VTs produce more reliable and less texture-biased predictions.
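As an illustration of the kind of backbone comparison described above, the sketch below treats a ViT and a ResNet as drop-in feature extractors and probes how much their pooled features drift under a toy Gaussian corruption. This is a minimal sketch, not the paper's experimental protocol: the use of the `timm` library, the specific model names, the noise level, and the cosine-similarity metric are all illustrative assumptions.

```python
# Minimal sketch: compare feature drift of a ViT vs. a CNN backbone
# under a toy corruption. Assumes `timm` is installed; pretrained
# weights download on first use. All choices here are illustrative.
import torch
import timm

torch.manual_seed(0)

# Backbones as pure feature extractors (num_classes=0 -> pooled embeddings).
vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0).eval()
cnn = timm.create_model("resnet50", pretrained=True, num_classes=0).eval()

x = torch.rand(4, 3, 224, 224)                         # stand-in image batch
x_noisy = (x + 0.1 * torch.randn_like(x)).clamp(0, 1)  # toy "natural corruption"

with torch.no_grad():
    for name, model in [("ViT", vit), ("CNN", cnn)]:
        f_clean = model(x)
        f_noisy = model(x_noisy)
        sim = torch.nn.functional.cosine_similarity(f_clean, f_noisy, dim=1).mean()
        print(f"{name}: mean cosine similarity, clean vs. corrupted = {sim:.3f}")
```

A higher clean-vs-corrupted similarity would indicate features that change less under the perturbation; the paper's actual robustness evaluation uses full detection and segmentation pipelines rather than this pooled-embedding proxy.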