Vision Transformers (ViT) have recently emerged as a powerful alternative to convolutional networks (CNNs). Although hybrid models attempt to bridge the gap between these two architectures, the self-attention layers they rely on induce a strong computational bottleneck, especially at large spatial resolutions. In this work, we explore the idea of reducing the time spent training these layers by initializing them as convolutional layers. This enables us to transition smoothly from any pre-trained CNN to its functionally identical hybrid model, called Transformed CNN (T-CNN). With only 50 epochs of fine-tuning, the resulting T-CNNs demonstrate significant performance gains over the CNN (+2.2% top-1 on ImageNet-1k for a ResNet50-RS) as well as substantially improved robustness (+11% top-1 on ImageNet-C). We analyze the representations learnt by the T-CNN, providing deeper insights into the fruitful interplay between convolutions and self-attention. Finally, we experiment with initializing the T-CNN from a partially trained CNN, and find that it reaches better performance than the corresponding hybrid model trained from scratch, while reducing training time.
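The abstract does not spell out how a self-attention layer can be initialized to act as a convolution; the sketch below illustrates one standard construction (in the spirit of the known result that multi-head self-attention with relative positional attention can express a convolution, as in Cordonnier et al. and the gated positional attention of ConViT): each of the K*K heads is initialized to attend to one offset of a K x K neighbourhood, so the layer behaves like a K x K convolution at initialization and can shift towards content-based attention during fine-tuning. The class and parameter names (ConvInitSelfAttention, locality_strength, gate) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (PyTorch): self-attention that is purely positional at
# initialisation, so it starts out acting like a K x K convolution.
# Names and hyper-parameters are illustrative, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvInitSelfAttention(nn.Module):
    """Self-attention with K*K heads, each initialised to attend to one
    fixed offset of a K x K neighbourhood (hypothetical illustration)."""

    def __init__(self, dim, kernel_size=3, locality_strength=10.0):
        super().__init__()
        self.num_heads = kernel_size ** 2            # one head per kernel offset
        assert dim % self.num_heads == 0, "dim must be divisible by kernel_size**2"
        self.dim, self.kernel_size = dim, kernel_size
        self.head_dim = dim // self.num_heads
        self.qk = nn.Linear(dim, 2 * dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Gate between positional and content attention; a large negative init
        # makes sigmoid(gate) ~ 0, i.e. the layer starts almost purely convolutional.
        self.gate = nn.Parameter(torch.full((self.num_heads,), -4.0))
        self.locality_strength = locality_strength

    def _positional_scores(self, H, W, device):
        # Relative offsets between every pair of the H*W positions.
        ys, xs = torch.meshgrid(torch.arange(H, device=device),
                                torch.arange(W, device=device), indexing="ij")
        coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2)      # (N, 2)
        rel = coords[:, None, :] - coords[None, :, :]              # (N, N, 2)
        # Centre of each head: one offset of the K x K kernel.
        k = torch.arange(self.kernel_size, device=device) - self.kernel_size // 2
        cy, cx = torch.meshgrid(k, k, indexing="ij")
        centers = torch.stack([cy, cx], dim=-1).reshape(-1, 2)     # (heads, 2)
        # Score = -strength * squared distance to the head's preferred offset,
        # so softmax concentrates on that offset when strength is large.
        dist = ((rel[None] - centers[:, None, None, :]) ** 2).sum(-1).float()
        return -self.locality_strength * dist                      # (heads, N, N)

    def forward(self, x):        # x: (B, N, dim), N = H * W (square map assumed)
        B, N, _ = x.shape
        H = W = int(N ** 0.5)
        q, k = self.qk(x).chunk(2, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v(x).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        content = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        positional = F.softmax(self._positional_scores(H, W, x.device), dim=-1)
        g = torch.sigmoid(self.gate).view(1, -1, 1, 1)
        attn = (1.0 - g) * positional + g * content   # convex mix, rows sum to 1
        out = (attn @ v).transpose(1, 2).reshape(B, N, self.dim)
        return self.proj(out)


# Usage: a 14x14 feature map with 144 channels (144 = 16 * 9 heads).
x = torch.randn(2, 196, 144)
y = ConvInitSelfAttention(dim=144, kernel_size=3)(x)
print(y.shape)  # torch.Size([2, 196, 144])
```

At initialization, each head's softmax is sharply peaked on the query position shifted by that head's offset, so the layer gathers a K x K neighbourhood and applies shared linear maps to it, which is functionally what a convolution does; the value/projection weights could then be copied from a pre-trained convolution, consistent with the "functionally identical" transition described in the abstract.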