Vision Transformer (ViT) demonstrates that the Transformer architecture from natural language processing can be applied to computer vision tasks and achieve performance comparable to convolutional neural networks (CNNs), which have been studied and adopted in computer vision for years. This naturally raises the question of how the performance of ViT can be advanced with design techniques from CNNs. To this end, we incorporate two techniques and present ViT-ResNAS, an efficient multi-stage ViT architecture designed with neural architecture search (NAS). First, we propose residual spatial reduction, which decreases sequence lengths for deeper layers and yields a multi-stage architecture. When reducing lengths, we add skip connections to improve performance and stabilize the training of deeper networks. Second, we propose weight-sharing NAS with multi-architectural sampling. We enlarge a network and use its sub-networks to define a search space. A super-network covering all sub-networks is then trained for fast evaluation of their performance. To train the super-network efficiently, we sample and train multiple sub-networks with a single forward-backward pass. Afterwards, evolutionary search is performed to discover high-performance network architectures. Experiments on ImageNet demonstrate that ViT-ResNAS achieves better accuracy-MACs and accuracy-throughput trade-offs than the original DeiT and other strong ViT baselines. Code is available at https://github.com/yilunliao/vit-search.
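The residual spatial reduction idea can be illustrated with a minimal sketch: a token sequence laid out on a 2D grid is downsampled (halving each spatial dimension, so the sequence length shrinks 4x), while a skip branch is added around the transformation. This is a hypothetical simplification using average pooling and linear projections; the actual module in the paper may use strided convolutions and full Transformer blocks.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_spatial_reduction(tokens, h, w, d_out):
    """Halve the spatial resolution of an (h*w, d_in) token grid and add a
    skip connection around the reduction (illustrative sketch only)."""
    d_in = tokens.shape[1]
    grid = tokens.reshape(h, w, d_in)
    # 2x2 average pooling: sequence length drops from h*w to (h/2)*(w/2).
    pooled = grid.reshape(h // 2, 2, w // 2, 2, d_in).mean(axis=(1, 3))
    pooled = pooled.reshape(-1, d_in)                    # (h*w/4, d_in)
    # Main branch: a linear projection stands in for the block's
    # attention/MLP layers in this sketch.
    w_main = rng.standard_normal((d_in, d_out)) * 0.02
    # Skip branch: projects the pooled input so shapes match for the add.
    w_skip = rng.standard_normal((d_in, d_out)) * 0.02
    return pooled @ w_main + pooled @ w_skip             # residual add

x = rng.standard_normal((196, 64))   # 14x14 grid of 64-dim tokens
y = residual_spatial_reduction(x, 14, 14, 128)
print(y.shape)  # → (49, 128)
```

The skip connection lets gradients bypass the reduction block, which is what the abstract credits for stabilizing the training of deeper multi-stage networks.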
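The weight-sharing scheme with multi-architectural sampling can likewise be sketched in miniature: sub-networks reuse slices of a shared super-network parameter, and several sub-networks are sampled and run on the same batch before one (conceptual) parameter update. The slicing convention and the sampled widths here are assumptions for illustration; the paper's search space and sampling procedure are more elaborate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared super-network weight; each sub-network uses a leading slice
# (a common weight-sharing convention, assumed here for illustration).
W_super = rng.standard_normal((64, 64)) * 0.02

def subnet_forward(x, width):
    """Forward pass of a sub-network that keeps only `width` output units."""
    return x @ W_super[:, :width]

# Multi-architectural sampling: several sub-networks are evaluated on the
# same batch, so their losses can share one forward-backward pass.
x = rng.standard_normal((4, 64))
widths = [16, 32, 64]                      # hypothetical sampled widths
outs = [subnet_forward(x, w) for w in widths]
print([o.shape for o in outs])  # → [(4, 16), (4, 32), (4, 64)]
```

Because all sub-networks read from `W_super`, training any sampled architecture updates the shared weights, which is what makes the super-network a fast proxy for ranking candidates during the subsequent evolutionary search.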