Vision transformers (ViTs) inherited the success of NLP, but their structures have not been sufficiently investigated and optimized for visual tasks. One of the simplest solutions is to directly search for the optimal structure via neural architecture search (NAS), which is widely used for CNNs. However, we empirically find that this straightforward adaptation encounters catastrophic failures and frustratingly unstable training of the superformer. In this paper, we argue that since ViTs mainly operate on token embeddings with little inductive bias, the imbalance of channels across different architectures worsens the weight-sharing assumption and consequently causes training instability. Therefore, we develop a new cyclic weight-sharing mechanism for the token embeddings of ViTs, which enables each channel to contribute more evenly to all candidate architectures. Besides, we also propose identity shifting to alleviate the many-to-one issue in the superformer, and empirically leverage weak augmentation and regularization techniques for steadier training. Based on these, our proposed method, ViTAS, achieves significant superiority on both DeiT- and Twins-based ViTs. For example, with only a $1.4$G FLOPs budget, our searched architecture attains $3.3\%$ higher ImageNet-$1$k accuracy than the baseline DeiT. With $3.0$G FLOPs, our results achieve $82.0\%$ accuracy on ImageNet-$1$k and $45.9\%$ mAP on COCO$2017$, which is $2.4\%$ higher than other ViTs.
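For intuition, a minimal PyTorch sketch of the channel-sharing idea follows. The `cyclic_channels` mapping, its stride formula, and the `SharedEmbedding` module are illustrative assumptions rather than the paper's exact implementation: the point is only that a cyclic index pattern spreads training signal across all supernet channels, whereas conventional ordinal slicing concentrates it on the lowest-indexed ones.

```python
import torch
import torch.nn as nn


def ordinal_channels(total: int, sub: int) -> torch.Tensor:
    # Conventional weight sharing: a sub-architecture with `sub` channels
    # always reuses the first `sub` channels of the supernet, so low-index
    # channels are trained far more often than high-index ones.
    return torch.arange(sub)


def cyclic_channels(total: int, sub: int) -> torch.Tensor:
    # Hypothetical cyclic mapping (an assumption, not the paper's exact rule):
    # stride through the full channel range so that every channel is selected
    # by roughly the same number of candidate widths.
    stride = total / sub
    return (torch.arange(sub) * stride).long() % total


class SharedEmbedding(nn.Module):
    """Token-embedding layer whose output channels are sliced per candidate."""

    def __init__(self, in_dim: int, max_dim: int):
        super().__init__()
        # One shared projection holds the weights of all candidate widths.
        self.proj = nn.Linear(in_dim, max_dim)

    def forward(self, x: torch.Tensor, sub_dim: int) -> torch.Tensor:
        idx = cyclic_channels(self.proj.out_features, sub_dim)
        return self.proj(x)[..., idx]


# Example: with 8 supernet channels, a 4-channel candidate uses channels
# [0, 2, 4, 6] instead of [0, 1, 2, 3], spreading usage evenly.
print(cyclic_channels(8, 4))  # tensor([0, 2, 4, 6])
```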