Recently, transformers have shown great superiority in solving computer vision tasks by modeling an image as a sequence of manually-split patches with the self-attention mechanism. However, current architectures of vision transformers (ViTs) are simply inherited from natural language processing (NLP) tasks and have not been sufficiently investigated or optimized. In this paper, we take a further step by examining the intrinsic structure of transformers for vision tasks and propose an architecture search method, dubbed ViTAS, to search for the optimal architecture under comparable hardware budgets. Concretely, we design a new effective yet efficient weight-sharing paradigm for ViTs, such that architectures with different token embedding dimensions, sequence sizes, numbers of heads, widths, and depths can be derived from a single super-transformer. Moreover, to accommodate the variance among distinct architectures, we introduce \textit{private} class tokens and self-attention maps in the super-transformer. In addition, to adapt the search to different budgets, we propose to search the sampling probability of the identity operation. Experimental results show that our ViTAS attains excellent results compared with existing pure transformer architectures. For example, with a $1.3$G FLOPs budget, our searched architecture achieves $74.7\%$ top-$1$ accuracy on ImageNet, which is $2.5\%$ higher than the current baseline ViT architecture. Code is available at \url{https://github.com/xiusu/ViTAS}.
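To make the weight-sharing idea concrete, the following is a minimal, hedged sketch of how a sub-transformer might be sampled from a single super-transformer, with each block replaced by an identity operation according to a searched sampling probability. All names, value ranges, and the dictionary-based encoding below are illustrative assumptions for exposition, not the paper's actual implementation (see the released code at the URL above for that).

\begin{verbatim}
import random

# Hypothetical search space, loosely following the dimensions named in the
# abstract (token embedding, number of heads, width, depth). The concrete
# value ranges here are assumptions, not the paper's.
SEARCH_SPACE = {
    "embed_dim": [192, 256, 320, 384],
    "num_heads": [3, 4, 6],
    "mlp_ratio": [2.0, 3.0, 4.0],   # width of the feed-forward blocks
    "max_depth": 14,
}

def sample_subtransformer(p_identity: float, rng: random.Random) -> dict:
    """Sample one sub-transformer from the shared super-transformer.

    Each block is independently replaced by an identity operation with
    probability `p_identity`; searching this probability is how the
    sampled depth can be adapted to a given FLOPs budget.
    """
    embed_dim = rng.choice(SEARCH_SPACE["embed_dim"])
    blocks = []
    for _ in range(SEARCH_SPACE["max_depth"]):
        if rng.random() < p_identity:
            blocks.append({"op": "identity"})
        else:
            blocks.append({
                "op": "transformer_block",
                "num_heads": rng.choice(SEARCH_SPACE["num_heads"]),
                "mlp_ratio": rng.choice(SEARCH_SPACE["mlp_ratio"]),
            })
    # A "private" class token / self-attention map would be keyed by the
    # sampled width, so architectures of different widths do not share them.
    return {"embed_dim": embed_dim, "class_token_id": embed_dim, "blocks": blocks}

if __name__ == "__main__":
    rng = random.Random(0)
    arch = sample_subtransformer(p_identity=0.3, rng=rng)
    depth = sum(b["op"] != "identity" for b in arch["blocks"])
    print(f"sampled depth={depth}, embed_dim={arch['embed_dim']}")
\end{verbatim}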