We introduce the first Neural Architecture Search (NAS) method to find a better transformer architecture for image recognition. Recently, transformers without CNN-based backbones have been found to achieve impressive performance on image recognition. However, the transformer was designed for NLP tasks and can therefore be sub-optimal when applied directly to image recognition. To improve the visual representation ability of transformers, we propose a new search space and search algorithm. Specifically, we introduce a locality module that explicitly models local correlations in images at a lower computational cost. With the locality module, our search space is defined so that the search algorithm can freely trade off between global and local information while also optimizing the low-level design choices within each module. To tackle the huge search space, we propose a hierarchical neural architecture search method that searches for the optimal vision transformer at two levels separately with an evolutionary algorithm. Extensive experiments on the ImageNet dataset demonstrate that our method finds transformer variants that are more discriminative and efficient than the ResNet family (e.g., ResNet101) and the baseline ViT for image classification.
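A minimal sketch of one plausible form of the locality module mentioned above, assuming it is realized as a depthwise-separable convolution over the patch-token grid placed alongside the attention pathway; the class name, parameters, and residual wiring here are illustrative assumptions and are not taken from the paper.

```python
import torch
import torch.nn as nn


class LocalityModule(nn.Module):
    """Models local correlations among patch tokens with a cheap depthwise conv."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        # Depthwise conv captures local spatial correlations at low cost
        # (one filter per channel); a pointwise conv then mixes channels.
        self.depthwise = nn.Conv2d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.pointwise = nn.Conv2d(dim, dim, kernel_size=1)
        self.norm = nn.BatchNorm2d(dim)
        self.act = nn.GELU()

    def forward(self, tokens: torch.Tensor, grid_size: int) -> torch.Tensor:
        # tokens: (batch, num_patches, dim) -> reshape into a 2D feature map.
        b, n, d = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, d, grid_size, grid_size)
        x = self.act(self.norm(self.pointwise(self.depthwise(x))))
        # Residual connection keeps the global (attention) pathway intact,
        # so the search can freely mix local and global branches.
        return tokens + x.reshape(b, d, n).transpose(1, 2)


# Usage: 196 patch tokens from a 14x14 grid, embedding dimension 384.
tokens = torch.randn(2, 196, 384)
out = LocalityModule(dim=384)(tokens, grid_size=14)
print(out.shape)  # torch.Size([2, 196, 384])
```

In a search space of this kind, blocks like the sketch above and standard self-attention blocks would be candidate operations that the hierarchical evolutionary search selects among.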