Recently, transformer and multi-layer perceptron (MLP) architectures have achieved impressive results on various vision tasks. A few works have investigated manually combining these operators to design visual network architectures and can achieve satisfactory performance to some extent. In this paper, we propose to jointly search the optimal combination of convolution, transformer, and MLP for building a series of all-operator network architectures with high performance on visual tasks. We empirically identify that the widely-used strided-convolution or pooling based down-sampling modules become the performance bottleneck when the operators are combined to form a network. To better handle the global context captured by the transformer and MLP operators, we propose two novel context-aware down-sampling modules, which can better adapt to the global information encoded by these operators. To this end, we jointly search all operators and down-sampling modules in a unified search space. Notably, our searched network, UniNet (Unified Network), outperforms the state-of-the-art pure convolution-based architecture, EfficientNet, and the pure transformer-based architecture, Swin-Transformer, on multiple public visual benchmarks, including ImageNet classification, COCO object detection, and ADE20K semantic segmentation.
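
To make the idea of a unified search space concrete, the following is a minimal, hypothetical sketch: each stage of a candidate network picks one operator type (convolution, transformer, or MLP) and one down-sampling module, and a sampler draws architectures from the joint space. The names OPERATORS, DOWNSAMPLERS, and sample_architecture are illustrative placeholders, not the paper's actual search space or implementation.

    import random

    # Illustrative choices only; the real search space in the paper differs.
    OPERATORS = ["conv", "transformer", "mlp"]
    DOWNSAMPLERS = ["strided_conv", "avg_pool",
                    "context_aware_dsm_1", "context_aware_dsm_2"]

    def sample_architecture(num_stages=4):
        """Randomly sample one candidate from the joint operator/down-sampling space."""
        return [
            {"operator": random.choice(OPERATORS),
             "downsample": random.choice(DOWNSAMPLERS)}
            for _ in range(num_stages)
        ]

    if __name__ == "__main__":
        print(sample_architecture())

A search algorithm would score many such sampled candidates (e.g., by training or weight-sharing evaluation) and keep the best-performing combination, which is the role the unified search plays in the method described above.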