The three dominant network families, i.e., CNNs, Transformers, and MLPs, differ from each other mainly in how they fuse spatial contextual information, which places the design of more effective token-mixing mechanisms at the core of backbone architecture development. In this work, we propose an innovative token mixer, dubbed Active Token Mixer (ATM), that actively incorporates flexible contextual information distributed across different channels of other tokens into the given query token. This fundamental operator actively predicts where to capture useful contexts and learns how to fuse the captured contexts with the query token at the channel level. In this way, the spatial range of token mixing is expanded to a global scope with limited computational complexity, reforming how token mixing is performed. Taking ATM as the primary operator, we assemble ATMs into a cascade architecture, dubbed ATMNet. Extensive experiments demonstrate that ATMNet is generally applicable and comprehensively surpasses SOTA vision backbones from different families by a clear margin on a broad range of vision tasks, including visual recognition and dense prediction tasks. Code is available at https://github.com/microsoft/ActiveMLP.
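To make the "predict where to look, then fuse channel-wise" idea concrete, the sketch below is a minimal, hedged illustration in PyTorch. It is not the authors' implementation (see the linked repository for that): the tensor layout (B, H, W, C), the module/class names, the restriction to a single horizontal direction, and the use of rounded offsets with `torch.gather` are all simplifying assumptions made only for illustration.

```python
# Minimal sketch of an "active token mixer"-style operator (illustrative only).
# Assumptions: input layout (B, H, W, C); one spatial direction; hard (rounded)
# per-channel offsets gathered with torch.gather; gated channel-wise fusion.
import torch
import torch.nn as nn


class ActiveTokenMixerSketch(nn.Module):
    """For each query token, predict a per-channel offset along the width axis,
    gather the context token at that offset, and fuse it with the query
    channel by channel (a single-direction simplification)."""

    def __init__(self, dim):
        super().__init__()
        self.offset = nn.Linear(dim, dim)  # where to look, per channel
        self.fuse = nn.Linear(dim, dim)    # how to mix, per channel
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                  # x: (B, H, W, C)
        B, H, W, C = x.shape
        # Predict a signed offset along the width axis for every channel.
        off = torch.tanh(self.offset(x)) * (W - 1)            # (B, H, W, C)
        base = torch.arange(W, device=x.device).view(1, 1, W, 1)
        idx = (base + off.round().long()).clamp(0, W - 1)     # absolute index
        # Gather the selected context token independently per channel.
        ctx = torch.gather(x, dim=2, index=idx)               # (B, H, W, C)
        # Channel-wise gated fusion of context and query, then projection.
        gate = torch.sigmoid(self.fuse(x))
        return self.proj(gate * ctx + (1 - gate) * x)


if __name__ == "__main__":
    x = torch.randn(2, 14, 14, 64)
    print(ActiveTokenMixerSketch(64)(x).shape)  # torch.Size([2, 14, 14, 64])
```

Because each channel can reach any position along the mixed axis, the receptive field is global in that direction, while the cost stays linear in the number of tokens rather than quadratic as in self-attention; the actual ATM operator in the repository realizes this idea in a more refined, differentiable form.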