Convolution and self-attention are two powerful techniques for representation learning, and they are usually considered as two peer approaches that are distinct from each other. In this paper, we show that there exists a strong underlying relation between them, in the sense that the bulk of computation in these two paradigms is in fact done with the same operation. Specifically, we first show that a traditional convolution with kernel size k x k can be decomposed into k^2 individual 1x1 convolutions, followed by shift and summation operations. Then, we interpret the projections of queries, keys, and values in the self-attention module as multiple 1x1 convolutions, followed by the computation of attention weights and the aggregation of values. Therefore, the first stage of both modules comprises the same type of operation. More importantly, the first stage dominates the computational complexity (quadratic in the channel size) compared with the second stage. This observation naturally leads to an elegant integration of these two seemingly distinct paradigms, i.e., a mixed model that enjoys the benefits of both self-Attention and Convolution (ACmix), while incurring minimal computational overhead compared with pure convolution or self-attention counterparts. Extensive experiments show that our model achieves consistently improved results over competitive baselines on image recognition and downstream tasks. Code and pre-trained models will be released at https://github.com/Panxuran/ACmix and https://gitee.com/mindspore/models.
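The decomposition stated above can be checked numerically. The following is a minimal PyTorch sketch, not the paper's released implementation: all tensor sizes (B, C_in, C_out, H, W, k) are arbitrary illustrative values, and the shift is realized by slicing a zero-padded copy of each 1x1-projected feature map. It verifies that a k x k convolution equals the sum of k^2 shifted 1x1 convolutions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative sizes (assumptions, not taken from the paper).
B, C_in, C_out, H, W, k = 2, 4, 8, 7, 7, 3
pad = k // 2

x = torch.randn(B, C_in, H, W)
weight = torch.randn(C_out, C_in, k, k)

# Reference: a standard k x k convolution with "same" padding.
y_ref = F.conv2d(x, weight, padding=pad)

# Decomposition: k^2 individual 1x1 convolutions, followed by shift and summation.
y_dec = torch.zeros_like(y_ref)
for p in range(k):
    for q in range(k):
        # One 1x1 convolution per kernel position (the "projection" stage).
        w_1x1 = weight[:, :, p, q].unsqueeze(-1).unsqueeze(-1)  # (C_out, C_in, 1, 1)
        proj = F.conv2d(x, w_1x1)
        # Shift the projected map by the kernel offset, via a zero-padded slice,
        # and accumulate into the output (the "shift and summation" stage).
        proj_pad = F.pad(proj, (pad, pad, pad, pad))
        y_dec += proj_pad[:, :, p:p + H, q:q + W]

print(torch.allclose(y_ref, y_dec, atol=1e-5))  # True: both computations agree
```

The same 1x1 projections would, in the self-attention branch, produce the query, key, and value maps, which is the shared first stage the abstract refers to; only the second (aggregation) stage differs between the two paradigms.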