While local-window self-attention performs notably in vision tasks, it suffers from a limited receptive field and weak modeling capability, mainly because it performs self-attention within non-overlapping windows and shares weights along the channel dimension. We propose MixFormer to address these issues. First, we combine local-window self-attention with depth-wise convolution in a parallel design, modeling cross-window connections to enlarge the receptive field. Second, we propose bi-directional interactions across the two branches to provide complementary clues in the channel and spatial dimensions. These two designs are integrated to achieve efficient feature mixing among windows and dimensions. MixFormer achieves results competitive with EfficientNet on image classification and outperforms RegNet and Swin Transformer. On downstream tasks, it outperforms its alternatives by significant margins at lower computational cost across five dense prediction tasks on MS COCO, ADE20k, and LVIS. Code is available at \url{https://github.com/PaddlePaddle/PaddleClas}.
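To make the parallel design and the bi-directional interactions concrete, below is a minimal PyTorch sketch of one mixing block. It is an illustrative approximation under our own assumptions: the module name `MixingBlockSketch`, the gate shapes, the 1x1 reduction ratios, and `window_size=7` are hypothetical choices, not the released PaddleClas implementation.

```python
# Minimal sketch: parallel window attention + depth-wise conv with bi-directional
# interactions (channel gate conv->attn, spatial gate attn->conv). Illustrative only.
import torch
import torch.nn as nn


def window_partition(x, ws):
    """Split a (B, H, W, C) map into non-overlapping (B*nW, ws*ws, C) windows."""
    B, H, W, C = x.shape
    x = x.reshape(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)


def window_reverse(wins, ws, H, W):
    """Inverse of window_partition, back to (B, H, W, C)."""
    B = wins.shape[0] // ((H // ws) * (W // ws))
    x = wins.reshape(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)


class MixingBlockSketch(nn.Module):
    """Hypothetical mixing block: attention and depth-wise conv branches in parallel."""

    def __init__(self, dim, num_heads=4, window_size=7):
        super().__init__()
        self.ws = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.dwconv = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),  # depth-wise 3x3
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )
        # channel interaction: squeeze conv-branch features into per-channel gates
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // 4, 1), nn.GELU(),
            nn.Conv2d(dim // 4, dim, 1), nn.Sigmoid(),
        )
        # spatial interaction: project attn-branch features into a spatial gate
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(dim, dim // 4, 1), nn.GELU(),
            nn.Conv2d(dim // 4, 1, 1), nn.Sigmoid(),
        )
        self.proj = nn.Linear(2 * dim, dim)  # fuse the two branches

    def forward(self, x):  # x: (B, C, H, W), H and W divisible by window_size
        B, C, H, W = x.shape
        conv_out = self.dwconv(x)  # local aggregation that crosses window borders

        # window-attention branch, modulated along channels by the conv branch
        attn_in = x * self.channel_gate(conv_out)
        wins = window_partition(attn_in.permute(0, 2, 3, 1), self.ws)
        attn_out, _ = self.attn(wins, wins, wins)
        attn_out = window_reverse(attn_out, self.ws, H, W).permute(0, 3, 1, 2)

        # conv branch, modulated along space by the attention branch
        conv_out = conv_out * self.spatial_gate(attn_out)

        fused = torch.cat([attn_out, conv_out], dim=1).permute(0, 2, 3, 1)
        return self.proj(fused).permute(0, 3, 1, 2)


if __name__ == "__main__":
    block = MixingBlockSketch(dim=64, num_heads=4, window_size=7)
    out = block(torch.randn(2, 64, 28, 28))
    print(out.shape)  # torch.Size([2, 64, 28, 28])
```

The conv branch supplies cross-window context and channel-wise cues to the attention branch, while the attention branch supplies a spatial gate back to the conv branch, which is the flavor of bi-directional interaction described above; the exact gating layers here are assumptions for illustration.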