Recent progress in vision Transformers has demonstrated great success in various tasks, driven by a new spatial modeling mechanism based on dot-product self-attention. In this paper, we show that the key ingredients behind vision Transformers, namely input-adaptive, long-range and high-order spatial interactions, can also be efficiently implemented with a convolution-based framework. We present the Recursive Gated Convolution ($\textit{g}^\textit{n}$Conv), which performs high-order spatial interactions with gated convolutions and recursive designs. The new operation is highly flexible and customizable: it is compatible with various variants of convolution and extends the second-order interactions in self-attention to arbitrary orders without introducing significant extra computation. $\textit{g}^\textit{n}$Conv can serve as a plug-and-play module to improve various vision Transformers and convolution-based models. Based on this operation, we construct a new family of generic vision backbones named HorNet. Extensive experiments on ImageNet classification, COCO object detection and ADE20K semantic segmentation show that HorNet outperforms Swin Transformers and ConvNeXt by a significant margin with similar overall architectures and training configurations. HorNet also shows favorable scalability to more training data and larger model sizes. Beyond its effectiveness in visual encoders, we also show that $\textit{g}^\textit{n}$Conv can be applied to task-specific decoders and consistently improves dense prediction performance with less computation. Our results demonstrate that $\textit{g}^\textit{n}$Conv can be a new basic module for visual modeling that effectively combines the merits of both vision Transformers and CNNs. Code is available at https://github.com/raoyongming/HorNet
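To make the recursive gating idea concrete, below is a minimal NumPy sketch of a $\textit{g}^\textit{n}$Conv-style operation. It is an illustration under assumptions, not the official implementation: random matrices stand in for learned pointwise projections, a hand-rolled 3x3 depthwise convolution stands in for the paper's depthwise filter, and the doubling channel split across orders follows the abstract's description of extending gated interactions to arbitrary orders.

```python
import numpy as np

rng = np.random.default_rng(0)

def dwconv3x3(x, w):
    """Depthwise 3x3 convolution with zero padding. x: (C, H, W), w: (C, 3, 3)."""
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += w[:, i, j][:, None, None] * xp[:, i:i + H, j:j + W]
    return out

def conv1x1(x, w):
    """Pointwise (1x1) projection. x: (C_in, H, W), w: (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', w, x)

def gnconv(x, order=3):
    """Sketch of recursive gated convolution with `order` levels of interaction.

    Channel widths double at each order (an assumption for this sketch), so the
    recursion adds higher-order spatial interactions at modest extra cost.
    """
    C, H, W = x.shape
    dims = [C // 2 ** (order - 1 - k) for k in range(order)]  # e.g. [C/4, C/2, C]
    total = sum(dims)

    # Input projection: C -> dims[0] (content branch) + total (gating branches).
    w_in = rng.standard_normal((dims[0] + total, C)) / np.sqrt(C)
    y = conv1x1(x, w_in)
    p, q = y[:dims[0]], y[dims[0]:]

    # One shared depthwise conv over all gating branches at once.
    w_dw = rng.standard_normal((total, 3, 3)) / 3.0
    qs = np.split(dwconv3x3(q, w_dw), np.cumsum(dims)[:-1], axis=0)

    # Recursive gating: multiply the running feature by the next spatial branch,
    # projecting channels up at each step (p_{k+1} = q_k * proj_k(p_k)).
    out = p * qs[0]
    for k in range(1, order):
        w_k = rng.standard_normal((dims[k], dims[k - 1])) / np.sqrt(dims[k - 1])
        out = qs[k] * conv1x1(out, w_k)

    # Output projection back to C channels.
    w_out = rng.standard_normal((C, dims[-1])) / np.sqrt(dims[-1])
    return conv1x1(out, w_out)

x = rng.standard_normal((8, 5, 5))
y = gnconv(x, order=3)
```

Note how each recursion step is only an elementwise product plus a pointwise projection, which is why raising the interaction order does not significantly increase computation relative to the base convolution.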