Since the recent success of Vision Transformers (ViTs), explorations of transformer-style architectures have triggered a resurgence of modern ConvNets. In this work, we explore the representation ability of DNNs through the lens of interaction complexities. We empirically show that interaction complexity is an overlooked but essential indicator for visual recognition. Accordingly, we present a new family of efficient ConvNets, named MogaNet, to pursue informative context mining in pure ConvNet-based models with favorable complexity-performance trade-offs. In MogaNet, interactions across multiple complexities are facilitated and contextualized by two specially designed aggregation blocks operating in the spatial and channel interaction spaces, respectively. Extensive studies are conducted on ImageNet classification, COCO object detection, and ADE20K semantic segmentation. The results demonstrate that MogaNet establishes new state-of-the-art results over other popular methods across mainstream scenarios and all model scales. Notably, the lightweight MogaNet-T achieves 80.0\% top-1 accuracy with only 1.44G FLOPs using a refined training setup on ImageNet-1K, surpassing ParC-Net-S by 1.4\% accuracy while saving 59\% (2.04G) of FLOPs.