In pursuit of ever-higher accuracy, large and complex neural networks are usually developed. Such models demand substantial computational resources and therefore cannot be deployed on edge devices. Building resource-efficient general-purpose networks is of great interest because they are useful in several application areas. In this work, we strive to effectively combine the strengths of both CNN and Transformer models and propose a new efficient hybrid architecture, EdgeNeXt. Specifically, in EdgeNeXt we introduce a split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups and uses depth-wise convolution along with self-attention across channel dimensions to implicitly increase the receptive field and encode multi-scale features. Our extensive experiments on classification, detection, and segmentation tasks reveal the merits of the proposed approach, which outperforms state-of-the-art methods with comparatively lower compute requirements. Our EdgeNeXt model with 1.3M parameters achieves 71.2\% top-1 accuracy on ImageNet-1K, outperforming MobileViT with an absolute gain of 2.2\% and a 28\% reduction in FLOPs. Further, our EdgeNeXt model with 5.6M parameters achieves 79.4\% top-1 accuracy on ImageNet-1K. The code and models are publicly available at https://t.ly/_Vu9.
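To make the two ideas named for the SDTA encoder more concrete, the following is a minimal sketch of (a) splitting features into channel groups and (b) self-attention across the channel dimension. It is not taken from the released code. The helper names (`split_channels`, `channel_attention`), the use of the input itself as queries, keys, and values (no learned projections), and the `sqrt(N)` scaling are all assumptions made for illustration, and the depth-wise convolution step is omitted entirely.

```python
import math

def split_channels(x, groups):
    """Split a tokens-by-channels matrix into `groups` equal channel groups."""
    channels = len(x[0])
    assert channels % groups == 0, "channels must divide evenly into groups"
    width = channels // groups
    return [[row[g * width:(g + 1) * width] for row in x] for g in range(groups)]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]

def channel_attention(q, k, v):
    """Attention across channels: the attention map is C x C, not N x N.

    q, k, v are N x C matrices (N tokens, C channels). The sqrt(N)
    scaling is an illustrative assumption, not the paper's choice.
    """
    n_tokens = len(q)
    scores = matmul(transpose(q), k)  # C x C channel-to-channel scores
    attn = [softmax([s / math.sqrt(n_tokens) for s in row]) for row in scores]
    # out[n][i] = sum_j attn[i][j] * v[n][j]
    return matmul(v, transpose(attn))  # N x C

# Toy input: 4 tokens (flattened spatial positions) with 6 channels each.
x = [[0.1 * (n + c) for c in range(6)] for n in range(4)]
groups = split_channels(x, groups=3)   # 3 groups of 2 channels each
out = channel_attention(x, x, x)       # identity projections (assumption)
print(len(groups), len(out), len(out[0]))  # 3 4 6
```

Because the attention map in this sketch is C x C, its size depends on the number of channels rather than on the number of tokens, which is the general appeal of attending across channels when the input resolution is high.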