We analyzed the network structure of real-time object detection models and found that the features produced at the feature concatenation stage are very rich; applying an attention module here can effectively improve the model's detection accuracy. However, commonly used attention and self-attention modules perform poorly in both detection accuracy and inference efficiency. We therefore propose a novel self-attention module, called 2D local feature superimposed self-attention, for the feature concatenation stage of the neck network. This self-attention module reflects global features through local features and local receptive fields. We also propose and optimize an efficient decoupled head and AB-OTA, and achieve SOTA results. Average precisions of 49.0% (71 FPS, 14 ms), 46.1% (85 FPS, 11.7 ms), and 39.1% (107 FPS, 9.3 ms) were obtained for the large-, medium-, and small-scale models built with our proposed improvements. Our models exceed YOLOv5 by 0.8%--3.1% in average precision.
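To make the core idea concrete, below is a minimal sketch of self-attention restricted to local 2D windows, illustrating how global context can be approximated from local features and local receptive fields. The window size, head count, and residual superposition shown here are illustrative assumptions, not the paper's exact 2D local feature superimposed self-attention.

```python
# Sketch only: window-based local self-attention over a 2D feature map.
# Hyperparameters (window_size, num_heads) are assumed for illustration.
import torch
import torch.nn as nn


class LocalWindowSelfAttention(nn.Module):
    """Self-attention computed independently within non-overlapping 2D windows."""

    def __init__(self, channels: int, window_size: int = 7, num_heads: int = 4):
        super().__init__()
        assert channels % num_heads == 0, "channels must divide evenly by heads"
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map, e.g. from the neck's concatenation stage.
        b, c, h, w = x.shape
        s = self.window_size
        assert h % s == 0 and w % s == 0, "H and W must be multiples of window size"
        # Partition the map into (B * num_windows, s*s, C) token sequences.
        windows = (
            x.view(b, c, h // s, s, w // s, s)
            .permute(0, 2, 4, 3, 5, 1)
            .reshape(-1, s * s, c)
        )
        # Attention within each local window (a local receptive field).
        out, _ = self.attn(windows, windows, windows)
        # Residually superimpose the attended local features onto the input.
        out = out + windows
        # Reverse the window partition back to (B, C, H, W).
        return (
            out.view(b, h // s, w // s, s, s, c)
            .permute(0, 5, 1, 3, 2, 4)
            .reshape(b, c, h, w)
        )


if __name__ == "__main__":
    feat = torch.randn(1, 256, 28, 28)  # a hypothetical neck feature map
    print(LocalWindowSelfAttention(256)(feat).shape)  # torch.Size([1, 256, 28, 28])
```

Because attention is computed only within s×s windows, the cost scales linearly with the number of windows rather than quadratically with H×W, which is the kind of accuracy/efficiency trade-off the abstract targets for real-time detection.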