We analyzed the network structure of real-time object detection models and found that the features at the feature concatenation stage are very rich. Applying an attention module at this stage can effectively improve the detection accuracy of the model. However, commonly used attention modules and self-attention modules perform poorly in terms of detection accuracy and inference efficiency. Therefore, we propose a novel self-attention module, called 2D local feature superimposed self-attention, for the feature concatenation stage of the neck network. This self-attention module reflects global features through local features and local receptive fields. We also propose and optimize an efficient decoupled head and AB-OTA, achieving SOTA results. Average precisions of 49.0\% (66.2 FPS), 46.1\% (80.6 FPS), and 39.1\% (100 FPS) were obtained for the large, medium, and small-scale models built using our proposed improvements. Our models exceeded YOLOv5 by 0.8\%--3.1\% in average precision.
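The abstract does not detail the module's internals. As a rough illustration of the underlying idea (self-attention restricted to local receptive fields, so that global context is built up from local features), here is a minimal PyTorch sketch. The class name LocalWindowSelfAttention, the window size, and the head count are illustrative assumptions, not the paper's actual 2D local feature superimposed self-attention design.

```python
import torch
import torch.nn as nn

class LocalWindowSelfAttention(nn.Module):
    """Illustrative sketch: multi-head self-attention computed inside
    non-overlapping w x w windows, so cost scales with the window size
    rather than the full feature-map resolution (hypothetical design,
    not the paper's module)."""

    def __init__(self, channels: int, window: int = 4, heads: int = 4):
        super().__init__()
        assert channels % heads == 0
        self.window, self.heads = window, heads
        self.scale = (channels // heads) ** -0.5
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        ws, nh, d = self.window, self.heads, c // self.heads
        # Assumes h and w are divisible by the window size (pad otherwise).
        q, k, v = self.qkv(x).chunk(3, dim=1)

        def to_windows(t):
            # (B, C, H, W) -> (B * num_windows, heads, ws*ws, d)
            t = t.view(b, nh, d, h // ws, ws, w // ws, ws)
            t = t.permute(0, 3, 5, 1, 4, 6, 2)
            return t.reshape(-1, nh, ws * ws, d)

        q, k, v = map(to_windows, (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * self.scale  # window-local scores
        out = attn.softmax(dim=-1) @ v                 # (B*nW, heads, ws*ws, d)

        # Undo the windowing back to (B, C, H, W) and project.
        out = out.view(b, h // ws, w // ws, nh, ws, ws, d)
        out = out.permute(0, 3, 6, 1, 4, 2, 5).reshape(b, c, h, w)
        return x + self.proj(out)                      # residual connection

# Example: applied to a 256-channel map such as a neck concatenation output.
feats = torch.randn(1, 256, 32, 32)
print(LocalWindowSelfAttention(256)(feats).shape)  # torch.Size([1, 256, 32, 32])
```

Restricting attention to windows keeps the memory and compute cost linear in the number of windows rather than quadratic in the full spatial resolution, which is one plausible way to reconcile attention with real-time inference at the neck's concatenation stage.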