CNN-transformer mixed model for object detection

Object detection, one of the three main tasks of computer vision, is used in a wide range of applications. The usual pipeline uses a deep neural network to extract features from an image and then uses those features to identify the class and location of each object. Improving the network's feature extraction is therefore the main route to higher detection accuracy. In this paper, I propose a convolutional module with a transformer[1]. It aims to improve recognition accuracy by fusing the detailed features extracted by a CNN[2] with the global features extracted by a transformer, and it significantly reduces the computational cost of the transformer module by shrinking the feature map. The module executes in four steps: convolutional downsampling to reduce the feature map size, self-attention calculation, upsampling, and finally concatenation with the initial input. In the experiments, after appending the block to the end of YOLOv5n[3] and training for 300 epochs on the COCO dataset, the mAP improved by 1.7% over the original YOLOv5n, and the mAP curve showed no sign of saturation, so there is still potential for improvement. After 100 epochs of training on the Pascal VOC dataset, the accuracy reached 81%, which is 4.6 points higher than Faster R-CNN[4] with ResNet-101[5] as the backbone, while using fewer than one-twentieth of its parameters.
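
The four execution steps listed in the abstract (downsample, self-attention, upsample, concatenate) can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's implementation: the class name `ConvAttentionBlock`, the 3x3 kernel, the stride of 2, the head count, and the nearest-neighbour upsampling are all assumptions chosen for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvAttentionBlock(nn.Module):
    """Hypothetical sketch: downsample -> self-attention -> upsample -> concat."""

    def __init__(self, channels: int, stride: int = 2, num_heads: int = 4):
        super().__init__()
        # Assumed design: a strided 3x3 convolution shrinks H and W by `stride`,
        # so attention runs on fewer tokens than the full-resolution map.
        self.down = nn.Conv2d(channels, channels, kernel_size=3,
                              stride=stride, padding=1)
        # Standard multi-head self-attention over the flattened feature map.
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        y = self.down(x)                          # (B, C, h', w')
        hs, ws = y.shape[2:]
        tokens = y.flatten(2).transpose(1, 2)     # (B, h'*w', C)
        tokens, _ = self.attn(tokens, tokens, tokens)
        y = tokens.transpose(1, 2).reshape(b, c, hs, ws)
        # Restore the input resolution so the two branches can be concatenated.
        y = F.interpolate(y, size=(h, w), mode="nearest")
        # Fuse local (CNN) features with global (attention) features.
        return torch.cat([x, y], dim=1)           # (B, 2C, H, W)


if __name__ == "__main__":
    block = ConvAttentionBlock(channels=64)
    features = torch.randn(1, 64, 32, 32)
    print(block(features).shape)
```

With a stride of 2, a 32x32 map yields 256 tokens instead of 1024, so the token-by-token attention matrix has 16 times fewer entries. The output has twice the input channels, so any layer that follows the block would need to accept 2C channels.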
