Recent CNN-based object detectors, whether one-stage methods like YOLO, SSD, and RetinaNet or two-stage detectors like Faster R-CNN, R-FCN, and FPN, are usually fine-tuned directly from ImageNet pre-trained models designed for image classification. There has been little work on backbone feature extractors designed specifically for object detection. More importantly, the tasks of image classification and object detection differ in several ways. (1) Recent object detectors like FPN and RetinaNet usually involve extra stages, compared with image classification networks, to handle objects of various scales. (2) Object detection needs not only to recognize the category of each object instance but also to locate it spatially. A large downsampling factor brings a large valid receptive field, which is good for image classification but compromises the ability to localize objects. To bridge this gap between image classification and object detection, we propose DetNet in this paper, a novel backbone network specifically designed for object detection. DetNet includes the extra stages that traditional image classification backbones lack, while maintaining high spatial resolution in deeper layers. Without any bells and whistles, state-of-the-art results have been obtained for both object detection and instance segmentation on the MSCOCO benchmark based on our DetNet (4.8G FLOPs) backbone. The code will be released for reproduction.