Current state-of-the-art object detectors are fine-tuned from off-the-shelf networks pretrained on the large-scale classification dataset ImageNet, which incurs some additional problems: 1) classification and detection have different degrees of sensitivity to translation, resulting in a learning objective bias; 2) the architecture is limited by the classification network, making it inconvenient to modify. To cope with these problems, training detectors from scratch is a feasible solution. However, detectors trained from scratch generally perform worse than pretrained ones and can even suffer from convergence issues in training. In this paper, we explore how to train object detectors from scratch robustly. By analysing previous work on the optimization landscape, we find that one of the overlooked points in current trained-from-scratch detectors is BatchNorm. Thanks to the stable and predictable gradients brought by BatchNorm, detectors can be trained from scratch stably while maintaining favourable performance independent of the network architecture. Taking advantage of this, we are able to explore various types of networks for object detection without suffering from poor convergence. Through extensive experiments and analyses on the downsampling factor, we propose the Root-ResNet backbone network, which makes full use of the information from the original images. Our ScratchDet achieves state-of-the-art accuracy on PASCAL VOC 2007, 2012 and MS COCO among all train-from-scratch detectors and even performs better than several one-stage pretrained methods. Code will be made publicly available at https://github.com/KimSoybean/ScratchDet.
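To make the role of BatchNorm concrete, the following is a minimal sketch of the standard training-mode batch normalization computation for a single channel, written in plain Python. It is not the authors' implementation; the function name `batch_norm`, the parameters `gamma`, `beta`, and `eps`, and the sample activations are all illustrative assumptions.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel's activations across a mini-batch (training mode).

    Hypothetical helper for illustration only: `gamma` and `beta` stand in for
    the learnable scale and shift, and `eps` guards against division by zero.
    """
    n = len(batch)
    mean = sum(batch) / n
    # Biased (population) variance, as used for normalization during training.
    var = sum((x - mean) ** 2 for x in batch) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [gamma * (x - mean) * inv_std + beta for x in batch]


# Activations with very different magnitudes (made-up values).
activations = [0.5, 120.0, -35.0, 7.25]
normalized = batch_norm(activations)

# After normalization the channel has roughly zero mean and unit variance.
norm_mean = sum(normalized) / len(normalized)
norm_var = sum((x - norm_mean) ** 2 for x in normalized) / len(normalized)
```

Whatever the scale of the incoming activations, the normalized values have approximately zero mean and unit variance. This rescaling is what the abstract relies on when it attributes stable and predictable gradients to BatchNorm.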