In conventional object detection frameworks, a backbone inherited from image recognition models extracts deep latent features, and a neck module then fuses these features to capture information at different scales. Because input resolution in object detection is much larger than in image recognition, the computational cost of the backbone often dominates the total inference cost. This heavy-backbone design paradigm is largely a historical legacy of transferring image recognition models to object detection, rather than the result of an end-to-end optimized design for detection. In this work, we show that such a paradigm indeed leads to sub-optimal object detection models. To address this, we propose a novel heavy-neck paradigm, GiraffeDet, a giraffe-like network for efficient object detection. GiraffeDet uses an extremely lightweight backbone and a very deep and large neck module, which encourages dense information exchange among different spatial scales and different levels of latent semantics simultaneously. This design allows the detector to process high-level semantic information and low-level spatial information with the same priority, even in the early stages of the network, making it more effective in detection tasks. Numerical evaluations on multiple popular object detection benchmarks show that GiraffeDet consistently outperforms previous state-of-the-art models across a wide spectrum of resource constraints.
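The claim that backbone cost grows with input resolution can be illustrated with a rough back-of-the-envelope calculation. The sketch below counts multiply-accumulate operations for a single convolution layer at two input sizes. The resolutions (224 and 1280 pixels), the channel widths, and the helper name `conv2d_macs` are illustrative assumptions, not values taken from this work.

```python
def conv2d_macs(height, width, c_in, c_out, kernel=3, stride=1):
    """Multiply-accumulate count for one 'same'-padded 2D convolution layer.

    Hypothetical helper for illustration; ignores bias and activation cost.
    """
    out_h = -(-height // stride)  # ceiling division gives the output height
    out_w = -(-width // stride)   # ceiling division gives the output width
    return out_h * out_w * c_in * c_out * kernel * kernel


# Assumed resolutions: 224x224 for recognition, 1280x1280 for detection.
recognition_macs = conv2d_macs(224, 224, c_in=64, c_out=64)
detection_macs = conv2d_macs(1280, 1280, c_in=64, c_out=64)

# The same layer costs more in proportion to the number of input pixels.
ratio = detection_macs / recognition_macs
print(f"detection / recognition cost ratio: {ratio:.1f}")
```

Under these assumptions the same layer costs roughly 33 times more at the detection resolution, since cost scales with the number of pixels. This is only meant to show why a backbone designed for recognition can become the dominant cost once it is reused for detection.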