In conventional object detection frameworks, a backbone inherited from image recognition models extracts deep latent features, and a neck module then fuses these features to capture information at different scales. Because the input resolution in object detection is much larger than in image recognition, the computational cost of the backbone often dominates the total inference cost. This heavy-backbone design paradigm is largely a historical legacy of transferring image recognition models to object detection, rather than the result of an end-to-end design optimized for detection. In this work, we show that such a paradigm indeed leads to sub-optimal object detection models. To this end, we propose a novel heavy-neck paradigm, GiraffeDet, a giraffe-like network for efficient object detection. GiraffeDet uses an extremely lightweight backbone and a very deep and large neck module, which encourages dense information exchange among different spatial scales and different levels of latent semantics simultaneously. This design allows detectors to process high-level semantic information and low-level spatial information with equal priority even in the early stages of the network, making it more effective in detection tasks. Numerical evaluations on multiple popular object detection benchmarks show that GiraffeDet consistently outperforms previous state-of-the-art models across a wide spectrum of resource constraints. The source code is available at https://github.com/jyqi/GiraffeDet.
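The claim that backbone cost grows with input resolution can be made concrete with a rough back-of-the-envelope calculation. The sketch below is not taken from the GiraffeDet code; the helper names, the 64-channel layer, and the two input sizes (224x224 for recognition, 800x1280 for detection) are illustrative assumptions. It counts the multiply-accumulate operations of a single convolution layer, which scale linearly with the number of output pixels.

```python
def conv_macs(height, width, in_ch, out_ch, kernel=3, stride=1):
    """Multiply-accumulates of one conv layer, assuming 'same' padding."""
    out_h = -(-height // stride)  # ceiling division
    out_w = -(-width // stride)
    return out_h * out_w * in_ch * out_ch * kernel * kernel


def cost_ratio(det_hw, cls_hw, in_ch=64, out_ch=64):
    """How much more one layer costs at detection vs. recognition resolution."""
    det = conv_macs(det_hw[0], det_hw[1], in_ch, out_ch)
    cls = conv_macs(cls_hw[0], cls_hw[1], in_ch, out_ch)
    return det / cls


# Assumed, illustrative input sizes (height, width); not values from the paper.
ratio = cost_ratio((800, 1280), (224, 224))
print(f"{ratio:.1f}x")  # prints "20.4x"
```

Under these assumptions, the same layer costs about twenty times more at detection resolution, which is why a backbone designed for recognition can dominate a detector's inference budget.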