We present Sparse R-CNN, a purely sparse method for object detection in images. Existing works on object detection rely heavily on dense object candidates, such as $k$ anchor boxes pre-defined on all grids of an image feature map of size $H\times W$. In our method, however, a fixed sparse set of learned object proposals, of total length $N$, is provided to the object recognition head to perform classification and localization. By reducing the $HWk$ (up to hundreds of thousands) hand-designed object candidates to $N$ (e.g., 100) learnable proposals, Sparse R-CNN completely avoids all effort related to object-candidate design and many-to-one label assignment. More importantly, final predictions are output directly, without the non-maximum suppression post-processing step. Sparse R-CNN demonstrates accuracy, run-time, and training-convergence performance on par with well-established detector baselines on the challenging COCO dataset, e.g., achieving 45.0 AP under the standard $3\times$ training schedule and running at 22 fps with a ResNet-50 FPN model. We hope our work inspires re-thinking of the convention of dense priors in object detectors. The code is available at: https://github.com/PeizeSun/SparseR-CNN.
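The scale of the reduction claimed above can be sketched with a back-of-the-envelope count. This is a minimal illustration, not the paper's code; the feature-map size and anchor count per cell are assumed example values.

```python
def dense_candidates(h, w, k):
    """Dense detectors pre-define k anchor boxes on every cell
    of an H x W feature map, giving H*W*k candidates."""
    return h * w * k

def sparse_candidates(n=100):
    """Sparse R-CNN instead uses a fixed set of N learnable
    proposals, independent of the feature-map size."""
    return n

# Assumed example: a 100 x 167 feature map with k = 9 anchors per cell.
dense = dense_candidates(100, 167, 9)  # 150300 hand-designed candidates
sparse = sparse_candidates(100)        # 100 learnable proposals
print(dense, sparse)                   # prints "150300 100"
```

The dense count grows with image resolution (and with the number of pyramid levels in an FPN), whereas the sparse proposal set stays constant, which is what removes the need for many-to-one label assignment and NMS.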