3D object detection from monocular images is a challenging, long-standing problem in computer vision. To combine information from different viewpoints without cumbersome 2D instance tracking, recent methods tend to aggregate multi-view features by densely sampling a regular 3D grid over the scene, which is inefficient. In this paper, we improve multi-view feature aggregation with a learnable keypoint sampling method that scatters pseudo surface points in 3D space so as to preserve data sparsity. The scattered points, augmented with multi-view geometric constraints and visual features, are then used to infer the location and shape of objects in the scene. To overcome the limitations of a single frame and to model multi-view geometry explicitly, we further propose a surface filter module for noise suppression. Experimental results show that our method significantly outperforms previous works on 3D detection (an improvement of more than 0.1 AP on some categories of ScanNet). The code will be made publicly available.
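The core aggregation step described above, sampling image features at a sparse set of 3D points rather than over a dense grid, can be sketched generically as follows. This is a minimal illustration of sparse multi-view feature sampling, not the paper's actual method: the function names, pinhole camera model, and simple visibility-masked averaging are assumptions, and the learnable keypoint sampler and surface filter module are omitted.

```python
import numpy as np

def project_points(pts, K, R, t):
    """Project Nx3 world points into one view; returns Nx2 pixel coords and depths."""
    cam = pts @ R.T + t                       # world frame -> camera frame
    z = cam[:, 2]
    uv = (cam @ K.T)[:, :2] / np.clip(z[:, None], 1e-6, None)
    return uv, z

def bilinear_sample(feat, uv):
    """Bilinearly sample an (H, W, C) feature map at Nx2 (u, v) pixel coords."""
    H, W, _ = feat.shape
    u = np.clip(uv[:, 0], 0, W - 1.001)
    v = np.clip(uv[:, 1], 0, H - 1.001)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    du, dv = (u - u0)[:, None], (v - v0)[:, None]
    return ((1 - du) * (1 - dv) * feat[v0, u0]
            + du * (1 - dv) * feat[v0, u0 + 1]
            + (1 - du) * dv * feat[v0 + 1, u0]
            + du * dv * feat[v0 + 1, u0 + 1])

def aggregate_sparse(pts, views):
    """Average per-view features over the views in which each sparse point is visible.

    pts:   (N, 3) sparse 3D points (e.g. scattered pseudo surface points).
    views: list of (feat, K, R, t) tuples, one per camera view.
    """
    acc, cnt = None, np.zeros((len(pts), 1))
    for feat, K, R, t in views:
        uv, z = project_points(pts, K, R, t)
        H, W, _ = feat.shape
        vis = (z > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < W) \
                      & (uv[:, 1] >= 0) & (uv[:, 1] < H)
        f = bilinear_sample(feat, uv)
        f[~vis] = 0.0                         # ignore views where the point is unseen
        acc = f if acc is None else acc + f
        cnt += vis[:, None]
    return acc / np.clip(cnt, 1, None)
```

Because only N keypoints are sampled instead of a full H×W×D voxel grid, the cost of aggregation scales with the number of points on object surfaces rather than with the volume of the scene, which is the sparsity advantage the abstract refers to.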