3D object detection is a crucial research topic in computer vision, which conventionally takes 3D point clouds as input. Recently, there has been a trend of leveraging multiple sources of input data, such as complementing the 3D point cloud with 2D images, which often have richer color and less noise. However, the heterogeneous geometry of the 2D and 3D representations prevents us from applying off-the-shelf neural networks to achieve multimodal fusion. To that end, we propose Bridged Transformer (BrT), an end-to-end architecture for 3D object detection. BrT is simple and effective, learning to identify 3D and 2D object bounding boxes from both points and image patches. A key element of BrT lies in the utilization of object queries for bridging 3D and 2D spaces, which unifies different sources of data representations in the Transformer. We adopt a form of feature aggregation realized by point-to-patch projection, which further strengthens the correlations between images and points. Moreover, BrT works seamlessly for fusing the point cloud with multi-view images. We experimentally show that BrT surpasses state-of-the-art methods on the SUN RGB-D and ScanNetV2 datasets.
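The abstract does not spell out the point-to-patch projection, but the general idea of associating 3D points with the image patches they fall into can be sketched as follows. This is a minimal illustrative sketch assuming a pinhole camera model with known intrinsics and a fixed patch grid; the function name, intrinsics, and patch size are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def point_to_patch_indices(points, K, image_size, patch_size):
    """Project 3D points (N, 3) in camera coordinates onto the image
    plane and return the index of the patch each point falls into.
    Points behind the camera or outside the image are marked -1.
    Illustrative sketch, not the BrT implementation."""
    H, W = image_size
    z = points[:, 2]
    # Perspective projection with intrinsics K (3x3): [u*z, v*z, z] = K @ p
    uvw = (K @ points.T).T
    u = uvw[:, 0] / z
    v = uvw[:, 1] / z
    # Patch grid coordinates (row-major patch indexing)
    px = np.floor(u / patch_size).astype(int)
    py = np.floor(v / patch_size).astype(int)
    n_cols = W // patch_size
    idx = py * n_cols + px
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    return np.where(valid, idx, -1)

# Toy example: focal length 100, principal point (112, 112), 224x224 image,
# 16x16 patches (a 14x14 patch grid, as in standard ViT-style tokenization)
K = np.array([[100.0, 0.0, 112.0],
              [0.0, 100.0, 112.0],
              [0.0, 0.0, 1.0]])
points = np.array([[0.0, 0.0, 2.0],    # projects to the image center
                   [1.0, 1.0, 2.0],    # offset point
                   [0.0, 0.0, -1.0]])  # behind the camera -> invalid
idx = point_to_patch_indices(points, K, image_size=(224, 224), patch_size=16)
print(idx)  # [105 150  -1]
```

Once each point is mapped to a patch index, point features can be aggregated into the corresponding patch tokens (e.g., by scatter-mean), which is the kind of correlation between points and image patches the abstract refers to.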