Most existing point-cloud-based 3D object detectors use convolution-like operators to process information in a local neighbourhood with fixed-weight kernels and aggregate global context hierarchically. However, recent work on non-local neural networks and self-attention for 2D vision has shown that explicitly modeling global context and long-range interactions between positions can lead to more robust and competitive models. In this paper, we explore two variants of self-attention for contextual modeling in 3D object detection by augmenting convolutional features with self-attention features. We first incorporate the pairwise self-attention mechanism into current state-of-the-art BEV, voxel, and point-based detectors and show consistent improvement over strong baseline models while also significantly reducing their parameter footprint and computational cost. We also propose a self-attention variant that samples a subset of the most representative features by learning deformations over randomly sampled locations. This not only allows us to scale explicit global contextual modeling to larger point clouds, but also leads to more discriminative and informative feature descriptors. Our method can be flexibly applied to most state-of-the-art detectors, increasing accuracy while improving parameter and compute efficiency. We achieve new state-of-the-art detection performance on the KITTI and nuScenes datasets. Code is available at https://github.com/AutoVision-cloud/SA-Det3D.
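As a rough illustration of the two ideas above, the following minimal sketch computes pairwise scaled dot-product self-attention over a small set of feature vectors and appends the resulting context to each input feature. It is not taken from the released code: the function names, the plain-list representation, and the use of concatenation as the augmentation step are all assumptions made for illustration. The optional `key_idx` argument restricts attention to a subset of locations; here that subset is simply passed in by the caller, and the learned deformations described in the abstract are omitted.

```python
import math


def matvec(weight, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features, w_q, w_k, w_v, key_idx=None):
    """Pairwise scaled dot-product self-attention (illustrative sketch).

    features: list of N feature vectors, e.g. one per voxel or point.
    w_q, w_k, w_v: hypothetical projection matrices (lists of rows).
    key_idx: optional indices of the locations to attend to. None means
             every location attends to every other location.
    """
    if key_idx is None:
        key_idx = range(len(features))
    queries = [matvec(w_q, f) for f in features]
    keys = [matvec(w_k, features[j]) for j in key_idx]
    values = [matvec(w_v, features[j]) for j in key_idx]
    scale = math.sqrt(len(keys[0]))
    context = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        context.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return context


def augment(conv_features, context):
    """Concatenate each local feature with its attention context.

    Concatenation is an assumption here; other fusion schemes are possible.
    """
    return [list(f) + list(c) for f, c in zip(conv_features, context)]
```

With full attention the cost grows quadratically in the number of locations, whereas passing a fixed-size `key_idx` keeps the cost linear in the number of queries, which is the motivation for attending to a sampled subset on larger point clouds.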