Existing point-cloud-based 3D object detectors use convolution-like operators to process information in a local neighbourhood with fixed-weight kernels and aggregate global context hierarchically. However, non-local neural networks and self-attention for 2D vision have shown that explicitly modeling long-range interactions can lead to more robust and competitive models. In this paper, we propose two variants of self-attention for contextual modeling in 3D object detection, augmenting convolutional features with self-attention features. We first incorporate the pairwise self-attention mechanism into current state-of-the-art BEV-, voxel-, and point-based detectors and show consistent improvements of up to 1.5 3D AP over strong baseline models, while simultaneously reducing their parameter footprint and computational cost by 15-80\% and 30-50\%, respectively, on the KITTI validation set. We next propose a self-attention variant that samples a subset of the most representative features by learning deformations over randomly sampled locations. This not only allows us to scale explicit global contextual modeling to larger point clouds, but also leads to more discriminative and informative feature descriptors. Our method can be flexibly applied to most state-of-the-art detectors, with increased accuracy and improved parameter and compute efficiency. We show that our proposed method improves 3D object detection performance on the KITTI, nuScenes and Waymo Open datasets. Code is available at \url{https://github.com/AutoVision-cloud/SA-Det3D}.
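To make the core idea concrete, below is a minimal NumPy sketch of pairwise self-attention used to augment per-point convolutional features with a residual attention term. The projection matrices, feature shapes, and the residual-add form are illustrative assumptions for exposition, not the paper's exact implementation.

```python
import numpy as np

def pairwise_self_attention(feats, Wq, Wk, Wv):
    """Augment N x d point features with pairwise self-attention.

    feats: (N, d) convolutional features; Wq, Wk, Wv: (d, d) learned
    projections (here random, for illustration only).
    """
    q, k, v = feats @ Wq, feats @ Wk, feats @ Wv
    # Scaled dot-product scores over all N x N feature pairs.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Numerically stable row-wise softmax.
    scores -= scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    # Residual augmentation: convolutional features + attended context.
    return feats + attn @ v

rng = np.random.default_rng(0)
N, d = 8, 16                         # toy sizes: 8 points, 16-dim features
feats = rng.standard_normal((N, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = pairwise_self_attention(feats, Wq, Wk, Wv)
print(out.shape)  # (8, 16): same shape as the input features
```

The deformable variant would replace the full N x N score matrix with attention over a small learned subset of sampled locations, which is what makes global context tractable on larger point clouds.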