Existing point-cloud-based 3D object detectors use convolution-like operators to process information in a local neighbourhood with fixed-weight kernels and aggregate global context hierarchically. However, non-local neural networks and self-attention for 2D vision have shown that explicitly modeling long-range interactions can lead to more robust and competitive models. In this paper, we propose two variants of self-attention for contextual modeling in 3D object detection, by augmenting convolutional features with self-attention features. We first incorporate the pairwise self-attention mechanism into the current state-of-the-art BEV, voxel, and point-based detectors and show consistent improvements of up to 1.5 3D AP over strong baseline models, while simultaneously reducing their parameter footprint and computational cost by 15-80% and 30-50%, respectively, on the KITTI validation set. We next propose a self-attention variant that samples a subset of the most representative features by learning deformations over randomly sampled locations. This not only allows us to scale explicit global contextual modeling to larger point clouds, but also leads to more discriminative and informative feature descriptors. Our method can be flexibly applied to most state-of-the-art detectors, yielding increased accuracy as well as parameter and compute efficiency. We show that our proposed method improves 3D object detection performance on the KITTI, nuScenes and Waymo Open datasets. Code is available at https://github.com/AutoVision-cloud/SA-Det3D.
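To make the first variant concrete, the following is a minimal sketch of pairwise (full) self-attention used to augment convolutional backbone features with global context. It is an illustration under simplifying assumptions, not the authors' exact module: the class name `FullSelfAttention`, the use of `nn.MultiheadAttention`, and the residual-plus-LayerNorm augmentation are choices made here for brevity.

```python
import torch
import torch.nn as nn


class FullSelfAttention(nn.Module):
    """Sketch: augment convolutional features (BEV pillars, voxels, or
    sampled points, flattened to (B, N, C)) with pairwise self-attention."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # standard multi-head pairwise attention over all N locations
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # every location attends to every other location (global context)
        context, _ = self.attn(x, x, x)
        # residual augmentation: convolutional features + attention features
        return self.norm(x + context)


if __name__ == "__main__":
    feats = torch.randn(2, 1024, 64)   # e.g. 1024 voxel/pillar features, 64 channels
    out = FullSelfAttention(64)(feats)
    print(out.shape)                   # torch.Size([2, 1024, 64])
```

Because the attention is computed over all N locations, the cost grows quadratically in N, which motivates the sampled variant described next.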
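The second variant can be sketched in the same spirit: instead of attending over all N features, attend over a small set of representative features obtained by learning deformations over randomly sampled locations. The sketch below is a hedged illustration only; the offset predictor (`offset_mlp`), nearest-neighbour gathering, the pooled broadcast back to all locations, and the parameter `num_samples` are simplifying assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn


class DeformableSelfAttentionSketch(nn.Module):
    """Sketch: self-attention over a learned, deformed subset of features."""

    def __init__(self, channels: int, num_samples: int = 128, num_heads: int = 4):
        super().__init__()
        self.num_samples = num_samples
        # predict 3D offsets ("deformations") for randomly sampled anchors
        self.offset_mlp = nn.Linear(channels, 3)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, feats: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, C) features; coords: (B, N, 3) point/voxel centers
        B, N, C = feats.shape
        # 1) randomly sample M anchor locations
        idx = torch.randint(0, N, (B, self.num_samples), device=feats.device)
        anchors = torch.gather(feats, 1, idx.unsqueeze(-1).expand(-1, -1, C))
        anchor_xyz = torch.gather(coords, 1, idx.unsqueeze(-1).expand(-1, -1, 3))
        # 2) learn deformations that shift anchors toward informative regions
        deformed_xyz = anchor_xyz + self.offset_mlp(anchors)
        # 3) gather features at the deformed locations (nearest neighbour here)
        dists = torch.cdist(deformed_xyz, coords)            # (B, M, N)
        nn_idx = dists.argmin(dim=-1)                         # (B, M)
        sampled = torch.gather(feats, 1, nn_idx.unsqueeze(-1).expand(-1, -1, C))
        # 4) global self-attention over the small representative subset
        context, _ = self.attn(sampled, sampled, sampled)     # (B, M, C)
        # 5) propagate pooled context back to every location (simplified)
        return feats + context.mean(dim=1, keepdim=True)
```

Attending over M sampled features rather than all N reduces the quadratic attention cost to roughly O(M^2), which is what allows explicit global contextual modeling to scale to larger point clouds.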