Extending the success of 2D large kernels to 3D perception is challenging for two reasons: (1) the cubically increasing overhead of processing 3D data, and (2) the optimization difficulties caused by data scarcity and sparsity. Previous work took a first step by scaling the kernel size from 3x3x3 to 7x7x7 with block-shared weights. However, to limit feature variation within a block, it employs only a modest block size and fails to reach larger kernels such as 21x21x21. To address this issue, we propose a new method, called LinK, that achieves a wider receptive field in a convolution-like manner through two core designs. The first replaces the static kernel matrix with a linear kernel generator, which adaptively provides weights only for non-empty voxels. The second reuses the pre-computed aggregation results in the overlapped blocks to reduce computational complexity. The proposed method enables each voxel to perceive context within a range of 21x21x21. Extensive experiments on two basic perception tasks, 3D object detection and 3D semantic segmentation, demonstrate the effectiveness of our method. Notably, we rank 1st on the public leaderboard of the nuScenes 3D detection benchmark (LiDAR track) by simply incorporating a LinK-based backbone into the basic detector, CenterPoint. We also improve the mIoU of a strong segmentation baseline by 2.7% on the SemanticKITTI test set. Code is available at https://github.com/MCG-NJU/LinK.
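To make the two designs concrete, the following is a minimal NumPy sketch, not the released LinK implementation. The abstract does not specify the form of the generator, so the sketch assumes a purely affine one, `weight(offset) = offset @ A + b`, with one weight per channel, and it treats all non-empty voxels as a single block. The names `A`, `b`, `naive_aggregate`, and `reused_aggregate` are hypothetical. Under these assumptions, the weighted sum over neighbors splits into center-independent terms that can be computed once and reused for every query voxel.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sparse input: N non-empty voxels inside one 21x21x21 block,
# each with a 3D coordinate and a C-channel feature vector.
N, C = 50, 4
coords = rng.integers(0, 21, size=(N, 3)).astype(np.float64)
feats = rng.normal(size=(N, C))

# Assumed affine generator: weight(offset) = offset @ A + b (one weight per channel).
A = rng.normal(size=(3, C))
b = rng.normal(size=(C,))


def naive_aggregate(center):
    """Generate a weight for every non-empty voxel, then sum the weighted features."""
    offsets = coords - center              # (N, 3) relative positions
    weights = offsets @ A + b              # (N, C) generated weights
    return (weights * feats).sum(axis=0)   # (C,)


# Center-independent aggregates, computed once per block.
block_weighted = ((coords @ A + b) * feats).sum(axis=0)  # sum_j (p_j A + b) * f_j
block_sum = feats.sum(axis=0)                            # sum_j f_j


def reused_aggregate(center):
    """Same result as naive_aggregate, using only the pre-computed block aggregates.

    sum_j ((p_j - c) A + b) * f_j = block_weighted - (c A) * block_sum
    """
    return block_weighted - (center @ A) * block_sum


center = np.array([10.0, 10.0, 10.0])
print(np.allclose(naive_aggregate(center), reused_aggregate(center)))
```

In this simplified setting the per-query cost drops from a pass over all N voxels to a constant number of operations per channel, which is the kind of saving that reusing block-level aggregation results is meant to provide. Empty voxels never enter the computation, because weights are generated only for the coordinates that are present.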