Recent advances in 2D CNNs and vision transformers (ViTs) show that large kernels are essential for sufficiently large receptive fields and high performance. Inspired by this literature, we examine the feasibility and challenges of 3D large-kernel designs. We demonstrate that applying large convolutional kernels in 3D CNNs poses greater difficulties in both performance and efficiency than in 2D. Existing techniques that work well in 2D CNNs, including the popular depth-wise convolutions, are ineffective in 3D networks. To overcome these obstacles, we present the spatial-wise group convolution and its large-kernel module (SW-LK block), which avoids the optimization and efficiency issues of naive 3D large kernels. Our large-kernel 3D CNN, LargeKernel3D, yields non-trivial improvements on various 3D tasks, including semantic segmentation and object detection. Notably, it achieves 73.9% mIoU on the ScanNetv2 semantic segmentation benchmark and 72.8% NDS on the nuScenes object detection benchmark, ranking 1st on the nuScenes LiDAR leaderboard. A simple multi-modal fusion further boosts it to 74.2% NDS. LargeKernel3D attains results comparable or superior to those of its CNN and transformer counterparts. For the first time, we show that large kernels are feasible and essential for 3D networks.
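To make the efficiency argument concrete, the following is a minimal counting sketch, not the authors' implementation. It compares the number of distinct spatial weights in a dense kernel with the number left when neighbouring kernel positions share one weight, which is one reading of "spatial-wise group convolution". The function names and the (2, 3, 2) per-axis partition are illustrative assumptions; the abstract does not specify how kernel positions are grouped.

```python
import itertools


def dense_weight_count(kernel_size, dims=3):
    """Distinct spatial weights in a dense kernel, per input/output channel pair."""
    return kernel_size ** dims


def axis_group_ids(sizes):
    """Expand per-axis group sizes into a group id per kernel offset.

    Example: (2, 3, 2) -> [0, 0, 1, 1, 1, 2, 2] for a kernel of size 7.
    """
    ids = []
    for group, size in enumerate(sizes):
        ids.extend([group] * size)
    return ids


def shared_weight_count(sizes, dims=3):
    """Distinct weights when positions falling in the same spatial group share one weight.

    Assumption: the same hypothetical partition `sizes` is applied on every axis.
    """
    axis = axis_group_ids(sizes)
    groups = {
        tuple(axis[i] for i in position)
        for position in itertools.product(range(len(axis)), repeat=dims)
    }
    return len(groups)


if __name__ == "__main__":
    print("2D 7x7 dense:", dense_weight_count(7, dims=2))
    print("3D 3x3x3 dense:", dense_weight_count(3, dims=3))
    print("3D 7x7x7 dense:", dense_weight_count(7, dims=3))
    print("3D 7x7x7 with (2, 3, 2) sharing:", shared_weight_count((2, 3, 2)))
```

Under these assumptions, the dense 7x7x7 kernel has seven times as many spatial weights as its 7x7 2D counterpart, while the shared variant covers the same 7x7x7 extent with as many distinct weights as a dense 3x3x3 kernel.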