Inspired by the success of recent vision transformers and of large-kernel design in convolutional neural networks (CNNs), in this paper we analyze and explore the essential reasons for their success. We identify two factors that are critical for 3D large-scale scene understanding: a larger receptive field and operations with greater non-linearity. The former provides long-range context, while the latter enhances the capacity of the network. To achieve both properties, we propose a simple yet effective long range pooling (LRP) module based on dilated max pooling, which gives the network a large, adaptive receptive field. LRP has few parameters and can be readily added to existing CNNs. Building on LRP, we further present a complete network architecture, LRPNet, for 3D understanding. Ablation studies support our claims, showing that the LRP module achieves better results than large-kernel convolution with less computation, owing to its greater non-linearity. We also demonstrate the superiority of LRPNet on various benchmarks: LRPNet performs best on ScanNet and surpasses other CNN-based methods on S3DIS and Matterport3D. Code will be made publicly available.
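To make the core idea concrete, the following is a minimal sketch (an illustrative assumption, not the paper's implementation) of 1D dilated max pooling: each output position takes the maximum over `kernel_size` inputs spaced `dilation` apart, so the receptive field grows to `(kernel_size - 1) * dilation + 1` without adding any learnable parameters, while the `max` itself is a non-linear operation.

```python
def dilated_max_pool_1d(x, kernel_size=3, dilation=2):
    """Illustrative 1D dilated max pooling (stride 1, no padding beyond the
    right edge). For each position i, take the max over inputs at
    i, i + dilation, i + 2*dilation, ... (up to kernel_size taps),
    enlarging the receptive field with zero extra parameters."""
    n = len(x)
    out = []
    for i in range(n):
        taps = [x[i + k * dilation]
                for k in range(kernel_size)
                if i + k * dilation < n]
        out.append(max(taps))
    return out

# With kernel_size=3 and dilation=2, each output sees a span of
# (3 - 1) * 2 + 1 = 5 input positions.
print(dilated_max_pool_1d([1, 5, 2, 0, 3, 9, 4]))
```

In the actual LRP module this idea is applied to sparse 3D voxel features rather than a 1D sequence, but the principle is the same: max pooling with dilation trades parameterized large kernels for a cheap, non-linear operation with an equally large receptive field.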