In this paper, we present an efficient and high-performance neural architecture, termed Point-Voxel Transformer (PVT), for 3D deep learning, which deeply integrates both voxel-based and point-based self-attention computation to learn more discriminative features from 3D data. Specifically, we perform multi-head self-attention (MSA) computation on voxels to obtain an efficient learning pattern and coarse-grained local features, while performing self-attention on points to provide finer-grained information about the global context. In addition, to reduce the cost of MSA computation with high efficiency, we design a cyclic shifted boxing scheme that limits MSA computation to non-overlapping local boxes while preserving cross-box connections. Evaluated on the classification benchmark, our method not only achieves state-of-the-art accuracy of 94.0% (without voting) but also outperforms previous Transformer-based models with a 7x measured speedup on average. On part and semantic segmentation, our model also obtains strong performance (86.5% and 68.2% mIoU, respectively). For the 3D object detection task, we replace the primitives in Frustum PointNet with the PVT block and achieve an improvement of 8.6% AP.
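The cyclic shifted boxing idea can be illustrated with a minimal sketch: self-attention is restricted to non-overlapping boxes, and a cyclic shift of the feature sequence lets a second attention pass straddle the former box boundaries, recovering cross-box connections. The sketch below is a toy 1-D, single-head version in numpy for illustration only; the function names and the simplified (unprojected, single-head) attention are our assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def box_self_attention(x, box_size):
    # x: (L, C) voxel features flattened to a 1-D sequence (toy case).
    # Attention is computed independently inside each non-overlapping box,
    # so cost is O(L * box_size) rather than O(L^2).
    L, C = x.shape
    assert L % box_size == 0, "sequence length must be divisible by box size"
    boxes = x.reshape(L // box_size, box_size, C)
    attn = softmax(boxes @ boxes.transpose(0, 2, 1) / np.sqrt(C))
    return (attn @ boxes).reshape(L, C)

def shifted_box_attention(x, box_size):
    # Cyclically shift by half a box so the next attention pass spans
    # the boundaries between the previous boxes (cross-box connection),
    # then shift back to restore the original ordering.
    shift = box_size // 2
    y = np.roll(x, -shift, axis=0)
    y = box_self_attention(y, box_size)
    return np.roll(y, shift, axis=0)
```

In practice the two passes are stacked (regular boxes, then shifted boxes), so information propagates across the whole volume after a few blocks while each individual MSA stays cheap and local.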