PointMCD: Boosting Deep Point Cloud Encoders via Multi-view Cross-modal Distillation for 3D Shape Recognition

As two fundamental representation modalities of 3D objects, 3D point clouds and multi-view 2D images record shape information from two different domains: geometric structure and visual appearance. In the current deep learning era, remarkable progress in processing these two data modalities has been achieved by customizing compatible 3D and 2D network architectures, respectively. However, unlike multi-view image-based 2D visual modeling paradigms, which have shown leading performance on several common 3D shape recognition benchmarks, point cloud-based 3D geometric modeling paradigms are still highly limited by insufficient learning capacity, owing to the difficulty of extracting discriminative features from irregular geometric signals. In this paper, we explore the possibility of boosting deep 3D point cloud encoders by transferring visual knowledge extracted from deep 2D image encoders under a standard teacher-student distillation workflow. Specifically, we propose PointMCD, a unified multi-view cross-modal distillation architecture comprising a pretrained deep image encoder as the teacher and a deep point encoder as the student. To perform heterogeneous feature alignment between the 2D visual and 3D geometric domains, we further investigate visibility-aware feature projection (VAFP), by which point-wise embeddings are aggregated into view-specific geometric descriptors. By aligning the multi-view visual and geometric descriptors pairwise, we can obtain more powerful deep point encoders without laborious and complicated network modifications. Experiments on 3D shape classification, part segmentation, and unsupervised learning validate the effectiveness of our method. The code and data will be publicly available at https://github.com/keeganhk/PointMCD.
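To make the idea of visibility-aware aggregation and pairwise descriptor alignment more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The function names `view_descriptors` and `alignment_loss` are hypothetical, and the sketch assumes a precomputed boolean visibility mask, max pooling over visible points, and a mean squared error alignment objective. The abstract does not specify any of these choices.

```python
import numpy as np


def view_descriptors(point_feats, visibility):
    """Aggregate point-wise embeddings into one descriptor per view.

    point_feats: (N, C) array of point-wise embeddings (student side).
    visibility:  (M, N) boolean array; visibility[m, n] is True when point n
                 is visible from view m. Assumed to be precomputed.
    Returns an (M, C) array of view-specific geometric descriptors.

    Max pooling over visible points is an assumption of this sketch,
    not necessarily the operator used in the paper.
    """
    point_feats = np.asarray(point_feats, dtype=np.float64)
    visibility = np.asarray(visibility, dtype=bool)
    # Broadcast to (M, N, C) and hide invisible points behind -inf.
    masked = np.where(visibility[:, :, None], point_feats[None, :, :], -np.inf)
    pooled = masked.max(axis=1)
    # A view that sees no points gets a zero descriptor instead of -inf.
    has_points = visibility.any(axis=1, keepdims=True)
    return np.where(has_points, pooled, 0.0)


def alignment_loss(visual_desc, geometric_desc):
    """Mean squared error between paired (M, C) descriptors.

    visual_desc comes from the image teacher, geometric_desc from the point
    student; row m of each corresponds to the same view. MSE is an assumed
    choice of alignment objective.
    """
    diff = np.asarray(visual_desc, dtype=np.float64) - np.asarray(
        geometric_desc, dtype=np.float64
    )
    return float(np.mean(diff ** 2))


# Toy usage: 3 points with 2-dimensional embeddings, seen from 2 views.
point_feats = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
visibility = np.array([[True, True, False], [False, True, True]])
geometric = view_descriptors(point_feats, visibility)

# Stand-in for teacher descriptors; a real pipeline would obtain these
# from a pretrained image encoder applied to the rendered views.
visual = np.array([[1.0, 2.0], [3.0, 0.0]])
loss = alignment_loss(visual, geometric)
```

In a training loop this loss would be minimized with respect to the point encoder's parameters while the image encoder stays frozen, which matches the teacher-student setup described above. The masking step is what keeps each view's descriptor restricted to the points that view can actually see.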
