As two fundamental representation modalities of 3D objects, 3D point clouds and multi-view 2D images capture shape information from two different domains: geometric structure and visual appearance. In the current deep learning era, remarkable progress in processing these two data modalities has been achieved by customizing compatible 3D and 2D network architectures, respectively. However, unlike multi-view image-based 2D visual modeling paradigms, which have shown leading performance on several common 3D shape recognition benchmarks, point cloud-based 3D geometric modeling paradigms are still highly limited by insufficient learning capacity, owing to the difficulty of extracting discriminative features from irregular geometric signals. In this paper, we explore the possibility of boosting deep 3D point cloud encoders by transferring visual knowledge extracted from deep 2D image encoders under a standard teacher-student distillation workflow. Specifically, we propose PointMCD, a unified multi-view cross-modal distillation architecture that comprises a pretrained deep image encoder as the teacher and a deep point encoder as the student. To perform heterogeneous feature alignment between the 2D visual and 3D geometric domains, we further investigate visibility-aware feature projection (VAFP), by which point-wise embeddings are aggregated into view-specific geometric descriptors. By aligning the multi-view visual and geometric descriptors pairwise, we can obtain more powerful deep point encoders without exhaustive and complicated network modifications. Experiments on 3D shape classification, part segmentation, and unsupervised learning validate the effectiveness of our method. The code and data will be publicly available at https://github.com/keeganhk/PointMCD.
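The idea of aggregating point-wise embeddings into view-specific descriptors and aligning them with teacher features can be illustrated with a minimal sketch. The abstract does not specify how VAFP or the alignment objective is implemented, so everything below is an assumption made for illustration: the function names `view_descriptors` and `alignment_loss` are hypothetical, visibility is taken as a precomputed boolean mask, aggregation is masked max pooling, and alignment is a mean squared error between descriptors of equal width.

```python
import numpy as np

def view_descriptors(point_feats, visible):
    """Aggregate point-wise embeddings into one descriptor per view.

    point_feats: (N, C) float array of per-point embeddings (student side).
    visible:     (V, N) bool array; visible[v, n] is True when point n is
                 visible from view v (assumed to be precomputed).
    Returns a (V, C) array. Masked max pooling is an assumption here,
    not necessarily the aggregation used by the paper.
    """
    # Broadcast to (V, N, C); hidden points become -inf so they never win the max.
    masked = np.where(visible[:, :, None], point_feats[None, :, :], -np.inf)
    desc = masked.max(axis=1)
    # A view with no visible points would be all -inf; fall back to zeros.
    return np.where(np.isfinite(desc), desc, 0.0)

def alignment_loss(student_desc, teacher_desc):
    """Mean squared error between paired per-view descriptors (assumed loss)."""
    return float(np.mean((student_desc - teacher_desc) ** 2))

# Toy data standing in for real encoder outputs.
rng = np.random.default_rng(0)
point_feats = rng.normal(size=(1024, 64))   # student point embeddings
visible = rng.random((4, 1024)) < 0.5       # 4 views, random visibility
teacher_desc = rng.normal(size=(4, 64))     # teacher per-view image features

student_desc = view_descriptors(point_feats, visible)
loss = alignment_loss(student_desc, teacher_desc)
```

In a real pipeline the teacher and student feature widths would generally differ, so a learned projection would likely be needed before the comparison; the sketch sidesteps this by assuming both are 64-dimensional.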