In this paper, we propose a simple and general framework for self-supervised point cloud representation learning. Human beings understand the 3D world by extracting two levels of information and establishing the relationship between them: the global shape of an object and its local structures. However, few existing studies in point cloud representation learning have explored how to learn both global shapes and local-to-global relationships without relying on a specific network architecture. Inspired by how human beings understand the world, we use knowledge distillation to learn both global shape information and the relationship between global shape and local structures. We also combine contrastive learning with knowledge distillation so that the teacher network is updated more effectively. Our method achieves state-of-the-art performance on linear classification and multiple other downstream tasks. In addition, we develop a variant of ViT for 3D point cloud feature extraction, which achieves results comparable to existing backbones when combined with our framework. Visualizations of the attention maps show that our model understands a point cloud by combining global shape information with information from multiple local structures, which is consistent with the motivation behind our representation learning method. Our code will be released soon.
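
The abstract does not specify the loss or the teacher update rule, so the following is only a minimal sketch of what a global-to-local distillation objective could look like, not the authors' implementation. It assumes that the teacher sees the full (global) point cloud, that the student sees the global view plus several local crops, and that the teacher is updated as an exponential moving average of the student, as is common in self-distillation. The function names, the temperatures, and the momentum value are all hypothetical, and the contrastive term mentioned in the abstract is omitted.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, pred_probs, eps=1e-12):
    """H(target, pred) = -sum_k target_k * log(pred_k)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, pred_probs))


def distillation_loss(teacher_global_logits, student_view_logits,
                      teacher_temp=0.04, student_temp=0.1):
    """Average cross-entropy between the teacher's output on the global
    shape and the student's output on each view (global or local crop).

    Temperatures are illustrative assumptions, not values from the paper.
    """
    target = softmax(teacher_global_logits, teacher_temp)
    losses = [cross_entropy(target, softmax(view, student_temp))
              for view in student_view_logits]
    return sum(losses) / len(losses)


def ema_update(teacher_params, student_params, momentum=0.99):
    """Assumed teacher update: exponential moving average of the student."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


if __name__ == "__main__":
    teacher = [2.0, 0.0, -1.0]                       # teacher logits, global view
    views = [[2.0, 0.0, -1.0], [1.5, 0.1, -0.5]]     # student logits, two views
    print("loss:", distillation_loss(teacher, views))
    print("teacher params:", ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9))
```

In this sketch the loss is small when the student's predictions on local crops agree with the teacher's prediction on the whole shape, which is one way to encode the local-to-global relationship described above.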