Manual annotation of large-scale point clouds is time-consuming and often unavailable in harsh real-world scenarios. Inspired by the great success of the pre-training and fine-tuning paradigm in both vision and language tasks, we argue that pre-training is also a promising way to obtain a model that scales to 3D point cloud downstream tasks. In this paper, we therefore explore a new self-supervised learning method, called Mixing and Disentangling (MD), for 3D point cloud representation learning. As the name implies, we mix two input shapes and require the model to learn to separate the inputs from the mixed shape. We leverage this reconstruction task as the pretext optimization objective for self-supervised learning. This design offers two primary advantages: 1) compared with prevailing image datasets, e.g., ImageNet, point cloud datasets are de facto small, and the mixing process provides a much larger online pool of training samples; 2) the disentangling process motivates the model to mine geometric prior knowledge, e.g., key points. To verify the effectiveness of the proposed pretext task, we build a baseline network composed of one encoder and one decoder. During pre-training, we mix two original shapes, obtain a geometry-aware embedding from the encoder, and then apply an instance-adaptive decoder to recover the original shapes from the embedding. Albeit simple, the pre-trained encoder captures the key points of an unseen point cloud and surpasses the encoder trained from scratch on downstream tasks. The proposed method improves empirical performance on both the ModelNet-40 and ShapeNet-Part datasets for point cloud classification and segmentation. We further conduct ablation studies to explore the effect of each component and verify the generalization of our proposed strategy by harnessing different backbones.
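To make the mix-then-disentangle pipeline concrete, the following is a minimal sketch under illustrative assumptions: the half-and-half point-sampling mixer, the toy PointNet-style `PointEncoder`, the two-headed `ShapeDecoder`, and the Chamfer reconstruction loss are our own simplifications for exposition, not the paper's exact instance-adaptive architecture.

```python
# Minimal sketch (not the authors' code) of the Mixing and Disentangling
# pretext task: mix two point clouds, encode the mixture, and train a
# decoder to recover both originals from the shared embedding.
import torch
import torch.nn as nn


def mix_shapes(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Mix two point clouds (B, N, 3) by sampling half the points from
    each (an assumed mixing scheme), yielding one cloud of size N."""
    n = a.shape[1]
    idx_a = torch.randperm(n)[: n // 2]
    idx_b = torch.randperm(n)[: n - n // 2]
    return torch.cat([a[:, idx_a], b[:, idx_b]], dim=1)


class PointEncoder(nn.Module):
    """Toy PointNet-style encoder: shared point-wise MLP + max pooling."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, dim))

    def forward(self, pts: torch.Tensor) -> torch.Tensor:
        return self.mlp(pts).max(dim=1).values  # (B, dim) global embedding


class ShapeDecoder(nn.Module):
    """Decodes one global embedding into two point clouds; stands in for
    the paper's instance-adaptive decoder."""
    def __init__(self, dim: int = 256, n_points: int = 1024):
        super().__init__()
        self.n_points = n_points
        self.fc = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(),
                                nn.Linear(512, 2 * n_points * 3))

    def forward(self, z: torch.Tensor):
        out = self.fc(z).view(-1, 2, self.n_points, 3)
        return out[:, 0], out[:, 1]  # two disentangled reconstructions


def chamfer(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between point clouds (B, N, 3)."""
    d = torch.cdist(x, y)  # (B, N, M) pairwise distances
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()


# One pre-training step: reconstruct both originals from the mixed shape.
enc, dec = PointEncoder(), ShapeDecoder()
a, b = torch.rand(4, 1024, 3), torch.rand(4, 1024, 3)
rec_a, rec_b = dec(enc(mix_shapes(a, b)))
loss = chamfer(rec_a, a) + chamfer(rec_b, b)
loss.backward()
```

After pre-training, one would discard the decoder and fine-tune the encoder on downstream classification or segmentation; the sketch above ignores details such as how the two decoder outputs are matched to the two inputs, which the paper's instance-adaptive decoder is designed to handle.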