The past few years have witnessed the prevalence of self-supervised representation learning within the language and 2D vision communities. However, such advances have not yet been fully transferred to the community of 3D point cloud learning. Different from previous pre-training pipelines for 3D point clouds, which generally fall into the scope of either generative modeling or contrastive learning, in this paper we investigate a translative pre-training paradigm, namely PointVST, driven by a novel self-supervised pretext task of cross-modal translation from an input 3D object point cloud to its diverse forms of 2D rendered images (e.g., silhouette, depth, and contour maps). Specifically, we begin by deriving view-conditioned point-wise embeddings through the insertion of a viewpoint indicator, and then adaptively aggregate them into a view-specific global codeword, which is further fed into subsequent 2D convolutional translation heads for image generation. We conduct extensive experiments on common task scenarios of 3D shape analysis, where our PointVST shows consistent and prominent performance superiority over current state-of-the-art methods under diverse evaluation protocols. Our code will be made publicly available.
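To make the described pipeline concrete, the following is a minimal sketch, not the authors' implementation, of the view-conditioned translation idea in a PyTorch-style setup; the module names, feature dimensions, output resolution, and the attention-based aggregation are illustrative assumptions rather than details confirmed by the paper.

```python
# Illustrative sketch of a view-conditioned point-to-image translation head.
# Assumptions (not from the paper): 256-d point features, a 3-d view vector,
# attention pooling for "adaptive aggregation", and a 32x32 single-channel output.
import torch
import torch.nn as nn

class ViewConditionedTranslator(nn.Module):
    def __init__(self, feat_dim=256, view_dim=3, img_ch=1):
        super().__init__()
        # Fuse per-point features with the viewpoint indicator.
        self.view_mlp = nn.Sequential(
            nn.Linear(feat_dim + view_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
        )
        # Adaptive aggregation: per-point attention weights, then weighted pooling.
        self.attn = nn.Linear(feat_dim, 1)
        # 2D convolutional translation head: global codeword -> rendered image.
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, 128 * 4 * 4), nn.ReLU(),
            nn.Unflatten(1, (128, 4, 4)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),  # 4x4 -> 8x8
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 8x8 -> 16x16
            nn.ConvTranspose2d(32, img_ch, 4, stride=2, padding=1),          # 16x16 -> 32x32
        )

    def forward(self, point_feats, viewpoint):
        # point_feats: (B, N, feat_dim) per-point embeddings from any point cloud backbone
        # viewpoint:   (B, view_dim) indicator of the target rendering view
        B, N, _ = point_feats.shape
        view = viewpoint.unsqueeze(1).expand(B, N, -1)
        h = self.view_mlp(torch.cat([point_feats, view], dim=-1))  # view-conditioned embeddings
        w = torch.softmax(self.attn(h), dim=1)                     # adaptive aggregation weights
        codeword = (w * h).sum(dim=1)                              # view-specific global codeword
        return self.decoder(codeword)                              # predicted 2D rendering
```

In this sketch the pretext objective would be a pixel-wise reconstruction loss between the predicted image and the corresponding ground-truth rendering (e.g., the depth or silhouette map) of the input shape from the indicated viewpoint.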