To date, various 3D scene understanding tasks still lack practical and generalizable pre-trained models, primarily due to the intricate nature of 3D scene understanding and the immense variations introduced by camera views, lighting, occlusions, etc. In this paper, we tackle this challenge by introducing a spatio-temporal representation learning (STRL) framework capable of learning from unlabeled 3D point clouds in a self-supervised fashion. Inspired by how infants learn from visual data in the wild, we explore the rich spatio-temporal cues derived from 3D data. Specifically, STRL takes two temporally-correlated frames from a 3D point cloud sequence as input, transforms them with spatial data augmentation, and learns an invariant representation in a self-supervised manner. To corroborate the efficacy of STRL, we conduct extensive experiments on three types of datasets (synthetic, indoor, and outdoor). Experimental results demonstrate that, compared with supervised learning methods, the learned self-supervised representation enables various models to attain comparable or even better performance, and the pre-trained models generalize well to downstream tasks, including 3D shape classification, 3D object detection, and 3D semantic segmentation. Moreover, the spatio-temporal contextual cues embedded in 3D point clouds significantly improve the learned representations.
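To make the described training objective concrete, the following is a minimal PyTorch sketch of a BYOL-style self-supervised step on two temporally-correlated point cloud frames, matching the abstract's outline (spatial augmentation of both frames, an online/target network pair, invariance loss). The PointEncoder, the specific augmentations in spatial_augment, and the momentum value are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a BYOL-style spatio-temporal objective, assuming an
# online/target encoder pair as suggested by the abstract. Encoder,
# augmentation, and momentum choices below are placeholders.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def spatial_augment(points):
    """Placeholder spatial augmentation: random up-axis rotation + jitter."""
    theta = torch.rand(()).item() * 2 * math.pi
    c, s = math.cos(theta), math.sin(theta)
    rot = torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return points @ rot.T + 0.01 * torch.randn_like(points)


class PointEncoder(nn.Module):
    """Tiny PointNet-style stand-in: per-point MLP, then max-pooling."""
    def __init__(self, dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, x):                      # x: (B, N, 3)
        return self.mlp(x).max(dim=1).values   # (B, dim) global feature


def byol_loss(p, z):
    """Negative cosine similarity; gradients flow through p only."""
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()


@torch.no_grad()
def momentum_update(online, target, m=0.996):
    """EMA update of the target network from the online network."""
    for po, pt in zip(online.parameters(), target.parameters()):
        pt.mul_(m).add_(po, alpha=1.0 - m)


def strl_step(online, predictor, target, frame_t, frame_tk):
    """One self-supervised step on two temporally-correlated frames."""
    v1, v2 = spatial_augment(frame_t), spatial_augment(frame_tk)
    p1, p2 = predictor(online(v1)), predictor(online(v2))
    with torch.no_grad():
        z1, z2 = target(v1), target(v2)
    return byol_loss(p1, z2) + byol_loss(p2, z1)  # symmetrized loss
```

In this sketch, each optimization step would call strl_step on a pair of frames drawn from the same sequence, backpropagate through the online encoder and predictor only, and then call momentum_update so the target network slowly tracks the online one.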