Effective learning of spatial-temporal information within a point cloud sequence is highly important for many downstream tasks such as 4D semantic segmentation and 3D action recognition. In this paper, we propose a novel framework named Point Spatial-Temporal Transformer (PST2) to learn spatial-temporal representations from dynamic 3D point cloud sequences. Our PST2 consists of two major modules: a Spatio-Temporal Self-Attention (STSA) module and a Resolution Embedding (RE) module. The STSA module is introduced to capture spatial-temporal context information across adjacent frames, while the RE module is proposed to aggregate features across neighbors to enhance the resolution of feature maps. We test the effectiveness of our PST2 on two different tasks on point cloud sequences, i.e., 4D semantic segmentation and 3D action recognition. Extensive experiments on three benchmarks show that our PST2 outperforms existing methods on all datasets. The effectiveness of our STSA and RE modules has also been justified with ablation experiments.
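To make the STSA idea concrete, below is a minimal sketch of a spatio-temporal self-attention block: patch features pooled from adjacent frames are concatenated along the token axis so that attention mixes information both spatially and temporally. This is an illustrative assumption, not the authors' implementation; all names (STSABlock, num_heads, the token layout) are hypothetical.

```python
# Minimal sketch of spatio-temporal self-attention over patch features
# drawn from adjacent frames. Hypothetical names; not the official PST2 code.
import torch
import torch.nn as nn

class STSABlock(nn.Module):
    """Self-attention across patch tokens pooled from adjacent frames."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim)
        )

    def forward(self, x):
        # x: (batch, frames * patches, dim). Tokens from adjacent frames
        # share one sequence, so each attention weight can span both a
        # spatial neighbor and a temporal neighbor.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                 # residual over attention
        x = x + self.mlp(self.norm2(x))  # residual over feed-forward
        return x

# Toy usage: 2 adjacent frames, 64 patches per frame, 128-d features.
tokens = torch.randn(8, 2 * 64, 128)
out = STSABlock(128)(tokens)
print(out.shape)  # torch.Size([8, 128, 128])
```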