Transformer-based models have significantly advanced natural language processing and computer vision in recent years. However, due to the irregular and unordered structure of point cloud data, transformer-based models for 3D deep learning remain in their infancy compared with other methods. In this paper, we present the Point Cross-Attention Transformer (PointCAT), a novel end-to-end network architecture that uses a cross-attention mechanism for point cloud representation. Our approach combines multi-scale features via two separate cross-attention transformer branches. To reduce the computational cost introduced by the multi-branch structure, we further introduce an efficient variant for shape classification, which processes only the class token of one branch as a query to compute the attention map with the other branch. Extensive experiments demonstrate that our method outperforms or achieves comparable performance to several approaches on shape classification, part segmentation, and semantic segmentation tasks.
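To make the class-token cross-attention concrete, the following is a minimal sketch, assuming a PyTorch setting; the module name `ClassTokenCrossAttention`, the head count, and the tensor shapes are illustrative assumptions, not the authors' implementation. It shows the key efficiency idea from the abstract: the query is a single class token from one branch, so attention cost scales with the other branch's token count rather than with the product of both.

```python
import torch
import torch.nn as nn

class ClassTokenCrossAttention(nn.Module):
    """Sketch of class-token cross-attention (hypothetical module, not the
    authors' code): a single class token from branch A attends over the
    point tokens of branch B."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Standard multi-head attention; batch_first gives (B, seq, dim) tensors.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, cls_token: torch.Tensor, other_tokens: torch.Tensor) -> torch.Tensor:
        # cls_token:     (B, 1, C) class token from branch A (the query)
        # other_tokens:  (B, N, C) point tokens from branch B (keys and values)
        fused, _ = self.attn(query=cls_token, key=other_tokens, value=other_tokens)
        return fused  # (B, 1, C): class token enriched with cross-branch context

# Illustrative usage with assumed shapes (batch 2, 128 point tokens, 64 channels):
cls = torch.randn(2, 1, 64)
tokens = torch.randn(2, 128, 64)
out = ClassTokenCrossAttention(dim=64)(cls, tokens)
print(out.shape)  # torch.Size([2, 1, 64])
```

Because the query sequence has length 1, the attention map here is (B, heads, 1, N) instead of (B, heads, N, N), which is the computational saving the abstract refers to.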