Processing 3D data efficiently has always been a challenge. Spatial operations on large-scale point clouds, which are stored as sparse data, incur additional computational cost. Attracted by the success of transformers, researchers have adopted multi-head attention for vision tasks. However, attention in transformers scales quadratically with the number of inputs and lacks spatial intuition on sets such as point clouds. In this work, we redesign set transformers and incorporate them into a hierarchical framework for shape classification and for part and scene segmentation. We propose a local attention unit that captures features within a spatial neighborhood. We also compute efficient and dynamic global cross-attention by leveraging sampling and grouping at each iteration. Finally, to mitigate the non-uniformity of point clouds, we propose an efficient Multi-Scale Tokenization (MST), which extracts scale-invariant tokens for the attention operations. The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and performs on par with previous segmentation methods while requiring significantly fewer computations. Moreover, our architecture predicts segmentation labels with roughly half the latency and parameter count of the previously most efficient method of comparable performance. The code is available at https://github.com/YigeWang-WHU/CloudAttention.
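To make the local attention idea concrete, the following is a minimal PyTorch sketch of attention restricted to each point's k nearest spatial neighbors, which avoids the quadratic all-pairs attention described above. This is an illustration only, not the released implementation: the names `LocalAttention` and `knn_group` and all hyperparameters are assumptions.

```python
# Minimal sketch (not the authors' code) of local attention over
# k-nearest-neighbor groups of a point cloud. All names are illustrative.
import torch
import torch.nn as nn


def knn_group(xyz, feats, k):
    """Gather each point's k nearest neighbors by Euclidean distance.

    xyz:   (B, N, 3) point coordinates
    feats: (B, N, C) per-point features
    returns (B, N, k, C) neighbor features
    """
    dists = torch.cdist(xyz, xyz)                 # (B, N, N) pairwise distances
    idx = dists.topk(k, largest=False).indices    # (B, N, k) neighbor indices
    b = torch.arange(xyz.size(0)).view(-1, 1, 1)  # broadcastable batch index
    return feats[b, idx]                          # (B, N, k, C)


class LocalAttention(nn.Module):
    """Each point attends only to its k neighbors: cost is O(N*k), not O(N^2)."""

    def __init__(self, dim, k=16):
        super().__init__()
        self.k = k
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)

    def forward(self, xyz, feats):
        q = self.q(feats).unsqueeze(2)            # (B, N, 1, C) one query per point
        neigh = knn_group(xyz, feats, self.k)     # (B, N, k, C) spatial neighborhood
        k_, v = self.kv(neigh).chunk(2, dim=-1)   # keys/values from neighbors
        attn = (q @ k_.transpose(-1, -2)) / k_.size(-1) ** 0.5  # (B, N, 1, k)
        attn = attn.softmax(dim=-1)
        return (attn @ v).squeeze(2)              # (B, N, C) aggregated features


# Usage: 2 clouds of 1024 points with 64-dim features
xyz = torch.randn(2, 1024, 3)
feats = torch.randn(2, 1024, 64)
out = LocalAttention(dim=64, k=16)(xyz, feats)
print(out.shape)  # torch.Size([2, 1024, 64])
```

In this sketch the neighborhood is fixed by coordinates, so the attention weights carry the spatial locality that plain set attention lacks; the hierarchical framework in the paper additionally resamples and regroups points between stages.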