Transformers, as an alternative to CNNs, have proven effective in many modalities (e.g., text and images). For 3D point cloud transformers, existing efforts have focused primarily on pushing their accuracy to the state-of-the-art level. However, their latency lags behind that of sparse convolution-based models (3x slower), hindering their use in resource-constrained, latency-sensitive applications such as autonomous driving. This inefficiency stems from the sparse and irregular nature of point clouds, whereas transformers are designed for dense, regular workloads. This paper presents FlatFormer, which closes this latency gap by trading spatial proximity for better computational regularity. We first flatten the point cloud with window-based sorting and partition the points into groups of equal sizes, rather than windows of equal shapes, which avoids expensive structuring and padding overheads. We then apply self-attention within groups to extract local features, alternate the sorting axis to gather features from different directions, and shift windows to exchange features across groups. FlatFormer delivers state-of-the-art accuracy on the Waymo Open Dataset with a 4.6x speedup over (transformer-based) SST and a 1.4x speedup over (sparse convolutional) CenterPoint. It is the first point cloud transformer to achieve real-time performance on edge GPUs and to run faster than sparse convolutional methods while achieving on-par or superior accuracy on large-scale benchmarks.
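To make the flatten-and-group step concrete, below is a minimal sketch assuming a PyTorch implementation. The function name `flatten_and_group`, the `window_size` and `group_size` values, and the exact sorting key are illustrative assumptions, not the official FlatFormer code; it only demonstrates the core idea of sorting points by window and splitting the sorted sequence into equal-size groups.

```python
# Sketch (not the official implementation): flatten a point cloud with
# window-based sorting, then partition it into equal-size groups rather
# than equal-shape windows, yielding a perfectly regular workload.
import torch

def flatten_and_group(coords: torch.Tensor, feats: torch.Tensor,
                      window_size: float = 4.0, group_size: int = 64,
                      axis: int = 0):
    """coords: (N, 2) BEV coordinates; feats: (N, C) point features."""
    # 1) Assign each point to a window; `axis` picks x-major or y-major
    #    order, which is alternated across blocks in the paper.
    win = torch.div(coords, window_size, rounding_mode="floor").long()
    major, minor = (0, 1) if axis == 0 else (1, 0)
    # 2) Window-based sorting: a single scalar key per point so that
    #    spatially close points become contiguous in the sequence.
    key = win[:, major] * (win[:, minor].max() + 1) + win[:, minor]
    order = torch.sort(key, stable=True).indices
    feats = feats[order]
    # 3) Pad to a multiple of group_size, then split into equal-size
    #    groups -- no per-window padding or structuring is needed.
    pad = (-feats.shape[0]) % group_size
    if pad:
        feats = torch.cat([feats, feats.new_zeros(pad, feats.shape[1])])
    groups = feats.view(-1, group_size, feats.shape[1])  # (G, group_size, C)
    return groups, order
```

Self-attention is then applied independently within each group, for example:

```python
coords = torch.rand(1000, 2) * 100    # BEV (x, y) coordinates of 1000 points
feats = torch.rand(1000, 64)          # 64-dim feature per point
groups, order = flatten_and_group(coords, feats)   # groups: (16, 64, 64)
attn = torch.nn.MultiheadAttention(64, num_heads=4, batch_first=True)
out, _ = attn(groups, groups, groups)  # local self-attention within each group
```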