Transformers, as an alternative to CNNs, have proven effective in many modalities (e.g., text and images). For 3D point cloud transformers, existing efforts have focused primarily on pushing their accuracy to the state-of-the-art level. However, their latency lags behind that of sparse convolution-based models (3x slower), hindering their use in resource-constrained, latency-sensitive applications such as autonomous driving. This inefficiency stems from the sparse and irregular nature of point clouds, whereas transformers are designed for dense, regular workloads. This paper presents FlatFormer, which closes this latency gap by trading spatial proximity for better computational regularity. We first flatten the point cloud with window-based sorting and partition the points into groups of equal sizes, rather than windows of equal shapes, which avoids expensive structuring and padding overheads. We then apply self-attention within groups to extract local features, alternate the sorting axis to gather features from different directions, and shift windows to exchange features across groups. FlatFormer delivers state-of-the-art accuracy on the Waymo Open Dataset with a 4.6x speedup over (transformer-based) SST and a 1.4x speedup over (sparse convolutional) CenterPoint. It is the first point cloud transformer to achieve real-time performance on edge GPUs and to run faster than sparse convolutional methods while achieving on-par or superior accuracy on large-scale benchmarks.
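To make the flatten-and-group step concrete, below is a minimal sketch assuming a PyTorch implementation. The function name `flatten_and_group`, the `window_size` and `group_size` values, and the exact sorting key are illustrative assumptions, not the official FlatFormer code; it only demonstrates the core idea of sorting points by window and splitting the sorted sequence into equal-size groups.

```python
# Sketch (not the official implementation): flatten a point cloud with
# window-based sorting, then partition it into equal-size groups rather
# than equal-shape windows, yielding a perfectly regular workload.
import torch

def flatten_and_group(coords: torch.Tensor, feats: torch.Tensor,
                      window_size: float = 4.0, group_size: int = 64,
                      axis: int = 0):
    """coords: (N, 2) BEV coordinates; feats: (N, C) point features."""
    # 1) Assign each point to a window; `axis` picks x-major or y-major
    #    order, which is alternated across blocks in the paper.
    win = torch.div(coords, window_size, rounding_mode="floor").long()
    major, minor = (0, 1) if axis == 0 else (1, 0)
    # 2) Window-based sorting: a single scalar key per point so that
    #    spatially close points become contiguous in the sequence.
    key = win[:, major] * (win[:, minor].max() + 1) + win[:, minor]
    order = torch.sort(key, stable=True).indices
    feats = feats[order]
    # 3) Pad to a multiple of group_size, then split into equal-size
    #    groups -- no per-window padding or structuring is needed.
    pad = (-feats.shape[0]) % group_size
    if pad:
        feats = torch.cat([feats, feats.new_zeros(pad, feats.shape[1])])
    groups = feats.view(-1, group_size, feats.shape[1])  # (G, group_size, C)
    return groups, order
```

Self-attention is then applied independently within each group, for example:

```python
coords = torch.rand(1000, 2) * 100    # BEV (x, y) coordinates of 1000 points
feats = torch.rand(1000, 64)          # 64-dim feature per point
groups, order = flatten_and_group(coords, feats)   # groups: (16, 64, 64)
attn = torch.nn.MultiheadAttention(64, num_heads=4, batch_first=True)
out, _ = attn(groups, groups, groups)  # local self-attention within each group
```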