The transformer, as an alternative to the CNN, has proven effective across many modalities (e.g., text and images). For 3D point cloud transformers, existing efforts have focused primarily on pushing accuracy to the state-of-the-art level. However, their latency lags behind that of sparse convolution-based models (3x slower), hindering their use in resource-constrained, latency-sensitive applications such as autonomous driving. This inefficiency stems from the sparse and irregular nature of point clouds, whereas transformers are designed for dense, regular workloads. This paper presents FlatFormer, which closes this latency gap by trading spatial proximity for computational regularity. We first flatten the point cloud with window-based sorting and partition the points into groups of equal sizes rather than windows of equal shapes, which avoids expensive structuring and padding overheads. We then apply self-attention within each group to extract local features, alternate the sorting axis to gather features from different directions, and shift windows to exchange features across groups. FlatFormer delivers state-of-the-art accuracy on the Waymo Open Dataset with a 4.6x speedup over (transformer-based) SST and a 1.4x speedup over (sparse convolutional) CenterPoint. It is the first point cloud transformer to achieve real-time performance on edge GPUs, running faster than sparse convolutional methods while matching or exceeding their accuracy on large-scale benchmarks. Code to reproduce our results will be made publicly available.
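To make the flatten-and-group step concrete, below is a minimal NumPy sketch of window-based sorting followed by partitioning into equal-size groups, as described above. The function name `flatten_and_group`, the default window and group sizes, and the tail-padding strategy are illustrative assumptions for exposition, not the paper's released implementation.

```python
import numpy as np

def flatten_and_group(coords, window_size=(8, 8), group_size=64, major_axis=0):
    """Sketch of flattening a sparse point cloud with window-based sorting,
    then partitioning into groups of equal SIZE (not windows of equal shape).

    coords     : (N, 2) integer BEV voxel coordinates of non-empty voxels
                 (hypothetical input format assumed for this sketch).
    major_axis : axis sorted first; alternating it across blocks gathers
                 features from different directions.
    Returns the flattened point order and (num_groups, group_size) index groups.
    """
    coords = np.asarray(coords)
    win = coords // np.asarray(window_size)      # window each point falls into
    minor = 1 - major_axis
    # np.lexsort treats its LAST key as most significant, so this orders points
    # by window first, then by in-window coordinate along the chosen axis.
    order = np.lexsort((coords[:, minor], coords[:, major_axis],
                        win[:, minor], win[:, major_axis]))
    # Pad only the tail (here by repeating the last index) so every group has
    # exactly group_size points: a regular, fully batchable workload with no
    # per-window structuring or padding overhead.
    pad = (-len(order)) % group_size
    padded = np.pad(order, (0, pad), mode="edge")
    return order, padded.reshape(-1, group_size)

# Illustrative usage: self-attention would then be applied within each group,
# and shifting the window origin between blocks exchanges features across groups.
coords = np.random.randint(0, 64, size=(1000, 2))
order, groups = flatten_and_group(coords, major_axis=0)  # sort along one axis
order, groups = flatten_and_group(coords, major_axis=1)  # next block: the other
```

The key design choice this sketch illustrates is that group boundaries follow the flattened ordering rather than spatial window boundaries, so every group has an identical point count and maps directly onto dense attention kernels.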