Deep learning on point clouds has received increased attention thanks to its wide applications in AR/VR and autonomous driving. These applications require low latency and high accuracy to provide a real-time user experience and ensure user safety. Unlike conventional dense workloads, the sparse and irregular nature of point clouds poses severe challenges to running sparse CNNs efficiently on general-purpose hardware. Furthermore, existing sparse acceleration techniques for 2D images do not translate to 3D point clouds. In this paper, we introduce TorchSparse, a high-performance point cloud inference engine that accelerates sparse convolution computation on GPUs. TorchSparse directly optimizes the two bottlenecks of sparse convolution: irregular computation and data movement. It applies adaptive matrix multiplication grouping to trade computation for better regularity, achieving 1.4-1.5x speedup for matrix multiplication. It also optimizes data movement by adopting vectorized, quantized, and fused locality-aware memory access, reducing the memory movement cost by 2.7x. Evaluated on seven representative models across three benchmark datasets, TorchSparse achieves 1.6x and 1.5x measured end-to-end speedup over the state-of-the-art MinkowskiEngine and SpConv, respectively.
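To make the two bottlenecks concrete, the following is a minimal sketch (not the TorchSparse API) of the gather-GEMM-scatter dataflow that sparse convolution typically reduces to for a single kernel offset: the gather and scatter steps are the irregular data movement the abstract refers to, and the dense matrix multiplication in the middle is what matrix multiplication grouping regularizes. The function name, tensor names, and kernel-map representation (`in_idx`, `out_idx`) are illustrative assumptions, not the library's actual interface.

```python
# Illustrative sketch of sparse convolution for one kernel offset
# (hypothetical names; not the TorchSparse implementation).
import torch

def sparse_conv_one_offset(in_feats, kernel, in_idx, out_idx, num_out):
    """in_feats: (N, C_in) features of active input points.
    kernel: (C_in, C_out) weight matrix for this kernel offset.
    in_idx/out_idx: (M,) kernel map pairing input points with the
    output points they contribute to under this offset."""
    # Gather: irregular read of the participating input features.
    gathered = in_feats[in_idx]                      # (M, C_in)
    # Dense GEMM on the gathered, now-regular matrix.
    partial = gathered @ kernel                      # (M, C_out)
    # Scatter-accumulate: irregular write into the output feature map.
    out_feats = torch.zeros(num_out, kernel.shape[1],
                            dtype=in_feats.dtype, device=in_feats.device)
    out_feats.index_add_(0, out_idx, partial)
    return out_feats
```

In a full sparse convolution this step is repeated for every kernel offset; the per-offset GEMMs have different, workload-dependent sizes, which is why grouping them adaptively (at the cost of some redundant computation) improves regularity on GPUs.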