Feature fusion and similarity computation are two core problems in 3D object tracking, especially when tracking objects in sparse, unordered point clouds. Feature fusion can make similarity computation more efficient by incorporating target object information. However, most existing LiDAR-based approaches compute similarity directly from the extracted point cloud features and ignore how attention over object regions changes during tracking. In this paper, we propose a feature fusion network based on the transformer architecture. Benefiting from the self-attention mechanism, the transformer encoder captures the inter- and intra-relations among different regions of the point cloud. Using cross-attention, the transformer decoder fuses features and injects more target cues into the current point cloud feature to compute region attention, which makes similarity computation more efficient. Based on this feature fusion network, we propose an end-to-end point cloud object tracking framework, a simple yet effective method for 3D object tracking using point clouds. Comprehensive experimental results on the KITTI dataset show that our method achieves new state-of-the-art performance. Code is available at: https://github.com/3bobo/lttr.
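
The fusion idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`attention`, `fuse`), the toy feature sizes, and the residual addition are assumptions made for illustration, and the learned projections, multiple heads, normalization, and feed-forward layers of a real transformer are omitted. Self-attention relates regions within each point cloud, and cross-attention then lets each search-region feature gather cues from the target (template) features.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(queries[0])
    weights = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights.append(softmax(scores))
    # Each output row is the attention-weighted sum of the value rows.
    out = [[sum(w * v for w, v in zip(w_row, col)) for col in zip(*values)]
           for w_row in weights]
    return out, weights


def fuse(search_feats, template_feats):
    """Hypothetical fusion step (illustrative only, no learned weights)."""
    # Encoder-like step: self-attention within each point cloud.
    search_enc, _ = attention(search_feats, search_feats, search_feats)
    template_enc, _ = attention(template_feats, template_feats, template_feats)
    # Decoder-like step: search regions query the template regions.
    cross, weights = attention(search_enc, template_enc, template_enc)
    # Residual addition keeps the search feature and adds target cues (assumed).
    fused = [[s + c for s, c in zip(s_row, c_row)]
             for s_row, c_row in zip(search_enc, cross)]
    return fused, weights


random.seed(0)
template = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]  # 3 target regions
search = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]    # 5 search regions
fused, weights = fuse(search, template)
print(len(fused), len(fused[0]))  # one fused feature per search region
```

Each row of `weights` is a distribution over the template regions, which is one way to read the "region attention" that the fused features carry into the similarity computation.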