3D Siamese Transformer Network for Single Object Tracking on Point Clouds

Siamese-network-based trackers formulate 3D single object tracking as cross-correlation learning between the point features of a template and those of a search area. Because the appearance of the target can vary greatly between the template and the search area during tracking, learning a cross-correlation robust enough to identify the potential target in the search area remains a challenging problem. In this paper, we explicitly use Transformers to build a 3D Siamese Transformer network that learns a robust cross-correlation between the template and the search area of point clouds. Specifically, we develop a Siamese point Transformer network to learn shape context information of the target. Its encoder uses self-attention to capture non-local information in the point clouds and thereby characterize the shape of the object, and its decoder uses cross-attention to upsample discriminative point features. We then develop an iterative coarse-to-fine correlation network to learn the robust cross-correlation between the template and the search area. It formulates a cross-feature augmentation that associates the template with the potential target in the search area via cross-attention. To further enhance the potential target, it employs an ego-feature augmentation that applies self-attention to the local k-NN graph in feature space to aggregate target features. Experiments on the KITTI, nuScenes, and Waymo datasets show that our method achieves state-of-the-art performance on the 3D single object tracking task.
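
The two augmentation steps named in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: it assumes unprojected scaled dot-product attention with a residual connection, and the function names (`cross_feature_augmentation`, `ego_feature_augmentation`, `correlate`) and parameters (`k`, `iterations`) are hypothetical. A trained network would presumably add learned query/key/value projections and other layers that the abstract does not specify.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    # Scaled dot-product attention for a single query vector.
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, key) / scale for key in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def cross_feature_augmentation(search, template):
    # Cross-attention: every search feature attends over all template
    # features; the attended context is added back as a residual.
    out = []
    for q in search:
        ctx = attend(q, template, template)
        out.append([a + b for a, b in zip(q, ctx)])
    return out

def knn_indices(features, i, k):
    # Indices of the k nearest neighbours of features[i] in feature space
    # (squared Euclidean distance; the point itself is included).
    dists = [(sum((a - b) ** 2 for a, b in zip(features[i], f)), j)
             for j, f in enumerate(features)]
    return [j for _, j in sorted(dists)[:k]]

def ego_feature_augmentation(features, k):
    # Self-attention restricted to each point's local k-NN graph
    # in feature space, again with a residual connection.
    out = []
    for i, q in enumerate(features):
        nbrs = [features[j] for j in knn_indices(features, i, k)]
        ctx = attend(q, nbrs, nbrs)
        out.append([a + b for a, b in zip(q, ctx)])
    return out

def correlate(search, template, k=2, iterations=2):
    # Iterative refinement: alternate the cross- and ego-feature steps.
    # (Assumed schedule; the abstract only says "iterative coarse-to-fine".)
    for _ in range(iterations):
        search = cross_feature_augmentation(search, template)
        search = ego_feature_augmentation(search, k)
    return search

# Toy 2-D "point features": two template points, three search points.
template_feats = [[1.0, 0.0], [0.0, 1.0]]
search_feats = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
augmented = correlate(search_feats, template_feats, k=2, iterations=1)
```

Restricting the second step to k nearest neighbours in feature space, rather than attending over all points, keeps the aggregation local to points whose features already resemble each other, which matches the abstract's stated aim of aggregating target features.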
