Compared with traditional RGB-only visual tracking, few datasets have been constructed for RGB-D tracking. In this paper, we propose ARKitTrack, a new RGB-D tracking dataset for both static and dynamic scenes, captured by the consumer-grade LiDAR scanners built into Apple's iPhone and iPad. ARKitTrack contains 300 RGB-D sequences, 455 targets, and 229.7K video frames in total. Along with bounding box annotations and frame-level attributes, we also annotate the dataset with 123.9K pixel-level target masks. In addition, the camera intrinsics and camera pose of each frame are provided to support future work. To demonstrate the potential usefulness of this dataset, we further present a unified baseline for both box-level and pixel-level tracking, which integrates RGB features with bird's-eye-view representations to better explore cross-modality 3D geometry. In-depth empirical analysis verifies that the ARKitTrack dataset can significantly facilitate RGB-D tracking and that the proposed baseline compares favorably against state-of-the-art methods. The code and dataset are available at https://arkittrack.github.io.
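As a rough illustration of why per-frame depth and camera intrinsics make a bird's-eye-view (BEV) representation possible, the sketch below lifts depth pixels to 3D camera coordinates with a standard pinhole model and counts them in a top-down grid. It is not the ARKitTrack baseline: the function names `backproject` and `bev_occupancy`, the grid size, and the metric ranges are all hypothetical choices made for this example.

```python
from typing import List, Sequence, Tuple


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float
                ) -> Tuple[float, float, float]:
    """Lift pixel (u, v) with metric depth to camera coordinates (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return x, y, depth


def bev_occupancy(depth_map: Sequence[Sequence[float]],
                  fx: float, fy: float, cx: float, cy: float,
                  x_range: Tuple[float, float] = (-2.0, 2.0),  # assumed lateral extent (m)
                  z_range: Tuple[float, float] = (0.0, 4.0),   # assumed forward extent (m)
                  cells: int = 4) -> List[List[int]]:
    """Count back-projected points per cell of a cells x cells top-down grid.

    Rows index forward distance (z), columns index lateral offset (x).
    Height (y) is discarded, which is what makes the view "bird's-eye".
    """
    x_min, x_max = x_range
    z_min, z_max = z_range
    grid = [[0] * cells for _ in range(cells)]
    for v, row in enumerate(depth_map):
        for u, d in enumerate(row):
            if d <= 0:  # treat non-positive depth as a missing measurement
                continue
            x, _, z = backproject(u, v, d, fx, fy, cx, cy)
            if not (x_min <= x < x_max and z_min <= z < z_max):
                continue  # point falls outside the grid
            col = int((x - x_min) / (x_max - x_min) * cells)
            r = int((z - z_min) / (z_max - z_min) * cells)
            grid[r][col] += 1
    return grid


if __name__ == "__main__":
    # Toy 2x2 depth map with one missing value; intrinsics are made up.
    toy_depth = [[1.0, 0.0],
                 [1.0, 3.0]]
    for line in bev_occupancy(toy_depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5):
        print(line)
```

A learned tracker would typically pool image features into such a grid rather than raw point counts; the counting here only shows the geometry of the projection.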