Estimating the target extent poses a fundamental challenge in visual object tracking. Typically, trackers are box-centric and fully rely on a bounding box to define the target in the scene. In practice, objects often have complex shapes and are not aligned with the image axis. In these cases, bounding boxes do not provide an accurate description of the target and often contain a majority of background pixels. We propose a segmentation-centric tracking pipeline that not only produces a highly accurate segmentation mask, but also internally works with segmentation masks instead of bounding boxes. Thus, our tracker is able to better learn a target representation that clearly differentiates the target in the scene from background content. In order to achieve the necessary robustness for the challenging tracking scenario, we propose a separate instance localization component that is used to condition the segmentation decoder when producing the output mask. We infer a bounding box from the segmentation mask, validate our tracker on challenging tracking datasets and achieve the new state of the art on LaSOT with a success AUC score of 69.7%. Since most tracking datasets do not contain mask annotations, we cannot use them to evaluate predicted segmentation masks. Instead, we validate our segmentation quality on two popular video object segmentation datasets.
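The abstract's final step, inferring a bounding box from the predicted segmentation mask, can be made concrete with a short sketch. The snippet below is only an illustration of that conversion: the function name `mask_to_bbox` and the thresholding of a soft mask are assumptions, not the authors' implementation; it simply takes the tightest axis-aligned box around the foreground pixels.

```python
import numpy as np


def mask_to_bbox(mask: np.ndarray, threshold: float = 0.5):
    """Convert an (H, W) segmentation mask into an axis-aligned bounding
    box (x_min, y_min, width, height) in pixel coordinates.

    `mask` may be a soft probability map or a binary mask; values above
    `threshold` are treated as foreground (an assumption made here for
    illustration). Returns None if no foreground pixels are predicted.
    """
    ys, xs = np.nonzero(mask > threshold)
    if xs.size == 0:
        return None  # empty mask: no box can be inferred
    x_min, x_max = xs.min(), xs.max()
    y_min, y_max = ys.min(), ys.max()
    return (int(x_min), int(y_min),
            int(x_max - x_min + 1), int(y_max - y_min + 1))


# Small usage example: a 10x10 mask with a 3x4 foreground blob.
mask = np.zeros((10, 10))
mask[2:5, 3:7] = 1.0
print(mask_to_bbox(mask))  # (3, 2, 4, 3)
```

This tight-box conversion is what allows a segmentation-centric tracker to be evaluated on box-annotated benchmarks such as LaSOT, even though the tracker itself reasons about masks internally.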