Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move. This information is useful to make inferences about 3D shape, physical properties and object interactions. While the problem of tracking arbitrary physical points on surfaces over longer video clips has received some attention, no dataset or benchmark for evaluation existed, until now. In this paper, we first formalize the problem, naming it tracking any point (TAP). We introduce a companion benchmark, TAP-Vid, which is composed of both real-world videos with accurate human annotations of point tracks, and synthetic videos with perfect ground-truth point tracks. Central to the construction of our benchmark is a novel semi-automatic crowdsourced pipeline which uses optical flow estimates to compensate for easier, short-term motion like camera shake, allowing annotators to focus on harder sections of video. We validate our pipeline on synthetic data and propose a simple end-to-end point tracking model TAP-Net, showing that it outperforms all prior methods on our benchmark when trained on synthetic data.
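The abstract only sketches the annotation idea: a dense optical-flow estimate propagates a clicked point across frames with easy motion (e.g. camera shake), and annotators only correct it where it drifts. The actual TAP-Vid pipeline is more involved (learned flow plus human correction loops); the snippet below is a minimal illustrative sketch of that propagation step, assuming OpenCV's Farnebäck flow, and the function `propagate_point` and its arguments are hypothetical names introduced here for illustration.

```python
import cv2  # OpenCV; any dense optical-flow estimator would serve the same role


def propagate_point(frames, point):
    """Carry one annotated point forward frame-by-frame using dense optical flow.

    frames: list of grayscale uint8 images; point: (x, y) in frames[0].
    Returns the propagated (x, y) position for every frame.
    Illustrative sketch only -- not the TAP-Vid annotation pipeline itself.
    """
    x, y = float(point[0]), float(point[1])
    track = [(x, y)]
    for t in range(len(frames) - 1):
        # Farneback dense flow with common settings
        # (pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags).
        flow = cv2.calcOpticalFlowFarneback(
            frames[t], frames[t + 1], None, 0.5, 3, 15, 3, 5, 1.2, 0)
        h, w = flow.shape[:2]
        xi = min(max(int(round(x)), 0), w - 1)  # clamp lookup to image bounds
        yi = min(max(int(round(y)), 0), h - 1)
        dx, dy = flow[yi, xi]                   # displacement at the tracked point
        x, y = x + float(dx), y + float(dy)
        track.append((x, y))
    return track
```

In a human-in-the-loop setup like the one the abstract describes, an annotator would inspect such a propagated track and re-click the point only on frames where the flow-based estimate has drifted, which is what lets them spend their effort on the hard sections of video.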