Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move. This information is useful to make inferences about 3D shape, physical properties and object interactions. While the problem of tracking arbitrary physical points on surfaces over longer video clips has received some attention, no dataset or benchmark for evaluation existed, until now. In this paper, we first formalize the problem, naming it tracking any point (TAP). We introduce a companion benchmark, TAP-Vid, which is composed of both real-world videos with accurate human annotations of point tracks, and synthetic videos with perfect ground-truth point tracks. Central to the construction of our benchmark is a novel semi-automatic crowdsourced pipeline which uses optical flow estimates to compensate for easier, short-term motion like camera shake, allowing annotators to focus on harder sections of video. We validate our pipeline on synthetic data and propose a simple end-to-end point tracking model TAP-Net, showing that it outperforms all prior methods on our benchmark when trained on synthetic data.
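The abstract only sketches the annotation idea: a dense optical-flow estimate propagates a clicked point across frames with easy motion (e.g. camera shake), and annotators only correct it where it drifts. The actual TAP-Vid pipeline is more involved (learned flow plus human correction loops); the snippet below is a minimal illustrative sketch of that propagation step, assuming OpenCV's Farnebäck flow, and the function `propagate_point` and its arguments are hypothetical names introduced here for illustration.

```python
import cv2  # OpenCV; any dense optical-flow estimator would serve the same role


def propagate_point(frames, point):
    """Carry one annotated point forward frame-by-frame using dense optical flow.

    frames: list of grayscale uint8 images; point: (x, y) in frames[0].
    Returns the propagated (x, y) position for every frame.
    Illustrative sketch only -- not the TAP-Vid annotation pipeline itself.
    """
    x, y = float(point[0]), float(point[1])
    track = [(x, y)]
    for t in range(len(frames) - 1):
        # Farneback dense flow with common settings
        # (pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags).
        flow = cv2.calcOpticalFlowFarneback(
            frames[t], frames[t + 1], None, 0.5, 3, 15, 3, 5, 1.2, 0)
        h, w = flow.shape[:2]
        xi = min(max(int(round(x)), 0), w - 1)  # clamp lookup to image bounds
        yi = min(max(int(round(y)), 0), h - 1)
        dx, dy = flow[yi, xi]                   # displacement at the tracked point
        x, y = x + float(dx), y + float(dy)
        track.append((x, y))
    return track
```

In a human-in-the-loop setup like the one the abstract describes, an annotator would inspect such a propagated track and re-click the point only on frames where the flow-based estimate has drifted, which is what lets them spend their effort on the hard sections of video.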