Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move. This information is useful for making inferences about 3D shape, physical properties, and object interactions. While the problem of tracking arbitrary physical points on surfaces over longer video clips has received some attention, no dataset or benchmark for evaluation existed until now. In this paper, we first formalize the problem, naming it tracking any point (TAP). We introduce a companion benchmark, TAP-Vid, which is composed of both real-world videos with accurate human annotations of point tracks and synthetic videos with perfect ground-truth point tracks. Central to the construction of our benchmark is a novel semi-automatic crowdsourced pipeline which uses optical flow estimates to compensate for easier, short-term motion like camera shake, allowing annotators to focus on harder sections of video. We validate our pipeline on synthetic data and propose a simple end-to-end point tracking model, TAP-Net, showing that it outperforms all prior methods on our benchmark when trained on synthetic data.
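The annotation pipeline above rests on a simple idea: propagate a candidate point location forward with dense optical flow so that annotators only need to correct frames where the estimate drifts. The sketch below illustrates that idea in a minimal form; the function names, the dense flow input (e.g. from an off-the-shelf estimator), and the bilinear sampling are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def propagate_point(flow: np.ndarray, point_xy: np.ndarray) -> np.ndarray:
    """Advance a 2-D point one frame forward using a dense optical flow field.

    flow:     (H, W, 2) array of per-pixel (dx, dy) displacements from frame t
              to frame t+1 (assumed to come from any off-the-shelf estimator).
    point_xy: (x, y) pixel coordinates of the tracked point at frame t.
    Returns the flow-propagated (x, y) estimate at frame t+1.
    """
    h, w, _ = flow.shape
    x, y = float(point_xy[0]), float(point_xy[1])
    # Indices of the four neighbouring pixels, clamped to the image bounds.
    x0 = int(np.clip(np.floor(x), 0, w - 1))
    y0 = int(np.clip(np.floor(y), 0, h - 1))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    # Bilinear interpolation of the flow vectors at the sub-pixel location.
    f = ((1 - ay) * ((1 - ax) * flow[y0, x0] + ax * flow[y0, x1])
         + ay * ((1 - ax) * flow[y1, x0] + ax * flow[y1, x1]))
    return np.asarray(point_xy, dtype=np.float32) + f

def propagate_track(flows: list, start_xy: tuple) -> np.ndarray:
    """Chain per-frame propagation over a clip to produce an initial track.

    In a semi-automatic setup, an annotator would then inspect this track and
    re-annotate only the frames where the flow-based estimate leaves the surface.
    """
    track = [np.asarray(start_xy, dtype=np.float32)]
    for flow in flows:
        track.append(propagate_point(flow, track[-1]))
    return np.stack(track)
```

In this hypothetical setup, easy motion such as camera shake is absorbed by the flow propagation, and human effort is concentrated on the harder segments, which is the division of labour the abstract describes.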