Current approaches to video analysis of human motion treat raw pixels or keypoints as the basic units of reasoning. We posit that adding higher-level motion primitives, which capture natural, coarser units of motion such as a backswing or follow-through, can improve downstream analysis tasks. This higher level of abstraction can also capture key features, such as loops of repeated primitives, that are currently inaccessible at lower levels of representation. We therefore introduce Motion Programs, a neuro-symbolic, program-like representation that expresses motions as a composition of high-level primitives. We also present a system for automatically inducing motion programs from videos of human motion and for leveraging motion programs in video synthesis. Experiments show that motion programs can accurately describe a diverse set of human motions and that the inferred programs contain semantically meaningful motion primitives, such as arm swings and jumping jacks. Our representation also benefits downstream tasks such as video interpolation and video prediction, where it outperforms off-the-shelf models. We further demonstrate how these programs can detect diverse kinds of repetitive motion and facilitate interactive video editing.
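The abstract does not say how primitives or loops are represented internally. As a toy illustration only, the sketch below assumes each video segment has already been assigned a discrete primitive label (a strong simplification: the representation is described as neuro-symbolic, so real primitives are unlikely to be plain strings). It shows how a program-like structure can make repetition explicit by greedily folding back-to-back repeats of a short block of labels into loop nodes. The names `Primitive`, `Loop`, `fold_repeats`, `expand`, and the `max_period` parameter are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Primitive:
    # Hypothetical discrete label for one motion primitive, e.g. "arm_swing".
    name: str


@dataclass(frozen=True)
class Loop:
    # A block of statements repeated `count` times.
    body: Tuple["Statement", ...]
    count: int


Statement = Union[Primitive, Loop]


def fold_repeats(labels: List[str], max_period: int = 4) -> List[Statement]:
    """Greedily fold consecutive repeats of a short label block into Loop nodes.

    At each position, try every block length k <= max_period, count how many
    times the block repeats back to back, and keep the (k, reps) pair that
    covers the most labels (ties go to the smaller k).
    """
    program: List[Statement] = []
    i, n = 0, len(labels)
    while i < n:
        best_k, best_reps = 1, 1
        for k in range(1, max_period + 1):
            block = labels[i:i + k]
            if len(block) < k:
                break  # not enough labels left for a block of this length
            reps = 1
            while labels[i + reps * k:i + (reps + 1) * k] == block:
                reps += 1
            if reps > 1 and reps * k > best_reps * best_k:
                best_k, best_reps = k, reps
        body = tuple(Primitive(name) for name in labels[i:i + best_k])
        # A single non-repeated label becomes a bare Primitive.
        program.append(Loop(body, best_reps) if best_reps > 1 else body[0])
        i += best_k * best_reps
    return program


def expand(program: List[Statement]) -> List[str]:
    """Unroll a program back into its flat label sequence."""
    out: List[str] = []
    for stmt in program:
        if isinstance(stmt, Loop):
            for _ in range(stmt.count):
                out.extend(expand(list(stmt.body)))
        else:
            out.append(stmt.name)
    return out


labels = ["raise", "swing", "swing", "swing", "jump", "clap", "jump", "clap", "land"]
program = fold_repeats(labels)
# program: raise, Loop(swing) x 3, Loop(jump, clap) x 2, land
assert expand(program) == labels  # folding is lossless
```

The greedy, coverage-maximizing choice is just one simple heuristic for exposing loops; a system that induces programs from video would also have to segment the motion and fit the primitives themselves, which this sketch takes as given.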