We propose novel motion representations for animating articulated objects consisting of distinct parts. In a completely unsupervised manner, our method identifies object parts, tracks them in a driving video, and infers their motions by considering their principal axes. In contrast to previous keypoint-based works, our method extracts meaningful and consistent regions describing location, shape, and pose. The regions correspond to semantically relevant, distinct object parts that are more easily detected in the frames of the driving video. To force the decoupling of foreground from background, we model non-object-related global motion with an additional affine transformation. To facilitate animation and prevent leakage of the shape of the driving object, we disentangle the shape and pose of objects in the region space. Our model can animate a variety of objects, surpassing previous methods by a large margin on existing benchmarks. We present a challenging new benchmark with high-resolution videos and show that the improvement is particularly pronounced when articulated objects are considered, reaching 96.6% user preference over the state of the art.
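To make the principal-axis idea concrete, the following is a minimal NumPy sketch, not the paper's implementation: the function names, the regularization constant, and the demo heatmaps are illustrative assumptions. It fits an affine frame to a soft region heatmap via the eigendecomposition of the heatmap's spatial covariance, then combines a source frame and a driving frame into an affine motion for that region.

```python
import numpy as np

def region_frame(heatmap):
    """Fit an affine frame to a soft region heatmap.

    heatmap: (H, W) non-negative weights for one region.
    Returns the region center (2,) and a (2, 2) matrix whose columns are
    the principal axes scaled by the square roots of the eigenvalues of
    the region's spatial covariance.
    """
    h, w = heatmap.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    grid = np.stack([xs, ys], axis=-1).astype(np.float64)  # (H, W, 2)

    mass = heatmap.sum()
    mean = (heatmap[..., None] * grid).sum(axis=(0, 1)) / mass  # center

    diff = grid - mean
    # Weighted 2x2 covariance of pixel coordinates under the heatmap,
    # with a small ridge for numerical stability (illustrative constant).
    cov = np.einsum("hw,hwi,hwj->ij", heatmap, diff, diff) / mass
    cov += 1e-6 * np.eye(2)

    # Principal axes of the region. Note: eigenvector signs are ambiguous;
    # this sketch ignores that, whereas a full method must disambiguate them.
    eigvals, eigvecs = np.linalg.eigh(cov)
    frame = eigvecs @ np.diag(np.sqrt(eigvals))
    return mean, frame

def motion_between(src_heatmap, drv_heatmap):
    """Affine motion carrying the region from the source frame to the
    driving frame: x_drv = A @ (x_src - mean_src) + mean_drv."""
    mean_s, frame_s = region_frame(src_heatmap)
    mean_d, frame_d = region_frame(drv_heatmap)
    A = frame_d @ np.linalg.inv(frame_s)
    return A, mean_s, mean_d

if __name__ == "__main__":
    def blob(center, sx, sy, h=64, w=64):
        ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
        g = np.exp(-(((xs - center[0]) / sx) ** 2
                     + ((ys - center[1]) / sy) ** 2) / 2)
        return g / g.sum()

    src = blob((20, 32), 6, 2)  # elongated part in the source frame
    drv = blob((40, 32), 2, 6)  # same part, translated and rotated ~90 deg
    A, mean_s, mean_d = motion_between(src, drv)
    print("affine A:\n", A)
    print("translation:", mean_d - mean_s)
```

Under these assumptions, the recovered affine encodes the rotation, scaling, and translation of each part, which is the richer per-part descriptor that keypoints alone cannot provide.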