What would be the effect of locally poking a static scene? We present an approach that learns, at a pixel level, the natural-looking global articulations caused by a local manipulation. Training requires only videos of moving objects, but no information about the underlying manipulation of the physical scene. Our generative model learns to infer natural object dynamics as a response to user interaction and learns the interrelations between different regions of the object body. Given a static image of an object and a local poke at a pixel, the approach then predicts how the object would deform over time. In contrast to existing work on video prediction, we do not synthesize arbitrary realistic videos but enable local interactive control of the deformation. Our model is not restricted to particular object categories and can transfer dynamics onto novel, unseen object instances. Extensive experiments on diverse objects demonstrate the effectiveness of our approach compared to common video prediction frameworks. The project page is available at https://bit.ly/3cxfA2L.
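To make the interaction interface concrete, here is a minimal, hypothetical sketch of the input/output contract the abstract describes: a generative model that maps a static image plus a local poke (a pixel location and a displacement) to a short video of plausible deformation. All names, shapes, and the placeholder network body are illustrative assumptions, not the authors' actual architecture or API.

```python
# Hypothetical sketch of the poke-conditioned video synthesis interface.
# The real model is a learned generative network; nn.Identity here is only
# a stand-in so the sketch runs end to end.
import torch
import torch.nn as nn


class PokeToVideoModel(nn.Module):
    """Placeholder: (static image, local poke) -> predicted video."""

    def __init__(self, frames: int = 10):
        super().__init__()
        self.frames = frames
        self.backbone = nn.Identity()  # stand-in for the learned generator

    def forward(self, image: torch.Tensor, poke: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W); poke: (B, 4) = (x, y, dx, dy) in pixels.
        # A trained model would propagate the local poke into a globally
        # consistent deformation; here we simply repeat the input frame.
        frame = self.backbone(image)
        video = frame.unsqueeze(1).repeat(1, self.frames, 1, 1, 1)
        return video  # (B, T, 3, H, W)


model = PokeToVideoModel()
image = torch.rand(1, 3, 128, 128)               # static source image
poke = torch.tensor([[64.0, 64.0, 5.0, -3.0]])   # poke pixel (64, 64), shift (5, -3)
video = model(image, poke)
print(video.shape)  # torch.Size([1, 10, 3, 128, 128])
```

The key design point the abstract emphasizes is the conditioning: unlike unconditional video prediction, the poke is an explicit control signal, so the same source image can yield different deformations for different pokes.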