To help agents reason about scenes in terms of their building blocks, we wish to extract the compositional structure of any given scene (in particular, the configuration and characteristics of objects comprising the scene). This problem is especially difficult when scene structure needs to be inferred while also estimating the agent's location/viewpoint, as the two variables jointly give rise to the agent's observations. We present an unsupervised variational approach to this problem. Leveraging the shared structure that exists across different scenes, our model learns to infer two sets of latent representations from RGB video input alone: a set of "object" latents, corresponding to the time-invariant, object-level contents of the scene, as well as a set of "frame" latents, corresponding to global time-varying elements such as viewpoint. This factorization of latents allows our model, SIMONe, to represent object attributes in an allocentric manner which does not depend on viewpoint. Moreover, it allows us to disentangle object dynamics and summarize their trajectories as time-abstracted, view-invariant, per-object properties. We demonstrate these capabilities, as well as the model's performance in terms of view synthesis and instance segmentation, across three procedurally generated video datasets.
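Below is a minimal, illustrative sketch (not the authors' implementation) of the latent factorization the abstract describes: K time-invariant "object" latents are paired with T time-varying "frame" latents, each (object, frame, pixel) triple is decoded into an RGB value and a mask logit, and a frame is rendered as the softmax-weighted mixture over objects. The linear placeholder decoder, shapes, and variable names are assumptions made for this example only.

```python
# Sketch of SIMONe-style object/frame latent factorization (assumed shapes/decoder).
import numpy as np

K, T, D = 4, 8, 16            # object slots, frames, latent dimensionality (assumed)
H, W = 32, 32                 # output resolution (assumed)

rng = np.random.default_rng(0)
object_latents = rng.normal(size=(K, D))   # time-invariant, per-object contents
frame_latents = rng.normal(size=(T, D))    # time-varying globals (e.g. viewpoint)

# Pixel coordinate grid shared by every (object, frame) pair.
ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
coords = np.stack([ys, xs], axis=-1).reshape(-1, 2)            # (H*W, 2)

# Placeholder shared decoder: a single linear map to 4 outputs (RGB + mask logit).
decoder_w = rng.normal(size=(2 * D + 2, 4)) * 0.1

video = np.empty((T, H, W, 3))
for t in range(T):
    rgb, logits = [], []
    for k in range(K):
        # Concatenate object latent (same for all t: allocentric), frame latent
        # (same for all k: global), and pixel coordinates.
        code = np.concatenate(
            [np.tile(object_latents[k], (H * W, 1)),
             np.tile(frame_latents[t], (H * W, 1)),
             coords], axis=-1)
        out = code @ decoder_w
        rgb.append(out[:, :3])
        logits.append(out[:, 3:])
    rgb, logits = np.stack(rgb), np.stack(logits)       # (K, H*W, ·)
    weights = np.exp(logits) / np.exp(logits).sum(0)    # per-pixel mixture over K
    video[t] = (weights * rgb).sum(0).reshape(H, W, 3)  # rendered frame t
```

The mixture weights double as per-object, per-pixel segmentation masks, which is how a model of this kind can yield instance segmentation without supervision.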