Empowering agents with a compositional understanding of their environment is a promising next step toward solving long-horizon planning problems. On the one hand, we have seen encouraging progress on variational inference algorithms for obtaining sets of object-centric latent representations ("slots") from unstructured scene observations. On the other hand, generating scenes from slots has received less attention, in part because it is complicated by the lack of a canonical object order. A canonical object order is useful for learning the object correlations necessary to generate physically plausible scenes, much as raster-scan order facilitates learning pixel correlations for pixel-level autoregressive image generation. In this work, we address this gap by learning a fixed object order for a hierarchical variational autoencoder with a single level of autoregressive slots and a global scene prior. We cast autoregressive slot inference as a set-to-sequence modeling problem and introduce an auxiliary loss that trains the slot prior to generate objects in a fixed order. During inference, we align the set of inferred slots to the object order obtained from a slot prior rollout. To ensure that the rolled-out objects are meaningful for the given scene, we condition the prior on an inferred global summary of the input. Experiments on compositional environments, together with ablations, demonstrate that our model with global prior, inference with aligned slot order, and auxiliary loss achieves state-of-the-art sample quality.
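To make the alignment step concrete, the following is a minimal sketch of reordering an unordered set of inferred slots to follow the order of a prior rollout. The abstract does not specify the matching procedure, so everything here is an assumption for illustration: the hypothetical function `align_slots`, the use of Euclidean distance as the matching cost, the exhaustive search over permutations (practical only for a handful of slots), and the toy 2-D slot vectors.

```python
from itertools import permutations
from math import dist


def align_slots(inferred, rollout):
    """Reorder `inferred` so that slot k best matches `rollout[k]`.

    Assumption: matching cost is the summed Euclidean distance between
    paired slot vectors; the actual model may use a different cost.
    """
    if len(inferred) != len(rollout):
        raise ValueError("inferred and rollout must contain the same number of slots")
    # Exhaustive search over all assignments; fine for a few slots only.
    best_perm = min(
        permutations(range(len(inferred))),
        key=lambda perm: sum(dist(inferred[i], r) for i, r in zip(perm, rollout)),
    )
    return [inferred[i] for i in best_perm]


# Toy example: the rollout defines the target order, the inferred set is shuffled.
rollout = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
inferred = [(2.1, 0.1), (0.1, -0.1), (0.9, 1.2)]
aligned = align_slots(inferred, rollout)
print(aligned)
```

For larger slot counts, a polynomial-time assignment solver would be the more natural choice than brute force; the exhaustive version is used here only to keep the sketch dependency-free.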