We present GATSBI, a generative model that transforms a sequence of raw observations into a structured latent representation which captures the spatio-temporal context of the agent's actions. In vision-based decision-making scenarios, an agent faces complex high-dimensional observations in which multiple entities interact with one another. The agent therefore requires a scene representation of the visual observation that discerns essential components and propagates consistently along the time horizon. Our method, GATSBI, utilizes unsupervised object-centric scene representation learning to separate an active agent, a static background, and passive objects. GATSBI then models the interactions that reflect the causal relationships among the decomposed entities and predicts physically plausible future states. Our model generalizes to a variety of environments in which different types of robots and objects dynamically interact with each other. We show that GATSBI achieves superior performance on scene decomposition and video prediction compared to its state-of-the-art counterparts.