Capturing the interactions between entities in a structured way plays a central role in world models that flexibly adapt to changes in the environment. Recent work highlights the benefits of models that explicitly represent the structure of interactions, formulating the problem as discovering local causal structures. In this work, we demonstrate that reliably capturing these relationships in complex settings remains challenging. To remedy this shortcoming, we postulate that sparsity is a critical ingredient for the discovery of such local structures. To this end, we present the SPARse TrANsformer World model (SPARTAN), a Transformer-based world model that learns context-dependent interaction structures between entities in a scene. By applying sparsity regularisation to the attention patterns between object-factored tokens, SPARTAN learns sparse, context-dependent interaction graphs that accurately predict future object states. We further extend our model to adapt to sparse interventions with unknown targets in the dynamics of the environment. This results in a highly interpretable world model that can efficiently adapt to changes. Empirically, we evaluate SPARTAN against the current state of the art in object-centric world models in observation-based environments and demonstrate that our model can learn local causal graphs that accurately reflect the underlying interactions between objects, achieving significantly improved few-shot adaptation to dynamics changes as well as robustness against distractors.
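The core mechanism described above, sparsity-regularised attention between object-factored tokens, can be illustrated with a minimal sketch. This is a hypothetical simplification, not SPARTAN's actual objective: it computes scaled dot-product attention over object tokens and adds an entropy penalty that drives each token's attention pattern toward a sparse (peaked) distribution over the other objects. The function name, the entropy-based regulariser, and the `reg_weight` parameter are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_sparsity(q, k, v, reg_weight=0.1):
    """Scaled dot-product attention over object tokens, plus an
    entropy-based sparsity penalty on the attention pattern.

    Hypothetical sketch: low row-entropy means each token attends to
    only a few objects, i.e. a sparse interaction graph. The paper's
    exact regulariser may differ.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    attn = softmax(scores, axis=-1)  # row i: which objects token i attends to
    # Shannon entropy of each attention row; minimising it encourages sparsity.
    entropy = -(attn * np.log(attn + 1e-9)).sum()
    penalty = reg_weight * entropy   # added to the prediction loss during training
    return attn @ v, attn, penalty
```

Note that a plain L1 penalty would be vacuous here, since each softmax row already sums to one; the entropy term (or alternatives such as sparsemax) is one way to make the attention pattern itself sparse.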