Causal discovery is at the core of human cognition. It enables us to reason about the environment and make counterfactual predictions about unseen scenarios that can vastly differ from our previous experiences. We consider the task of causal discovery from videos in an end-to-end fashion without supervision on the ground-truth graph structure. In particular, our goal is to discover the structural dependencies among environmental and object variables: inferring the type and strength of interactions that have a causal effect on the behavior of the dynamical system. Our model consists of (a) a perception module that extracts a semantically meaningful and temporally consistent keypoint representation from images, (b) an inference module for determining the graph distribution induced by the detected keypoints, and (c) a dynamics module that can predict the future by conditioning on the inferred graph. We assume access to different configurations and environmental conditions, i.e., data from unknown interventions on the underlying system; thus, we can hope to discover the correct underlying causal graph without explicit interventions. We evaluate our method in a planar multi-body interaction environment and scenarios involving fabrics of different shapes like shirts and pants. Experiments demonstrate that our model can correctly identify the interactions from a short sequence of images and make long-term future predictions. The causal structure assumed by the model also allows it to make counterfactual predictions and extrapolate to systems of unseen interaction graphs or graphs of various sizes.
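The three-stage structure described in the abstract (perception, graph inference, graph-conditioned dynamics) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the paper's learned model: `perceive` treats the input as already-extracted keypoint coordinates, `infer_graph` uses a simple heuristic (an edge exists when the distance between two keypoints stays nearly constant), and `predict_next` is a constant-velocity step in which connected keypoints share their velocity. The function names, the `tol` parameter, and the demo data are all hypothetical.

```python
import numpy as np


def perceive(frames):
    """Hypothetical stand-in for the perception module.

    The paper extracts keypoints from images; here the "frames" are assumed
    to already be keypoint coordinates of shape (T, N, 2).
    """
    return np.asarray(frames, dtype=float)


def infer_graph(keypoints, tol=1e-3):
    """Hypothetical stand-in for the inference module.

    Heuristic (an assumption, not the paper's method): connect two keypoints
    when their pairwise distance is nearly constant over time.
    """
    diffs = keypoints[:, :, None, :] - keypoints[:, None, :, :]  # (T, N, N, 2)
    dists = np.linalg.norm(diffs, axis=-1)                       # (T, N, N)
    adj = dists.std(axis=0) < tol                                # (N, N) bool
    np.fill_diagonal(adj, False)                                 # no self-edges
    return adj


def predict_next(keypoints, adj):
    """Hypothetical stand-in for the dynamics module.

    One constant-velocity step, where each keypoint's velocity is averaged
    with the velocities of its graph neighbours.
    """
    velocity = keypoints[-1] - keypoints[-2]
    weights = adj.astype(float) + np.eye(len(adj))   # add self-loops
    weights /= weights.sum(axis=1, keepdims=True)    # row-normalise
    return keypoints[-1] + weights @ velocity


def make_demo_frames(steps=5):
    """Toy data: keypoints 0 and 1 move rigidly together, keypoint 2 is free."""
    t = np.arange(steps, dtype=float)
    p0 = np.stack([t, np.zeros(steps)], axis=1)
    p1 = p0 + np.array([1.0, 0.0])
    p2 = np.stack([np.zeros(steps), 2.0 + 0.5 * t**2], axis=1)
    return np.stack([p0, p1, p2], axis=1)            # (T, 3, 2)


if __name__ == "__main__":
    kp = perceive(make_demo_frames())
    graph = infer_graph(kp)
    print(graph)
    print(predict_next(kp, graph))
```

In this sketch the inferred graph is passed explicitly to the predictor, mirroring the abstract's point that future prediction is conditioned on the discovered structure; swapping in a different adjacency matrix changes the prediction, which is the mechanism that makes counterfactual queries possible.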