Understanding the latent causal factors of a dynamical system from visual observations is a crucial step towards agents reasoning in complex environments. In this paper, we propose CITRIS, a variational autoencoder framework that learns causal representations from temporal sequences of images in which underlying causal factors have possibly been intervened upon. In contrast to the recent literature, CITRIS exploits temporality and observed intervention targets to identify both scalar and multidimensional causal factors, such as 3D rotation angles. Furthermore, by introducing a normalizing flow, CITRIS can be easily extended to leverage and disentangle representations obtained by pretrained autoencoders. Extending previous results on scalar causal factors, we prove identifiability in a more general setting, in which only some components of a causal factor are affected by interventions. In experiments on 3D-rendered image sequences, CITRIS outperforms previous methods on recovering the underlying causal variables. Moreover, using pretrained autoencoders, CITRIS can even generalize to unseen instantiations of causal factors, opening future research areas in sim-to-real generalization for causal representation learning.