Sequential modelling of high-dimensional data is an important problem that appears in many domains including model-based reinforcement learning and dynamics identification for control. Latent variable models applied to sequential data (i.e., latent dynamics models) have been shown to be a particularly effective probabilistic approach to solve this problem, especially when dealing with images. However, in many application areas (e.g., robotics), information from multiple sensing modalities is available -- existing latent dynamics methods have not yet been extended to effectively make use of such multimodal sequential data. Multimodal sensor streams can be correlated in a useful manner and often contain complementary information across modalities. In this work, we present a self-supervised generative modelling framework to jointly learn a probabilistic latent state representation of multimodal data and the respective dynamics. Using synthetic and real-world datasets from a multimodal robotic planar pushing task, we demonstrate that our approach leads to significant improvements in prediction and representation quality. Furthermore, we compare to the common learning baseline of concatenating each modality in the latent space and show that our principled probabilistic formulation performs better. Finally, despite being fully self-supervised, we demonstrate that our method is nearly as effective as an existing supervised approach that relies on ground truth labels.
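To make the contrast with the concatenation baseline concrete, the sketch below compares naive latent concatenation with one common probabilistic fusion scheme, a product of Gaussian experts. This is an illustrative assumption, not the paper's actual formulation: the encoder outputs, modality names, and the product-of-experts rule are all hypothetical stand-ins for whatever the method learns.

```python
import numpy as np

# Hypothetical per-modality Gaussian posteriors q(z | x_m), e.g. from an
# image encoder and a force/torque encoder (values are illustrative).
mu_img, var_img = np.array([0.5, -1.0]), np.array([0.2, 0.5])
mu_ft, var_ft = np.array([0.7, -0.8]), np.array([0.1, 1.0])

# Baseline: concatenate each modality's latent into one larger state.
# The state dimension grows with the number of modalities, and no
# modality's uncertainty informs the others.
z_concat = np.concatenate([mu_img, mu_ft])  # shape (4,)

# Probabilistic alternative (product of experts): fuse the Gaussians
# into a single posterior over a shared latent state by summing
# precisions (inverse variances) and precision-weighting the means.
precision = 1.0 / var_img + 1.0 / var_ft
var_poe = 1.0 / precision
mu_poe = var_poe * (mu_img / var_img + mu_ft / var_ft)

# The fused variance is never larger than either input variance:
# each modality can only add information about the shared state.
```

Under this fusion rule, a modality that is confident (low variance) in some latent dimension dominates the fused estimate there, which is one way complementary sensor streams can be exploited rather than merely stacked.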