Temporal event representations are an essential aspect of human learning. They allow succinct encoding of the experiences we acquire through a variety of sensory inputs. Moreover, they are believed to be arranged hierarchically, allowing for efficient representation of complex long-horizon experiences. Additionally, these representations are acquired in a self-supervised manner. Analogously, we propose a model that learns temporal representations from long-horizon visual demonstration data and associated textual descriptions, without explicit temporal supervision. Our method produces a hierarchy of representations that align more closely with ground-truth human-annotated events (+15.3) than state-of-the-art unsupervised baselines. Our results are comparable to those of heavily supervised baselines on complex visual domains such as the Chess Openings, YouCook2, and TutorialVQA datasets. Finally, we perform ablation studies illustrating the robustness of our approach. We release our code and demo visualizations in the Supplementary Material.