Grounding Spatio-Temporal Language with Transformers

Language is an interface to the outside world. For embodied agents to use it, language must be grounded in other, sensorimotor modalities. While there is an extensive literature on how machines can learn grounded language, how to learn spatio-temporal linguistic concepts remains largely uncharted. To make progress in this direction, we introduce a novel spatio-temporal language grounding task where the goal is to learn the meaning of spatio-temporal descriptions of behavioral traces of an embodied agent. This is achieved by training a truth function that predicts whether a description matches a given history of observations. The descriptions involve time-extended predicates in past and present tense as well as spatio-temporal references to objects in the scene. To study the role of architectural biases in this task, we train several models, including multimodal Transformer architectures; the latter implement different attention computations between words and objects across space and time. We test models on two classes of generalization: 1) generalization to randomly held-out sentences; 2) generalization to grammar primitives. We observe that maintaining object identity in the attention computation of our Transformers is instrumental to achieving good generalization performance overall, and that summarizing object traces in a single token has little influence on performance. We then discuss how this opens new perspectives for language-guided autonomous embodied agents. We also release our code under an open-source license, together with pretrained models and datasets, to encourage the wider community to build upon and extend our work.
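
To make the notion of a truth function concrete, the following is a minimal sketch, not the architecture used in this work. It assumes words and per-object, per-timestep observations are embedded as vectors of equal size, tags every token with a one-hot object identity and a normalized timestamp, and runs one self-attention layer with untrained random weights. All names (`build_tokens`, `truth_function`, `random_params`) and the token layout are hypothetical and chosen only for illustration.

```python
import math
import random


def build_tokens(word_vecs, traces):
    """Build one token sequence from words and object traces.

    word_vecs: list of W word embeddings, each a list of d floats.
    traces: traces[i][t] is the feature vector (d floats) of object i at step t.
    Every token is extended with a one-hot identity slot (slot 0 is shared by
    all words, slot i + 1 belongs to object i) and a normalized timestamp.
    This layout is an assumption made for this sketch.
    """
    n_objects = len(traces)
    n_steps = len(traces[0])
    tokens = []
    for vec in word_vecs:
        identity = [1.0] + [0.0] * n_objects
        tokens.append(list(vec) + identity + [0.0])
    for i, trace in enumerate(traces):
        for t, obs in enumerate(trace):
            identity = [0.0] * (n_objects + 1)
            identity[i + 1] = 1.0
            tokens.append(list(obs) + identity + [t / max(1, n_steps - 1)])
    return tokens


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all tokens."""
    queries = [matvec(w_q, tok) for tok in tokens]
    keys = [matvec(w_k, tok) for tok in tokens]
    values = [matvec(w_v, tok) for tok in tokens]
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def truth_function(word_vecs, traces, params):
    """Return an estimated probability that the description matches the traces."""
    tokens = build_tokens(word_vecs, traces)
    attended = self_attention(tokens, params["w_q"], params["w_k"], params["w_v"])
    pooled = [sum(col) / len(attended) for col in zip(*attended)]  # mean over tokens
    logit = sum(w * x for w, x in zip(params["w_out"], pooled)) + params["b_out"]
    return 1.0 / (1.0 + math.exp(-logit))


def random_params(token_dim, hidden_dim, seed=0):
    """Untrained placeholder weights; a real model would learn these."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "w_q": mat(hidden_dim, token_dim),
        "w_k": mat(hidden_dim, token_dim),
        "w_v": mat(hidden_dim, token_dim),
        "w_out": [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)],
        "b_out": 0.0,
    }


# Toy inputs: 3 word embeddings, 2 objects observed over 5 timesteps, d = 4.
rng = random.Random(1)
words = [[rng.random() for _ in range(4)] for _ in range(3)]
traces = [[[rng.random() for _ in range(4)] for _ in range(5)] for _ in range(2)]
# Token size = d + (n_objects + 1) identity slots + 1 timestamp = 4 + 3 + 1.
params = random_params(token_dim=4 + 3 + 1, hidden_dim=8)
p = truth_function(words, traces, params)
```

With random weights the returned probability carries no meaning; in a trained model it would be fit with a binary classification loss on (description, trace, match) triples. The identity slot is one simple way to let attention distinguish which object a token belongs to, which is the property the abstract reports as important for generalization.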
