Distributional models learn representations of words from text, but are criticized for their lack of grounding, or the linking of text to the non-linguistic world. Grounded language models have had success in learning to connect concrete categories like nouns and adjectives to the world via images and videos, but can struggle to isolate the meaning of the verbs themselves from the context in which they typically occur. In this paper, we investigate the extent to which trajectories (i.e. the position and rotation of objects over time) naturally encode verb semantics. We build a procedurally generated agent-object-interaction dataset, obtain human annotations for the verbs that occur in this data, and compare several methods for representation learning given the trajectories. We find that trajectories correlate as-is with some verbs (e.g., fall), and that additional abstraction via self-supervised pretraining can further capture nuanced differences in verb meaning (e.g., roll vs. slide).
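To make the core idea concrete, here is a minimal sketch of how raw trajectories can separate such verbs, assuming a simplified setting with 3D positions and a single rotation angle per timestep (the paper's trajectories use full object rotations, and the names `trajectory_features` and `label_motion` are hypothetical illustrations, not the paper's method):

```python
import numpy as np

def trajectory_features(positions, yaw_angles):
    """Summarize a trajectory with three simple motion features.

    positions  : (T, 3) array of xyz coordinates per timestep
    yaw_angles : (T,) rotation about one axis, in radians (simplification)
    """
    dz = positions[-1, 2] - positions[0, 2]                        # net vertical displacement
    horiz = np.linalg.norm(positions[-1, :2] - positions[0, :2])   # net horizontal travel
    spin = np.abs(np.diff(yaw_angles)).sum()                       # total accumulated rotation
    return dz, horiz, spin

def label_motion(positions, yaw_angles, fall_thresh=-0.5, spin_thresh=1.0):
    """Toy heuristic: falling shows up as vertical drop; rolling differs
    from sliding mainly in how much the object rotates while translating."""
    dz, horiz, spin = trajectory_features(positions, yaw_angles)
    if dz < fall_thresh:
        return "fall"
    if horiz > 0.1:
        return "roll" if spin > spin_thresh else "slide"
    return "other"
```

The thresholds here are arbitrary placeholders; the point is only that "fall" is visible in the raw positions, while "roll" vs. "slide" requires attending to rotation as well as translation, which is the kind of nuance the abstract attributes to the learned representations.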