Lexical semantics and cognitive science point to affordances (i.e., the actions that objects support) as critical for understanding and representing nouns and verbs. However, the study of these semantic features has not yet been integrated with the "foundation" models that currently dominate language representation research. We hypothesize that predictive modeling of object state over time will result in representations that encode object affordance information "for free". We train a neural network to predict objects' trajectories in a simulated interaction and show that our network's latent representations differentiate between both observed and unobserved affordances. We find that models trained using 3D simulations from our SPATIAL dataset outperform conventional 2D computer vision models trained on a similar task, and, on initial inspection, that differences between concepts correspond to expected features (e.g., roll entails rotation). Our results suggest a way in which modern deep learning approaches to grounded language learning can be integrated with traditional formal semantic notions of lexical representations.