Human experience of the world is profoundly multimodal from the beginning, so why do existing state-of-the-art language models use only text as a modality for learning and representing semantic meaning? In this paper we review the literature on the role of embodiment and emotion in the interactive setting of spoken dialogue as necessary prerequisites for language learning in human children, including how the words in children's vocabularies are largely concrete at first and become more abstract as children get older. We sketch a model of semantics that leverages current transformer-based models and a word-level grounded model, then describe the robot-dialogue system that will make use of our semantic model, the setting in which the system will learn language, and existing benchmarks for evaluation.