Human infants learn language while interacting with their environment, in which their caregivers may describe the objects and the actions performed on them. Similarly, artificial agents can learn language while interacting with their environment. In this work, we first present a neural model that bidirectionally binds robot actions and their language descriptions in a simple object manipulation scenario. Building on our previous Paired Variational Autoencoders (PVAE) model, we demonstrate the superiority of the variational autoencoder over the standard autoencoder by experimenting with cubes of different colours and by enabling the production of alternative vocabularies. Additional experiments show that the model's channel-separated visual feature extraction module can cope with objects of different shapes. Next, we introduce PVAE-BERT, which equips the model with a pretrained large-scale language model, namely Bidirectional Encoder Representations from Transformers (BERT), enabling the model to go beyond comprehending only the predefined descriptions on which the network has been trained; the recognition of action descriptions generalises to unconstrained natural language, as the model becomes capable of understanding unlimited variations of the same descriptions. Our experiments suggest that using a pretrained language model as the language encoder allows our approach to scale up to real-world scenarios with instructions from human users.
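To make the PVAE-BERT idea concrete, the sketch below shows how a pretrained BERT model could serve as the language encoder, mapping free-form action descriptions to fixed-size sentence embeddings. This is a minimal illustration, not the authors' exact implementation: the choice of the `bert-base-uncased` checkpoint, the example descriptions, and the use of the [CLS] token embedding as the sentence representation are assumptions.

```python
# Minimal sketch: encoding action descriptions with a pretrained BERT
# (assumes the HuggingFace `transformers` library and PyTorch are installed).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

# Hypothetical example descriptions; unconstrained paraphrases of the same
# command can be encoded in exactly the same way.
descriptions = [
    "push the red cube slowly",
    "slowly shove the red block forward",
]

batch = tokenizer(descriptions, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = bert(**batch)

# Take the [CLS] token's hidden state as a fixed-size sentence embedding,
# which a downstream action-language binding model could consume.
sentence_embeddings = outputs.last_hidden_state[:, 0, :]
print(sentence_embeddings.shape)  # torch.Size([2, 768])
```

Because both descriptions are paraphrases of the same command, a well-trained downstream model would be expected to map their embeddings to the same action, which is the kind of generalisation to unconstrained natural language described above.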