Human language learners are exposed to a trickle of informative, context-sensitive language, but a flood of raw sensory data. Through both social language use and internal processes of rehearsal and practice, language learners are able to build high-level, semantic representations that explain their perceptions. Here, we take inspiration from such processes of "inner speech" in humans (Vygotsky, 1934) to better understand the role of intra-agent speech in embodied behavior. First, we formally pose intra-agent speech as a semi-supervised problem and develop two algorithms that enable visually grounded captioning with little labeled language data. We then experimentally compute scaling curves over different amounts of labeled data and compare their data efficiency against a supervised learning baseline. Finally, we incorporate intra-agent speech into an embodied, mobile manipulator agent operating in a 3D virtual world, and show that with as few as 150 additional image captions, intra-agent speech endows the agent with the ability to manipulate and answer questions about a new object without any related task-directed experience (zero-shot). Taken together, our experiments suggest that modelling intra-agent speech is effective in enabling embodied agents to learn new tasks efficiently and without direct interaction experience.
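The abstract does not spell out the two semi-supervised algorithms, so as a minimal illustration of the general setup, the sketch below implements one standard approach, self-training: a captioner is first fit on a handful of labeled image-caption pairs, then generates pseudo-captions ("inner speech") for unlabeled images, which are fed back as additional training targets. Every component here (the linear image encoder, GRU decoder, vocabulary size, and random toy data) is a hypothetical stand-in, not the paper's architecture.

```python
# Hypothetical sketch of semi-supervised captioning via self-training;
# not the paper's exact algorithms. All names and sizes are toy assumptions.
import torch
import torch.nn as nn

VOCAB, EMBED, HIDDEN = 100, 32, 64  # toy vocabulary and layer sizes

class Captioner(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(128, HIDDEN)  # stand-in for a visual encoder
        self.embed = nn.Embedding(VOCAB, EMBED)
        self.rnn = nn.GRU(EMBED, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def forward(self, images, tokens):
        h0 = self.encoder(images).unsqueeze(0)  # image features seed the decoder state
        x, _ = self.rnn(self.embed(tokens), h0)
        return self.out(x)                      # per-step vocabulary logits

    @torch.no_grad()
    def pseudo_label(self, images, length=8):
        # Greedy decoding: the model captions unlabeled images for itself.
        tokens = torch.zeros(images.size(0), 1, dtype=torch.long)  # BOS id = 0
        for _ in range(length):
            logits = self.forward(images, tokens)
            tokens = torch.cat([tokens, logits[:, -1:].argmax(-1)], dim=1)
        return tokens[:, 1:]

def step(model, opt, images, tokens):
    # Teacher forcing: predict token t+1 from tokens up to t.
    logits = model(images, tokens[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

model = Captioner()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
labeled_imgs = torch.randn(16, 128)                     # few labeled examples
labeled_caps = torch.randint(0, VOCAB, (16, 9))         # BOS + 8 tokens
unlabeled_imgs = torch.randn(64, 128)                   # many unlabeled images

for epoch in range(3):
    step(model, opt, labeled_imgs, labeled_caps)        # supervised on the small set
    pseudo = model.pseudo_label(unlabeled_imgs)         # self-generated captions
    bos = torch.zeros(64, 1, dtype=torch.long)
    step(model, opt, unlabeled_imgs, torch.cat([bos, pseudo], dim=1))
```

In practice a self-training loop of this kind would typically filter pseudo-captions by model confidence before reusing them; the sketch omits that for brevity.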