Agents that can follow language instructions are expected to be useful in a variety of settings, such as navigation. However, training neural network-based agents requires a large number of paired trajectories and language instructions. This paper proposes using multimodal generative models for semi-supervised learning in instruction-following tasks. The models learn a shared representation of the paired data and enable semi-supervised learning by reconstructing unpaired data through that representation. The key challenges in applying these models to sequence-to-sequence tasks, including instruction following, are learning a shared representation of variable-length multimodal data and incorporating attention mechanisms. To address these problems, this paper proposes a novel network architecture that absorbs the difference in sequence lengths between the modalities. In addition, to further improve performance, this paper shows how to combine the generative model-based approach with an existing semi-supervised method, the speaker-follower model, and proposes a regularization term that improves inference using unpaired trajectories. Experiments in the BabyAI and Room-to-Room (R2R) environments show that the proposed method improves instruction-following performance by leveraging unpaired data, and improves the performance of the speaker-follower model by 2% to 4% in R2R.