Speaker-follower models have proven effective in vision-and-language navigation, where a speaker model synthesizes new instructions to augment the training data for a follower navigation model. However, in many previous methods, the generated instructions are not directly trained to optimize the performance of the follower. In this paper, we present \textsc{foam}, a \textsc{Fo}llower-\textsc{a}ware speaker \textsc{M}odel that is constantly updated given follower feedback, so that the generated instructions are better suited to the current learning state of the follower. Specifically, we optimize the speaker within a bi-level optimization framework and obtain its training signals by evaluating the follower on labeled data. Experimental results on the Room-to-Room and Room-across-Room datasets demonstrate that our method outperforms strong baseline models across settings. Analyses also reveal that our generated instructions are of higher quality than those of the baselines.
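The bi-level update described above can be illustrated with a minimal toy sketch: the follower takes a gradient step on speaker-generated data (inner level), and the speaker is then updated to reduce the follower's loss on labeled data (outer level). This is a hypothetical scalar-parameter example, not the paper's actual architecture; the finite-difference meta-gradient stands in for whatever gradient estimator the full method uses.

```python
# Toy bi-level optimization sketch (hypothetical scalar "models").
# Inner level: the follower f fits the speaker's generated instructions.
# Outer level: the speaker s is updated so that the post-update follower
# performs well on labeled data (target).

def inner_loss(f, s):
    # Follower imitates speaker-generated data.
    return (f - s) ** 2

def val_loss(f, target):
    # Follower evaluated on labeled (ground-truth) data.
    return (f - target) ** 2

def inner_step(f, s, lr=0.3):
    # One gradient step of the follower on the inner loss.
    grad_f = 2.0 * (f - s)
    return f - lr * grad_f

def speaker_update(s, f, target, lr_out=0.2, eps=1e-4):
    # Meta-gradient of the follower's validation loss w.r.t. the speaker,
    # approximated here by central finite differences for simplicity.
    v_plus = val_loss(inner_step(f, s + eps), target)
    v_minus = val_loss(inner_step(f, s - eps), target)
    grad_s = (v_plus - v_minus) / (2.0 * eps)
    return s - lr_out * grad_s

s, f, target = 0.0, 0.0, 1.0
for _ in range(200):
    f = inner_step(f, s)               # follower trains on speaker data
    s = speaker_update(s, f, target)   # speaker adapts to follower feedback
```

In this toy setting both parameters converge toward the labeled target: the speaker learns to generate data whose imitation drives the follower toward good validation performance, which is the intuition behind training the speaker from follower feedback.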