In Vision-and-Language Navigation (VLN), researchers typically take an image encoder pre-trained on ImageNet without fine-tuning on the environments in which the agent will be trained or tested. However, the distribution shift between the ImageNet training images and the views in the navigation environments may render the ImageNet pre-trained image encoder suboptimal. Therefore, in this paper, we design a set of structure-encoding auxiliary tasks (SEA) that leverage the data in the navigation environments to pre-train and improve the image encoder. Specifically, we design and customize (1) 3D jigsaw, (2) traversability prediction, and (3) instance classification to pre-train the image encoder. Through rigorous ablations, our SEA pre-trained features are shown to better encode structural information of the scenes, which ImageNet pre-trained features fail to encode properly but which is crucial for the target navigation task. The SEA pre-trained features can be easily plugged into existing VLN agents without any tuning. For example, on Test-Unseen environments, VLN agents combined with our SEA pre-trained features achieve absolute success rate improvements of 12% for Speaker-Follower, 5% for Env-Dropout, and 4% for AuxRN.