Vision-and-language navigation (VLN) is a challenging task that requires an agent to navigate real-world environments by understanding natural language instructions and visual information received in real time. Prior works have implemented VLN tasks in continuous environments or on physical robots, all of which use a fixed camera configuration imposed by the limitations of the datasets, e.g., a height of 1.5 meters and a 90-degree horizontal field of view (HFOV). However, real-life robots built for different purposes have diverse camera configurations, and the large gap in visual information makes it difficult to directly transfer a learned navigation model between robots. In this paper, we propose a visual perception generalization strategy based on meta-learning, which enables the agent to adapt quickly to a new camera configuration with only a few shots. In the training phase, we first localize the generalization problem to the visual perception module, and then compare two meta-learning algorithms for better generalization in seen and unseen environments. One applies the Model-Agnostic Meta-Learning (MAML) algorithm, which requires a few-shot adaptation phase, and the other is a metric-based meta-learning method with a feature-wise affine transformation layer. The experimental results show that our strategy successfully adapts the learned navigation model to a new camera configuration, and that the two algorithms show their respective advantages in seen and unseen environments.
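To make the few-shot adaptation idea concrete, the following is a minimal first-order MAML-style sketch, not the paper's actual implementation: a small visual perception module is cloned and fine-tuned on a handful of images from a new camera configuration (inner loop), and the query-set gradients of the adapted copy are applied back to the meta-parameters (outer loop). The module architecture, feature dimension, loss, and task-batch format are illustrative assumptions.

```python
# Sketch of first-order MAML adaptation of a visual perception module
# to a new camera configuration (illustrative only; names and shapes are assumed).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerceptionModule(nn.Module):
    """Toy stand-in for the visual perception module: maps an RGB image to a feature vector."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

def adapt_to_camera(meta_model, support_imgs, support_targets, inner_steps=5, inner_lr=1e-2):
    """Inner loop: clone the meta-learned module and fine-tune it on a few support
    examples captured with the new camera configuration."""
    adapted = copy.deepcopy(meta_model)
    opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        loss = F.mse_loss(adapted(support_imgs), support_targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return adapted

def meta_train_step(meta_model, meta_opt, task_batch):
    """Outer loop (first-order approximation): accumulate query-set gradients of each
    adapted copy onto the meta-parameters, then take one meta-optimizer step."""
    meta_opt.zero_grad()
    for support_imgs, support_targets, query_imgs, query_targets in task_batch:
        adapted = adapt_to_camera(meta_model, support_imgs, support_targets)
        query_loss = F.mse_loss(adapted(query_imgs), query_targets)
        grads = torch.autograd.grad(query_loss, adapted.parameters())
        for p, g in zip(meta_model.parameters(), grads):
            p.grad = g if p.grad is None else p.grad + g
    meta_opt.step()
```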