Building an agent that can understand natural-language instructions and carry out the corresponding actions in the visual world is one of the long-term challenges of Artificial Intelligence (AI). Because human instructions are so varied, the agent must be able to link natural language to vision and action in unstructured, previously unseen environments. When the instruction given by a human is a navigation task, this challenge is called Visual-and-Language Navigation (VLN). VLN is a fast-growing, multidisciplinary field of increasing importance and considerable practical value. Instead of focusing on the details of specific methods, this paper provides a comprehensive survey of VLN tasks and classifies them according to the characteristics of their language instructions. According to when the instructions are given, tasks can be divided into single-turn and multi-turn. Single-turn tasks are further divided into goal-oriented and route-oriented tasks, based on whether the instructions contain a route. Multi-turn tasks are divided into imperative and interactive tasks, based on whether the agent responds to the instructions. This taxonomy enables researchers to better grasp the key points of a specific task and to identify directions for future research.
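The taxonomy described above amounts to a two-level decision: first by when the instructions are given (single-turn or multi-turn), then by one further property of the instructions or of the agent. A minimal Python sketch of that decision follows. The names `TaskProfile`, `TaskCategory`, and `classify`, and the boolean fields, are hypothetical and introduced only for illustration; they do not come from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class TaskCategory(Enum):
    """The four leaf categories of the taxonomy (labels are illustrative)."""
    GOAL_ORIENTED = "single-turn, goal-oriented"
    ROUTE_ORIENTED = "single-turn, route-oriented"
    IMPERATIVE = "multi-turn, imperative"
    INTERACTIVE = "multi-turn, interactive"


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical description of a VLN task; all fields are assumptions."""
    multi_turn: bool                          # instructions arrive over several turns
    instruction_contains_route: bool = False  # only consulted for single-turn tasks
    agent_responds: bool = False              # only consulted for multi-turn tasks


def classify(profile: TaskProfile) -> TaskCategory:
    """Map a task profile to one of the four taxonomy categories."""
    if profile.multi_turn:
        # Multi-turn tasks split on whether the agent responds to the instructions.
        if profile.agent_responds:
            return TaskCategory.INTERACTIVE
        return TaskCategory.IMPERATIVE
    # Single-turn tasks split on whether the instructions contain a route.
    if profile.instruction_contains_route:
        return TaskCategory.ROUTE_ORIENTED
    return TaskCategory.GOAL_ORIENTED
```

For example, under these assumptions a task whose single instruction spells out the path to follow would be described as `TaskProfile(multi_turn=False, instruction_contains_route=True)` and classified as route-oriented.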