Recent advances in Multimodal Machine Learning and Artificial Intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing, and Robotics. Whereas many approaches and prior surveys have characterised one or two of these dimensions, there has not been a holistic analysis centred on all three. Moreover, even when combinations of these topics are considered, more focus is placed on describing, e.g., current architectural methods, than on also illustrating the high-level challenges and opportunities for the field. In this survey paper, we discuss Embodied Vision-Language Planning (EVLP) tasks, a family of prominent embodied navigation and manipulation problems that jointly leverage computer vision and natural language for interaction in physical environments. We propose a taxonomy to unify these tasks and provide an in-depth analysis and comparison of the current and emerging algorithmic approaches, metrics, simulators, and datasets used for EVLP tasks. Finally, we present the core challenges that we believe new EVLP works should seek to address, and we advocate for task construction that enables model generalisability and furthers real-world deployment.